Ran across a really clever algorithm for integer multiplication by a power of 10.
C source code:
The first thing to notice is there isn’t a single multiplication. This function is done entirely with shifts (multiplying by powers-of-two) and adds. The function does work, but what exactly is going on?
The algorithm takes advantage of this relationship: 10n = 2n·5n. The first left shift is the multiplication by 2n and the following loop performs the multiply by 5n. The multiply by 2n is fairly obvious, but the loop is more complicated. Start with noting that 5x = 4x + x which is x + (x << 2). This is done n times by the loop. Mathematically what we have is the following equivalence:
The while loop doesn't look like this as written, but keep in mind following lines are identical operations:
Putting it all together the function is:
At the assembly level the function can be implemented in just a few instructions.
Intel x86 (32-bit)
sal eax, cl # value <<= power
lea eax, [eax+eax*4] # value = value + value * 4
sub ecx, 1 # power -= 1
Atmel AVR (8-bit)
mov r0,r22 # value <<= power
# value += value << 2;
subi r22,lo8(-(-1)) # power -= 1
lsl r0, r0, r1
add r0, r0, r0, lsl #2
subs r1, r1, #1
lsl r12, r12, r11
sub r11, 1
add r12, r12, r12 << 2
cp.w r11, 0
The 32-bit assembly is just a few instructions (assuming power is not zero). The 8-bit Atmel assembly is larger for a couple reasons. Firstly, because this is an 8-bit processors, 16-bit integers operations require two instruction for each shift and add. Second, the AVR instruction set can only shift one bit at a time. Although the ARM is a RISC processor as and the Intel CISC, the number of ARM and x86 instructions are identical. Unlike the ARM’s subs instruction, the 32-bit AVR’s sub instruction doesn’t set flags requiring an extra compare instruction. Otherwise all three 32-bit implementations are equal.
While this technique is clever, modern CPUs have very fast multiply instructions. The ARM requires 1 to 2 clocks for a multiply. Intel’s multiplication takes 2 operations when operating on registers (the exact timing is more complex). Thus a loop simply multiplying by 10 would be faster. Still, this algorithm is interesting and was fun to investigate.