Ran across a really clever algorithm for integer multiplication by a power of 10.
C source code:
{
value <<= power;
while ( 0 != power )
{
value += value << 2;
power = 1;
}
return value;
}
The first thing to notice is there isn’t a single multiplication. This function is done entirely with shifts (multiplying by powersoftwo) and adds. The function does work, but what exactly is going on?
The algorithm takes advantage of this relationship: 10^{n} = 2^{n}·5^{n}. The first left shift is the multiplication by 2^{n} and the following loop performs the multiply by 5^{n}. The multiply by 2^{n} is fairly obvious, but the loop is more complicated. Start with noting that 5x = 4x + x which is x + (x << 2). This is done n times by the loop. Mathematically what we have is the following equivalence:
The while loop doesn't look like this as written, but keep in mind following lines are identical operations:
value += 4 * value;
value = value + 4 * value;
value = value * ( 1 + 4 );
value = value * 5;
Putting it all together the function is:
At the assembly level the function can be implemented in just a few instructions.
Intel x86 (32bit) sal eax, cl # value <<= power
L4: lea eax, [eax+eax*4] # value = value + value * 4 sub ecx, 1 # power = 1 jne L4 
Atmel AVR (8bit) mov r0,r22 # value <<= power
rjmp 2f 1: lsl r24 rol r25 2: dec r22 brpl 1b .L4: # value += value << 2; mov r18,r24 mov r19,r25 lsl r18 rol r19 lsl r18 rol r19 add r24,r18 adc r25,r19 subi r22,lo8((1)) # power = 1 brne .L4 
lsl r0, r0, r1
.L2: add r0, r0, r0, lsl #2 subs r1, r1, #1 bne .L2 

lsl r12, r12, r11
.L2: sub r11, 1 add r12, r12, r12 << 2 cp.w r11, 0 brne .L2 
The 32bit assembly is just a few instructions (assuming power is not zero). The 8bit Atmel assembly is larger for a couple reasons. Firstly, because this is an 8bit processors, 16bit integers operations require two instruction for each shift and add. Second, the AVR instruction set can only shift one bit at a time. Although the ARM is a RISC processor as and the Intel CISC, the number of ARM and x86 instructions are identical. Unlike the ARM’s subs instruction, the 32bit AVR’s sub instruction doesn’t set flags requiring an extra compare instruction. Otherwise all three 32bit implementations are equal.
While this technique is clever, modern CPUs have very fast multiply instructions. The ARM requires 1 to 2 clocks for a multiply. Intel’s multiplication takes 2 operations when operating on registers (the exact timing is more complex). Thus a loop simply multiplying by 10 would be faster. Still, this algorithm is interesting and was fun to investigate.