I was evaluating some pseudorando m number generator (PRNG) and started wondering about compiler optimization. Many modern processors have a circular shift instruction, but languages such as C have no direct support for this. As a result, optimization has been designed to recognize circular shifts.

First, a circular shift:

As noted most C compiler recognize this line of code as a circular shift and are able to implement the circular shift instruction rather than using two shifts and an OR.

I analyzed one of the smallest, fastest, and highest entropy quality PRNGs I tested was Xoroshiro128+. It is able to produce 1 TB of random data in 3 minutes (on a 3 GHz Core i7). The algorithm is quite simple:

state1 = state0 ^ state1

state0 = ( state0 <<< 55 ) ^ state1 ^ ( state1 << 14 )

state1 = ( state1 <<< 36 )

All values are 64-bit. The triple less than sign here means a circular shift left. The algorithm has a 128-bit state, meaning a period of 2^{128} (340 undecillion, or 340 trillion trillion trillion). On a native 64-bit machine the assembly is quite small.

64-bit x86 |
64-bit ARM |

mov rax, QWORD PTR [rcx] mov r9, QWORD PTR 8[rcx] mov rdx, rax mov r8, rax add rax, r9 xor rdx, r9 ror r8, 9 mov r10, rdx xor r8, rdx ror rdx, 28 sal r10, 14 mov QWORD PTR 8[rcx], rdx xor r8, r10 mov QWORD PTR [rcx], r8 ret |
mov x3, x0 ldp x2, x0, [x0] eor x1, x2, x0 add x0, x2, x0 eor x2, x1, x2, ror 9 ror x4, x1, 28 eor x1, x2, x1, lsl 14 stp x1, x4, [x3] ret |

Note that although the ARM assembly is less instructions, the x86 assembly can do several steps in parallel. I found this was an interesting exploration of both an algorithm and of modern C compilers.