Andrew Que

February 03, 2010

Prime number - Part 2

My last article on prime numbers led me to ask: how many prime numbers are there between 2 and 2^32? I know the quick answer to this question is "a lot". 2^32 is over four billion, and that's a fair bit of space to check. It would be easy to make a for-loop and brute force the entire number range, but this is the age of multi-core CPUs, and single threaded is so 2000. Besides, my computer hasn't broken a sweat in a while.

I decided to do a straight C implementation. I've been doing C and C++ at work, and C++ has been irritating me lately because I tried to do a "quick and dirty" implementation without putting much thought into the classes I'd need. That's a recipe for disaster, and how you end up with things like Microsoft's Windows ME. So vanilla C it was.

For threads and semaphores I used the POSIX standards, and I've tried to demonstrate how to make a dispatcher foreground task with one or more worker threads. The key to this is the dispatch semaphore. Normally, semaphores are used as mutual exclusion locks, or to send signals between threads. However, they can also be used as a count. Think of a library that has a set number (say 5) of copies of a book. That means 5 people can check that book out, but if a 6th person wants to check it out, they will have to wait for one of the first 5 to return their copy. The semaphore does this by keeping a count. When that count reaches zero, the next task that tries to pend the semaphore will have to wait.

So the dispatcher is a loop that starts by pending the semaphore. When it gets the semaphore, it will create a task. The task will do some work, then post the semaphore. The semaphore count determines how many tasks can run at the same time. If we limit the number of tasks that can run at once to the number of CPU cores, then we can utilize the CPU to 100%.

First, the dispatcher skeleton code:

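  pthread_t Threads[ NUMBER_OF_THREADS ];

  // The second argument is "pshared".  Zero means the semaphore is only
  // shared between the threads of this process.
  sem_init( &Semaphore, 0, NUMBER_OF_THREADS );

  unsigned ThreadIndex = 0;

  while ( NumbersLeft )
  {
    // Wait for a free worker thread.
    sem_wait( &Semaphore );

    // Create a worker thread to check this number set.
    pthread_create
    (
      &Threads[ ThreadIndex ],
      NULL,
      PrimeThread,
      (void *)&Data[ ThreadIndex ]
    );

    // Move to the next thread slot, wrapping around at the end.
    ++ThreadIndex;
    if ( ThreadIndex >= NUMBER_OF_THREADS )
      ThreadIndex = 0;
  }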

Then, the skeleton of the worker thread:

static void * PrimeThread( void * ArgumentPointer )
{
  // Do prime check.

  // This thread is now complete.  Release one count from the dispatch
  // semaphore.
  sem_post( &Semaphore );

  // End this thread.
  pthread_exit( 0 );

  return 0;
}


Now just launching a thread to check one number is a bit of a waste—it takes some overhead to start and stop the task. So we give each thread some range of numbers to check. A pthread can be passed parameters, and we pass the starting number, along with the amount of numbers to check. The parameter block is also used for storage of the results. So here is the full worker thread:

static void * PrimeThread( void * ArgumentPointer )
{
  // Get the work data passed to the thread.
  WORK_TYPE * Data = (WORK_TYPE *)ArgumentPointer;
  uint32_t Number = Data->StartNumber;
  uint32_t Count = 0;
  unsigned Index;

  // For all the numbers to check...
  for ( Index = 0; Index < Data->NumberToCheck; ++Index )
  {
    // Is this number a prime?
    if ( IsPrime( Number ) )
      // Then count it.
      ++Count;

    // Next number.
    ++Number;
  }

  // Save results.
  Data->NumberFound = Count;

  // This thread is now complete.  Release one count from the dispatch
  // semaphore.
  sem_post( &Semaphore );

  // End this thread.
  pthread_exit( 0 );

  // Never reached--here for language consistency.
  return 0;

} // PrimeThread

The dispatcher has to accumulate the results. One could try using a global variable, but you can run into issues with non-atomic access. Adding 1 might seem a basic operation, but (at least in a RISC system) it takes 3 instructions to add one to a variable in memory: load the data into a register, add 1 to the register, and store the register. You cannot (without a semaphore) guarantee that those instructions will not be interrupted by another thread. So, the accumulation happens in just one place: the dispatch loop.

The dispatcher will assign all the work, but its loop is finished as soon as the last piece of work has been assigned. We still need to wait for the work to finish. So the last part of the process looks like this:

  for ( ThreadIndex = 0; ThreadIndex < NUMBER_OF_THREADS; ++ThreadIndex )
  {
    if ( Data[ ThreadIndex ].NumberToCheck )
    {
      // Wait for thread to finish.
      pthread_join( Threads[ ThreadIndex ], NULL );

      // Accumulate the number of primes found in this thread.
      NumberOfPrimes += Data[ ThreadIndex ].NumberFound;
    }
  }

  // Let go of dispatch semaphore.
  sem_destroy( &Semaphore );

Now we can print the results to the screen, and we're done.
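To see the pieces working together, here is a minimal, self-contained sketch of the same dispatcher pattern. It is not the program that produced the numbers in this article: CountPrimes, Worker, IsPrimeSlow, the WORK_TYPE layout and the four-thread limit are all assumptions made for illustration, and the slow trial-division test merely stands in for the lookup-table version. One deliberate difference: before a slot is reused, the sketch joins the thread that last occupied it, so that result is collected before the slot's data is overwritten.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>

enum { NUMBER_OF_THREADS = 4 };  // Assumed; ideally the number of CPU cores.

// Work description and result slot for one worker (hypothetical layout).
typedef struct
{
  uint32_t StartNumber;
  uint32_t NumberToCheck;
  uint32_t NumberFound;
} WORK_TYPE;

static sem_t Semaphore;

// Plain trial division.  Stands in for the table-driven prime test.
static int IsPrimeSlow( uint32_t Number )
{
  uint32_t Divisor;

  if ( Number < 2 )
    return 0;

  for ( Divisor = 2; Divisor <= Number / Divisor; ++Divisor )
    if ( 0 == ( Number % Divisor ) )
      return 0;

  return 1;
}

static void * Worker( void * ArgumentPointer )
{
  WORK_TYPE * Data = (WORK_TYPE *)ArgumentPointer;
  uint32_t Index;

  Data->NumberFound = 0;
  for ( Index = 0; Index < Data->NumberToCheck; ++Index )
    if ( IsPrimeSlow( Data->StartNumber + Index ) )
      ++Data->NumberFound;

  // Hand one count back to the dispatcher.
  sem_post( &Semaphore );
  return NULL;
}

// Count the primes in [ First, First + Total ), ChunkSize numbers per thread.
// ChunkSize must be greater than zero.
uint32_t CountPrimes( uint32_t First, uint32_t Total, uint32_t ChunkSize )
{
  pthread_t Threads[ NUMBER_OF_THREADS ];
  WORK_TYPE Data[ NUMBER_OF_THREADS ] = { { 0, 0, 0 } };
  uint32_t NumberOfPrimes = 0;
  unsigned ThreadIndex = 0;

  sem_init( &Semaphore, 0, NUMBER_OF_THREADS );

  while ( Total )
  {
    uint32_t Chunk = ( Total < ChunkSize ) ? Total : ChunkSize;

    // Wait until fewer than NUMBER_OF_THREADS workers are running.
    sem_wait( &Semaphore );

    // Before reusing a slot, collect the result of its previous worker.
    if ( Data[ ThreadIndex ].NumberToCheck )
    {
      pthread_join( Threads[ ThreadIndex ], NULL );
      NumberOfPrimes += Data[ ThreadIndex ].NumberFound;
    }

    Data[ ThreadIndex ].StartNumber   = First;
    Data[ ThreadIndex ].NumberToCheck = Chunk;
    pthread_create( &Threads[ ThreadIndex ], NULL, Worker, &Data[ ThreadIndex ] );

    First += Chunk;
    Total -= Chunk;
    ThreadIndex = ( ThreadIndex + 1 ) % NUMBER_OF_THREADS;
  }

  // Wait for the workers that are still outstanding.
  for ( ThreadIndex = 0; ThreadIndex < NUMBER_OF_THREADS; ++ThreadIndex )
  {
    if ( Data[ ThreadIndex ].NumberToCheck )
    {
      pthread_join( Threads[ ThreadIndex ], NULL );
      NumberOfPrimes += Data[ ThreadIndex ].NumberFound;
    }
  }

  sem_destroy( &Semaphore );
  return NumberOfPrimes;
}
```

Calling CountPrimes( 2, 99998, 10000 ) covers 2 through 99,999 in chunks of 10,000 and should return 9,592, the number of primes below 100,000. On most systems this needs to be built with the compiler's -pthread option.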

I knew this code would tax the CPU rather hard, so I wanted to measure how long the process would take. "time.h" is a C standard header. Unfortunately "timespec", which offers a high-resolution way of checking the time, isn't part of the C99 standard. But we can get time at a resolution of seconds, and that's good enough. To do this, we simply mark the time when the process begins, and mark it again at the end. The difference (which "difftime" computes) is the total number of seconds that have elapsed.
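The mark-and-subtract timing can be sketched with nothing but "time" and "difftime" from time.h. SecondsSince and SplitSeconds are hypothetical helper names, not something taken from the full source.

```c
#include <time.h>

// Seconds elapsed since Start, at the one-second resolution of time().
double SecondsSince( time_t Start )
{
  return difftime( time( NULL ), Start );
}

// Split a number of seconds into hours, minutes and seconds.
void SplitSeconds
(
  unsigned long Total,
  unsigned long * Hours,
  unsigned long * Minutes,
  unsigned long * Seconds
)
{
  *Hours   = Total / 3600;
  *Minutes = ( Total % 3600 ) / 60;
  *Seconds = Total % 60;
}
```

The idea is to store time( NULL ) in a time_t before the work starts and pass it to SecondsSince once the work is done. As a check on the arithmetic, SplitSeconds turns 6,054 seconds into 1 hour, 40 minutes and 54 seconds, which rounds to 1 hour, 41 minutes.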

The answer: There are 203,280,221 primes between 2 and 4,294,967,295, and it took 6,054 seconds (1 hour, 41 minutes) on a quad-core clocked at 2.5 GHz.

The full source code is here.

1 comment has been made.

From Antony


February 17, 2010 at 3:10 AM

Excellent post! I always enjoy a solid technical post (and code) It saved me a good week - Keep up the good work!

February 02, 2010

Prime numbers - Part 1

So I mentioned yesterday that I wrote a simple program to check to see if some numbers in a list were prime or not. I didn't post it because I wanted to spend a little more time with it and give it an article of its own.

Finding prime numbers goes back a long time. The easiest way to see if a number is prime is to simply check whether the number is divisible by any of the prime numbers up to the square root of the number in question.

A number (call it x) is prime if there are no two whole numbers greater than 1 (call them a and b) such that a * b = x. All non-prime numbers can be expressed as the product of two or more prime numbers. For example, 125 can be made from 5 * 25, but 25 can be made from 5 * 5. So 5 * 5 * 5 = 125, and represents the most factored version of 125. This is true of any non-prime number. Since it takes at least two prime factors to make a non-prime number, we only need to check the primes up to √x (the square root of the number). This is because if x = a * a, then √x = a. If x = a * b, and b is greater than a, then a must be less than √x. Thus, we only need to check primes up to √x.

This makes for a nice brute-force test method, and it's pretty simple to implement. If we're going to test a number, we need all the primes up to the square root of that number. For what I needed, the number was capped at 32 bits, or 2^32. √(2^32) = (2^32)^(1/2) = 2^16 = 65536. So I need a list of all the prime numbers up to 2^16. We can make the list fairly quickly. Have a look at this function:

#include <stdbool.h>

enum { NUMBER_OF_LOOKUP_PRIMES = 6542 };
unsigned PrimeNumbers[ NUMBER_OF_LOOKUP_PRIMES ];

void GeneratePrimeNumberLookup()
{
  unsigned Index;
  unsigned PrimeNumberCount = 0;

  PrimeNumbers[ PrimeNumberCount++ ] = 2;

  for ( Index = 3; Index < 0x10000ul; ++Index )
  {
    bool IsPrime = true;

    unsigned SubIndex = 0;
    while ( ( SubIndex < PrimeNumberCount )
         && ( IsPrime ) )
    {
      // Does it divide evenly by this prime number?
      if ( 0 == ( Index % PrimeNumbers[ SubIndex ] ) )
      {
        IsPrime = false;
        break; // <- We can stop checking.
      }

      // Try the next prime in the list.
      ++SubIndex;
    }

    if ( IsPrime )
      PrimeNumbers[ PrimeNumberCount++ ] = Index;
  }
}

We have an array to hold our prime numbers. We set the first entry in the array to two, the first prime number (1 is not counted as a prime). From this, we can test the rest of the numbers. We loop from 3 to 2^16. For each number, we see if any of the primes already in the list will divide evenly into the number in question. If any of them do, the number is not prime.

It happens that there are 6542 prime numbers between 2 and 2^16. Knowing that, we can cap the array size of our lookup table at this value, since the number of primes in this range will never change.

We now have enough prime numbers to check any number up to 2^32. Unlike in the function that generates the prime lookup table, here we can throw in one more speed improvement: we only need to check up to the square root of the number in question. So here are the guts of the prime test function:

bool IsPrime( uint32_t Number )
{
  bool IsPrime   = true;
  uint16_t Root  = SquareRoot( Number );
  unsigned Index = 0;

  while ( ( Index < NUMBER_OF_LOOKUP_PRIMES )
       && ( PrimeNumbers[ Index ] <= Root )
       && ( IsPrime ) )
  {
    if ( 0 == ( Number % PrimeNumbers[ Index ] ) )
      IsPrime = false;

    // Try the next prime in the lookup table.
    ++Index;
  }

  return IsPrime;
}

Note that we can add two shortcuts. First, checking only up to the root of the number, as discussed. And second, testing whether the number is even, since 2 is the only even number that is prime.

That's it. Now we have a simple-to-implement prime number test that works on numbers up to 2^32. Here is the full source code to a command line version of the test. Compile it, pass it a number (or several), and the program will tell you if the number is prime or not.
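For reference, here is a compact, self-contained sketch of the whole scheme with both shortcuts in place. It is written from the description above rather than copied from the full source, so every name in it (BuildLookupPrimes, IntegerSquareRoot, IsPrime32, LookupPrimes) is an assumption, and the bit-by-bit square root is just one way to fill in the SquareRoot call.

```c
#include <stdbool.h>
#include <stdint.h>

enum { LOOKUP_PRIME_COUNT = 6542 };  // Number of primes below 2^16.
static uint16_t LookupPrimes[ LOOKUP_PRIME_COUNT ];

// Fill the table with every prime below 2^16.  Returns how many were stored.
unsigned BuildLookupPrimes( void )
{
  unsigned Count = 0;
  uint32_t Candidate;

  LookupPrimes[ Count++ ] = 2;

  // Only odd candidates need checking, and only against odd primes.
  for ( Candidate = 3; Candidate < 0x10000ul; Candidate += 2 )
  {
    bool Prime = true;
    unsigned Index;

    for ( Index = 1; Index < Count; ++Index )
    {
      uint32_t Divisor = LookupPrimes[ Index ];

      if ( Divisor * Divisor > Candidate )
        break;  // Past the square root.

      if ( 0 == ( Candidate % Divisor ) )
      {
        Prime = false;
        break;
      }
    }

    if ( Prime && ( Count < LOOKUP_PRIME_COUNT ) )
      LookupPrimes[ Count++ ] = (uint16_t)Candidate;
  }

  return Count;
}

// Largest integer whose square does not exceed Number, built bit by bit.
static uint32_t IntegerSquareRoot( uint32_t Number )
{
  uint32_t Root = 0;
  uint32_t Bit;

  for ( Bit = 1ul << 15; Bit != 0; Bit >>= 1 )
  {
    uint32_t Trial = Root | Bit;
    if ( Trial * Trial <= Number )
      Root = Trial;
  }

  return Root;
}

// Prime test for any 32-bit number.  Call BuildLookupPrimes() first.
bool IsPrime32( uint32_t Number )
{
  uint32_t Root;
  unsigned Index;

  if ( Number < 2 )
    return false;

  // Shortcut: 2 is the only even prime.
  if ( 0 == ( Number & 1u ) )
    return ( 2 == Number );

  // Shortcut: only primes up to the square root can be factors.
  Root = IntegerSquareRoot( Number );
  for ( Index = 1;
        ( Index < LOOKUP_PRIME_COUNT ) && ( LookupPrimes[ Index ] <= Root );
        ++Index )
  {
    if ( 0 == ( Number % LookupPrimes[ Index ] ) )
      return false;
  }

  return true;
}
```

BuildLookupPrimes should report 6542 entries. A useful edge case is 4293001441, which is 65521 * 65521 (65521 being the largest prime below 2^16): it is only caught if the square root is computed exactly.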


(thanks Erica for the correction)

February 01, 2010

A good 32-bit LFSR polynomial

At work on Friday I came across an interesting math problem. I needed a pseudo-random bit stream, and I needed the generator function to be fast. The typical random number function is a linear congruential generator, most often f(x) = (1103515245 * x + 12345) mod 2^32. The modulus is implied: the arithmetic is simply allowed to overflow. Multiplies are usually expensive in terms of CPU cycles. A faster generator is a linear feedback shift register (LFSR), which is made up of shifts and XORs. I've used LFSRs a number of times when I needed random data, without looking too closely at the "random" stream of data they produced.

Using a 32-bit Galois LFSR with the polynomial x^32 + x^31 + x^29 + x + 1 and a seed of 0x12345678 produced this output:

2B3C091A 159E048D 8ACF0246 4566D123
A2B36891 D158E448 68AC7224 34563912
1A2B1C89 8D14DE44 468A6F22 23453791
91A3CBC8 48D0B5E4 24685AF2 12342D79
891A16BC 448D0B5E 2247D5AF 9122BAD7
C8915D6B E449FEB5 F225AF5A 791387AD
BC8893D6 5E4449EB AF2224F5 D791127A
6BC9D93D B5E5BC9E 5AF38E4F AD789727
D6BC4B93 EB5E25C9 F5AF12E4 7AD6D972
3D6B6CB9 9EB4E65C 4F5A732E 27AD3997
93D7CCCB C9EAB665 E4F55B32 727BFD99
B93CAECC 5C9E5766 2E4F2BB3 9726C5D9
CB9362EC 65C8E176 32E470BB 9972385D
CCB91C2E 665DDE17 B32FBF0B D9968F85
ECCB47C2 7664F3E1 BB3279F0 5D993CF8
2ECDCE7C 1767B73E 0BB28B9F 85D945CF
C2EDF2E7 E177A973 F0BA84B9 F85D425C
7C2FF12E 3E16A897 9F0B544B CF84FA25
E7C27D12 73E13E89 B9F1CF44 5CF9B7A2
2E7D8BD1 973F95E8 4B9E9AF4 25CF4D7A
12E6F6BD 89737B5E 44B8EDAF A25C76D7
D12E3B6B E8971DB5 F44ADEDA 7A256F6D
BD13E7B6 5E88A3DB AF4451ED D7A228F6
B6BC9747 DB5E4BA3 EDAF25D1 F6D6C2E8

If you haven't noticed, the pattern "scrolls" digits to the right. This is what one would expect, since this is a "shift" register after all. Each number is different, and the period before any number (except for zero) comes around again is 2^32. However, the pattern between each successive number is fairly obvious. This is because of the polynomial. The shift register only introduces bits 31, 29, 1 and 0 when clocked. Otherwise, the output is simply shifted to the right by one. I've arranged the data in 4 columns because each digit (nibble) is 4 bits, so after 4 shifts each nibble has moved exactly one digit to the right.
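The shift-and-XOR step itself is tiny. The sketch below is one common way to write a right-shifting Galois LFSR; LfsrStep and LfsrRun are hypothetical names, and the tap mask 0xD0000001 is my own encoding of the polynomial x^32 + x^31 + x^29 + x + 1 for a register that shifts right, not a value quoted from the original code.

```c
#include <stdint.h>

// One clock of a right-shifting Galois LFSR.  Mask holds the tap bits that
// are XORed in whenever a 1 is shifted out of the bottom of the register.
uint32_t LfsrStep( uint32_t State, uint32_t Mask )
{
  uint32_t Feedback = State & 1u;

  State >>= 1;
  if ( Feedback )
    State ^= Mask;

  return State;
}

// Clock the register Steps times and return the final state.
uint32_t LfsrRun( uint32_t State, uint32_t Mask, unsigned Steps )
{
  while ( Steps-- )
    State = LfsrStep( State, Mask );

  return State;
}
```

Starting from 0x12345678 with the mask 0xD0000001, the first four states are 0x091A2B3C, 0x048D159E, 0x02468ACF and 0xD1234566. The listing above shows the same values with their two 16-bit halves swapped (2B3C091A, 159E048D, ...), presumably a result of how the words were dumped.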

While this pattern probably would work fine for what I was trying to do, I thought there must be a better solution. A polynomial with more taps should produce a better random stream. But how exactly does one go about generating a polynomial for a shift register?

What is needed is an irreducible primitive polynomial. One can put any polynomial (i.e. any number as the polynomial term) into an LFSR, but it may not (and probably won't) produce a maximum length shift register. That is, a 32-bit word has 2^32 possibilities. With the correct polynomial, a 32-bit LFSR will produce 2^32 unique outputs before repeating (strictly 2^32 - 1, since the all-zero state only ever leads back to itself). Without the correct polynomial, one of two things will happen. The output will repeat after fewer than 2^32 iterations. Or the output will never return to its starting value and some smaller sub-sequence will repeat.

The polynomial of an LFSR is XORed in with the shifted data. My thought was, if I wanted a better random pattern, I'd better XOR more bits. And not just more bits, but bits spread out more or less evenly. This should change the stream of data quite a bit.

So how does one create an irreducible primitive polynomial? Good question. I still don't have a complete solution, but I do have part of the solution. I found a program written by a Scott Duplichan that generates primitive polynomials—although not necessarily irreducible. It's designed to generate huge primitive polynomials—on the order of thousands of bits. However, it also would do 32-bit. What was nice about this program was that I could tell it the "weight" I wanted. That is, how many bits I wanted set. The number has to be odd, so I chose to generate polynomials with 15 bits.

There are a lot of primitive polynomials with a weight of 15: millions. And most of them did not have bit patterns that were uniformly distributed. For example, one of the first polynomials produced was 0x00041FFF. While there are 15 bits set, they are pretty much all on one side. The program didn't have a way to tell it to generate polynomials that had uniform distribution, but it did have the ability to find a set number of polynomials. That was good enough for me. I set the program off to find ten million primitive polynomials, and went off to do something else. Just over an hour later (4152 seconds actually), I had 10 million primitive polynomials with 15 bits set.

The next task was to find polynomials with a more uniform bit distribution. To do this, I whipped up a filter program to only save polynomials that had, at most, 3 identical bits in a row. This significantly reduced the number of polynomials. But I still did not have a list of irreducible primitive polynomials. I started reading about how to test for this, but it was taking too much time. It would be easier to brute force this problem. I wanted a polynomial that was maximum length, that is, one that produced 2^32 unique values before repeating. Any polynomial that did this was what I was looking for.
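A filter like that only needs to measure the longest run of identical bits in each candidate. The sketch below is a guess at how it could look, not the original filter program: LongestBitRun and PassesRunFilter are made-up names, and it assumes that runs of zeros count as well as runs of ones.

```c
#include <stdint.h>

// Length of the longest run of identical bits (0s or 1s) in a 32-bit word.
unsigned LongestBitRun( uint32_t Value )
{
  unsigned Longest = 1;
  unsigned Current = 1;
  unsigned Bit;

  for ( Bit = 1; Bit < 32; ++Bit )
  {
    unsigned This     = ( Value >> Bit ) & 1u;
    unsigned Previous = ( Value >> ( Bit - 1 ) ) & 1u;

    // Extend the current run, or start a new one when the bit changes.
    if ( This == Previous )
      ++Current;
    else
      Current = 1;

    if ( Current > Longest )
      Longest = Current;
  }

  return Longest;
}

// Non-zero when no more than 3 identical bits appear in a row.
int PassesRunFilter( uint32_t Polynomial )
{
  return LongestBitRun( Polynomial ) <= 3;
}
```

By this measure 0x00041FFF is rejected (its longest run is 13 bits), while 0x9253292B passes with a longest run of just 2.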

It only takes a few seconds to brute-force check all 2^32 possibilities for an LFSR. With the bit distribution restrictions, there were only 9,441 of the original ten million polynomials to check. So, I launched a brute-force attack, and went home for the weekend.
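The brute-force check amounts to clocking the register until the seed comes back and counting the steps. Here is a minimal sketch of that idea; LfsrPeriod and its Limit parameter are assumptions of mine, and the register is the same right-shifting Galois form as before.

```c
#include <stdint.h>

// Count how many clocks it takes for a right-shifting Galois LFSR to return
// to Seed.  Gives up and returns 0 if that has not happened within Limit
// clocks.  Seed must be non-zero.
uint64_t LfsrPeriod( uint32_t Seed, uint32_t Mask, uint64_t Limit )
{
  uint32_t State = Seed;
  uint64_t Steps = 0;

  do
  {
    uint32_t Feedback = State & 1u;

    State >>= 1;
    if ( Feedback )
      State ^= Mask;

    ++Steps;
  }
  while ( ( State != Seed ) && ( Steps < Limit ) );

  return ( State == Seed ) ? Steps : 0;
}
```

To keep the example quick, it can be tried on a 16-bit register: with the well-known maximum length mask 0xB400 and seed 0xACE1 it reports 65,535 steps, which is 2^16 - 1. A 32-bit polynomial counts as maximum length if the same call, with a limit of at least 2^32, reports 4,294,967,295.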

When I came back this morning, I had a list of 40 irreducible primitive polynomials. I'm not sure how long it took to complete this search, but that was irrelevant—I had my data.

Forty was still more polynomials than I needed. So I decided to reduce the list a little further. I just wanted prime polynomials. This is another brute-force test. A number is prime if none of the prime numbers up to its square root divide evenly into it. My list was reduced from 40 to 10. I picked the first number from the list, but any would have worked.

Again, my 32-bit Galois LFSR, this time with the polynomial 0x19253292B (0x9253292B without the leading x^32 bit) and a seed of 0x12345678, produced this output:

2B3C091A 159E048D 8ACF0246 6C4C9370
362649B8 1B1324DC 24A2803D 9251401E
6003B25C 192ACB7D 8C9565BE 6F61A08C
1E9BC215 A666F359 D33379AC 40B2AE85
A0595742 7907B9F2 15A8CEAA 0AD46755
856A33AA 42B519D5 88719EB9 ED13DD0F
DFA2FCD4 6FD17E6A 1EC3AD66 264AC4E0
13256270 20B9A36B B977C3E6 7590F3A0
3AC879D0 1D643CE8 0EB21E74 07590F3A
2A8795CE 3C68D8B4 1E346C5A 0F1A362D
878D1B16 6AED9FD8 1C5DDDBF A705FC8C
7AA9EC15 947FE459 E314E07F F18A703F
F8C5381F D5498E5C 438FD57D 88ECF8ED
C4767C76 623B3E3B 98368D4E 4C1B46A7
8F26B100 47935880 0AE2BE13 85715F09
EB93BDD7 DCE2CCB8 6E71665C 1E13A17D
A622C2ED D3116176 40A3A2E8 097AC327
84BD6193 EB75A29A 5C91C31E 0763F3DC
2A9AEBBD 954D75DE 638DA8BC 18EDC60D
A55DF155 FB85EAF9 D4E9E72F C35FE1C4
4884E2B1 A4427158 522138AC 003B8E05
A936D551 D49B6AA8 4366A707 A1B35383
F9F2BB92 7CF95DC9 9757BCB7 E280CC08
71406604 38A03302 1C501981 8E280CC0
47140660 238A0330 11C50198 21C9929F

And this looks much better. But just because you can't see a pattern doesn't mean one doesn't exist. In fact, one obviously exists, since the data comes from an equation. But without the equation, how good does the random data look? The quickest way to find out if data is random is to try to compress it. If it compresses, it isn't very random. For example, with 256 MB of data from the original polynomial, we get a stream that 7-zip could compress to 54% of its original size. With the new polynomial, 7-zip was only able to compress it to 99.964% of its original size.

Compression isn't the best judge of a file's randomness. I knew there were better algorithms out there for generating an entropy number. Doing a quick search, I found this program originally written by John Walker, one of the co-founders of Autodesk (the company behind AutoCAD). It dates back to 1985, but the nice thing about math (unlike computer hardware) is it still works fine after 25 years.

The program reports a chi-square distribution figure, given as a percentage. When this percentage is close to 0% or 100%, the data stream is said to not be very random; even 95% or 5% are suspect. The chi-square figure for the first data set is 0.01%, or not even close to random. For the new polynomial it is 38.58%. For comparison, some output of the XTEA encryption algorithm run in a feedback loop had a chi-square figure of 26.5%. Encryption algorithms aim to produce highly random output.
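To give a feel for what sits behind that percentage, here is a rough sketch of the underlying statistic: count how often each of the 256 byte values occurs and compare the counts with a perfectly even spread. ChiSquareBytes is a hypothetical helper, not code from the program mentioned above, and it stops at the raw statistic. Turning that into a percentage requires the chi-square distribution with 255 degrees of freedom, which is left out here.

```c
#include <stddef.h>

// Chi-square statistic of the byte values in Data against a uniform
// distribution over all 256 possible values.
double ChiSquareBytes( const unsigned char * Data, size_t Length )
{
  size_t Counts[ 256 ] = { 0 };
  double Expected;
  double ChiSquare = 0.0;
  size_t Index;

  if ( 0 == Length )
    return 0.0;

  // Tally how many times each byte value appears.
  for ( Index = 0; Index < Length; ++Index )
    ++Counts[ Data[ Index ] ];

  // With perfectly even data, every value would appear this often.
  Expected = (double)Length / 256.0;

  for ( Index = 0; Index < 256; ++Index )
  {
    double Difference = (double)Counts[ Index ] - Expected;
    ChiSquare += Difference * Difference / Expected;
  }

  return ChiSquare;
}
```

A buffer holding each byte value exactly once scores 0, which is suspiciously even. A buffer of 256 zero bytes scores 65,280, which is wildly uneven. Random data should land somewhere near 255, the number of degrees of freedom.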

So our new polynomial of 0x9253292B produces a good pseudo-random data stream when used in a 32-bit LFSR.

   It might be a while before I post pictures.  This evening, my camera started acting really strange.  As soon as I put the battery in it, the shutter started snapping, even with the camera turned off.  I tried several things, but nothing seemed to work.  Once in a while, I got the message "Err 99".  After some reading on the web, I found this is a fairly common problem with the 20D when the shutter system dies.  Seems the average cost of repair is around $200.  So the question is, repair or upgrade?

4 comments have been made.

From software development in Surrey

January 19, 2010 at 4:03 AM

That was an inspiring post, Cool piercings, Thanks for writing, most people don't bother.

From Talon

USS Enterprise The Mobile Chernoble

January 27, 2010 at 6:30 AM

UPGRADE! (if it isn't too late for my opinion)

From Erica Too Lazy Too Log in

January 31, 2010 at 5:27 PM

Moar liek "Dies Irae," amirite?

From Application developers in Sri Lanka

Sri Lanka

July 10, 2016 at 4:10 AM

Not sure how we can join as well?

   Happy New Year!  Picture is Kelly and myself--not sure who took the shot though.

1 comment has been made.

From Steve

JanesHell, WI

January 28, 2010 at 3:58 PM

I'm thinking it might have been Crystal who took this shot. I know she was playing around with your camera at some point that night.


   Met up with Courtney in Madison, and later Aidan.  We had some East African food for dinner, which I had never tried before.  But I was an instant fan, and will have to try that again sometime.