November 06, 2009
The move

On the 18th, I wrote about a simple rows/columns error correcting code (ECC) that could be used to correct a single bit error, and detect a double bit error. This method could miss double bit errors if one error happened in the rows parity, and the other in the columns parity. But it could always correct a single bit error. I mentioned Hamming Codes as a better method. So what are they and how do they work?
Hamming Codes are a set of overlapping parity bits, setup such that when a signal error occurs, the parity bits effected will add up to the incorrect bit. The parity bits are places at locations that are even powers-of-2. Each parity bit includes a limit set of data bits.
Here is an example of plain Hamming Codes used to correct any single bit of error. Note that if two or more errors exist, the correction does not work, nor can the error be detected.
The above demo shows a (15,11) Hamming Code. There are 11 data bits (d1-d11) and 4 parity (p1-p4) bits. Data in are the payload—the data one wishes to protect. With parity shows the data encoded with the parity bits. Note how each parity bit lands on an even power of 2 and the data is in between. Mod parity is a field where the data can be modified. Start by setting the data bits to some desired pattern (they start randomly). As the bits change, the data in With parity will be updated accordingly. Then hit Update parity. This will move the data from With parity to Mod parity. The data in With parity can be manually manipulated so errors can be introduced. Calc parity shows the the 4 parity bits under their respective locations. If everything is correct, these bits will all be zero. In the event of an error, one or more of these bits will be set. The double column in Calc parity corresponds to the error bit in a 1-bit error situation. This number will correspond to the column numbers at the top of the table—so if bit 3 is changed, the error will be 3 and p1 and p2 will be set. The calculation is simply in binary: p1-p4 reflect the error bit. So long hand, the error bit number is p1 * 1 + p2 * 2 + p3 * 4 + p4 * 8.
Playing with the above setup should quickly reveal one flaw: any more then 1-bit error introduction usually results in a "corrected" state, but will the wrong output. This happens because the Hamming Code can only deal with a 1 bit error.
The solution is to simply add one more parity bit that covers all the Hamming Code data. Now, should a 1-bit error occur, the same process as above takes place and the error corrected. But should a 2-bit error occur, the correction for a 1-bit error takes place, but the "corrected" data will still fail the finial parity check—always. Try it...
Note the only difference is the additional bit in column 16. This is simply a parity over bits 1 through 15 after the Hamming Parity bits have been calculated. It could also just be calculated over the data bits (which would work better in hardware implementations). It doesn't matter which method is used, because the results are the same: any 1-bit error can be corrected by the Hamming Code, and any 2-bit error detected.
More then 2-bits of error have no guarantee. They may be seen as 1-bit correctable errors, 2-bit uncorrectable or just not seen at all. It all depends on what bits are effected. This is the limitation of this error correcting code. No more then 2-bits of error every 16 bits can be detected, and no more then 1-bit of error every 16-bits can be detected. And the cost is an additional 5 bits check for every 11 data bits.
So how does this system compare to my row/column method? For my row/column method, it takes 16 parity bits for every 64 bits of data, so a 1.125x the data. This Hamming Code with additional parity takes 5 parity bits for 11 data bits, or ≈1.45x the data. However, Hamming Codes become more efficient with larger sets of data. A comparable set of data would be to use a (71,64) Hamming Code with parity (so 72 bits total), which is 1.125x&mdashl;so just as efficient. However, the rows/column system suffers from one problem—a two bit error where one error is in the columns parity and one in the rows parity will look like a single bit correctable error. The Hamming Code method will always catch a two bit error.
Are there methods to fix more then 1-bit of error, or catch more then 2 faulty bits? Sure—several. We could just add an other hamming code on top of the first one. Each time this is done, 1 more bit of correctable error is added. We can also add more overall parity bits if we're just looking to detect more uncorrectable errors. But there are better systems for this.
The question for is, how many errors are you expecting, and what is their frequency? In a home computer's random access memory, the likely hood of an error is very small and not likely to happen in bursts. PC memory is susceptible to cosmic rays—no joke. On a CD-ROM, errors are much more likely to happen in bursts because of scratches or some other surface interference. If your transmission medium is a modem and a bad phone line, or vast stretches of space when communicating to a distant spacecraft like Cassini-Huygens, the requirements are much more difficult to generate.
I've been curious as to how PC deal with ECC errors, but so far, I haven't found any information. With the PowerPC CPU I've been working on, interrupts can be generated for both correctable and uncorrectable errors. When a correctable error happens, all that is required is that the bad memory location be written to with the correct data. So simply reading the data and writing it back will do this. You can ignore it, but then every time that location is read, an ECC correctable error takes place, and you risk having an other bit change, making the location an uncorrectable error. For our application, uncorrectable errors are an end-game. If this happens, it's best just to restart the system--who knows what that memory belonged to. However, large multitasking system (like one's home PC) could identify which thread the memory belong to and force the thread to restart. What actually happen in operating systems like Linux, I haven't a clue. But there's a lot of possibility.