How often do ECC-correctable single-bit errors occur, and what about double- or multi-bit errors?

The fact is that DRAM components are not perfect. Some data bits inside every DRAM will flip from 0 to 1 or from 1 to 0 from time to time. There are multiple analyses and statistics on how often bit flips occur in DRAMs, but none of them applies universally to all applications. One interesting study comes from the University of Toronto: 'DRAM Errors in the Wild: A Large-Scale Field Study'. It monitored DRAM errors in thousands of systems in Google's server farms over a period of 2 1/2 years. Those servers were presumably well air-conditioned, dust-free and shielded from radiation of all kinds. Still, the study arrived at 25,000 to 70,000 FIT (failures per billion device hours) of 'ECC correctable errors' per Megabit of DRAM. This converts into an average of one single-bit error every 14 to 40 hours per Gigabit of DRAM.
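The conversion from FIT per Megabit to hours between errors per Gigabit can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `hours_between_errors` is made up for illustration, and it assumes 1 Gigabit = 1000 Megabit (using 1024 shortens the intervals by about 2%).

```python
# Sketch: convert a FIT rate per Megabit into the mean time between
# correctable errors for a DRAM of a given size.
# Assumption: 1 Gigabit is counted as 1000 Megabit.

HOURS_PER_FIT_WINDOW = 1e9   # FIT = failures per billion device hours
MBIT_PER_GBIT = 1000         # assumption, see lead-in


def hours_between_errors(fit_per_mbit, size_mbit=MBIT_PER_GBIT):
    """Mean number of hours between errors for a DRAM of size_mbit Megabit."""
    errors_per_hour = fit_per_mbit * size_mbit / HOURS_PER_FIT_WINDOW
    return 1.0 / errors_per_hour


if __name__ == "__main__":
    for fit in (25_000, 70_000):
        hours = hours_between_errors(fit)
        print(f"{fit} FIT/Mbit -> one error every {hours:.1f} h per Gbit")
```

The two ends of the FIT range come out at roughly 40 and 14.3 hours, which matches the 14 to 40 hours quoted above.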

The field study also shows that the error rate increases with the age of the memory. Brand-new DRAMs might not show any errors for weeks or months, but then the error rate suddenly goes up.
Uncorrectable errors can be double- or multi-bit errors or complete functional failures of the DRAM. None of these can be corrected, but they are extremely rare.

A 1 Gigabit ECC DRAM contains about 16 million 64-bit data words. In each of these 64-bit words, one error is correctable. In other words: statistically, the chance that a second error lands in the same word as an existing one is about 1 in 16 million, so roughly one out of 16 million hits might be a double-bit error. If one error hits per day, it would hypothetically take about 16 million days, or roughly 45,000 years, for a double-bit error to occur. But this is just the maths. In the end, the real numbers depend on the stress and the environment the application is running in.
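The same back-of-the-envelope estimate can be written out as a short Python sketch. It follows the simplified reasoning above (one double-bit hit per "number of words" single-bit hits) and is not a rigorous reliability model. The function names are illustrative, and 1 Gigabit is taken as 2^30 bits.

```python
# Sketch of the rough double-bit estimate from the text.
# Assumptions: 1 Gigabit = 2**30 bits, errors are spread evenly over all
# words, and one double-bit hit is expected per "words" single-bit hits.

GIGABIT = 2**30
WORD_BITS = 64
DAYS_PER_YEAR = 365.25


def word_count(total_bits=GIGABIT, word_bits=WORD_BITS):
    """Number of ECC-protected data words in the DRAM."""
    return total_bits // word_bits


def years_to_double_bit(errors_per_day=1.0):
    """Hypothetical years until two errors fall into the same word."""
    days = word_count() / errors_per_day
    return days / DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"words: {word_count():,}")
    print(f"years at 1 error/day: {years_to_double_bit():,.0f}")
```

With exactly 16,777,216 words and one error per day this gives about 46,000 years, so the result is in the tens of thousands of years either way. A higher error rate shortens the figure proportionally, for example `years_to_double_bit(errors_per_day=2.0)` halves it.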
