
Advantage of ECC RAM?



Hi Philip,

 

We use ECC RAM for our file servers at work (I build them myself), but think of it this way: the vast majority of computers sold in the world (99.9%) come with non-ECC memory, and they work just fine. You don't normally hear about people losing data or having it corrupted because of bad memory. This isn't to say that it doesn't happen, but I wouldn't sweat it. Having a redundant file system (not necessarily RAID, but at least a nightly backup) and long-term archival is IMHO a much more worthwhile use of your time and effort.

 

And I would take exception to the following statement:

 

"computers are such error-prone devices"

 

I happen to think they are remarkably stable, solid products. Now software, including operating systems, that's another issue entirely...


The purpose of ECC is to prevent data corruption from cosmic rays. These would cause an error about once a month per 256MB of RAM (see <a href="http://www.eetimes.com/news/98/1012news/ibm.html">this article</a>). This isn't a big enough problem for individual use, but if you are running a server that runs for months on end and handles money, you want to be absolutely sure, even if you pay a performance penalty for it.
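To put the quoted figure in perspective, here is a rough back-of-the-envelope sketch. It assumes (my assumption, not something the article is confirmed to state) that the rate of about one error per month per 256MB scales linearly with memory size and uptime, and that errors arrive independently, so a Poisson model applies. The function names are made up for illustration.

```python
import math

# Assumption: the rate quoted above (about one error per month per 256 MB)
# scales linearly with RAM size and time, and errors are independent events.
ERRORS_PER_MONTH_PER_256MB = 1.0


def expected_errors(ram_mb: float, months: float) -> float:
    """Expected number of memory errors for a given RAM size and uptime."""
    return ERRORS_PER_MONTH_PER_256MB * (ram_mb / 256.0) * months


def prob_at_least_one(ram_mb: float, months: float) -> float:
    """Poisson probability of seeing one or more errors in that period."""
    return 1.0 - math.exp(-expected_errors(ram_mb, months))


if __name__ == "__main__":
    # A small desktop for one month versus a larger server for six months.
    for ram_mb, months in [(256, 1), (1024, 6)]:
        mean = expected_errors(ram_mb, months)
        p = prob_at_least_one(ram_mb, months)
        print(f"{ram_mb} MB for {months} month(s): "
              f"expected errors = {mean:.1f}, P(at least one) = {p:.3f}")
```

Under these assumptions the expected count grows with both memory size and uptime, which is why the concern is mostly about long-running servers. Most such flips would land in memory that is unused or unimportant, so the number of errors a user actually notices would be much lower.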

Here I use ECC memory for our dedicated upload/download box, which is used for processing and uploading a valuable customer's scans. I basically will never get the originals back for a re-scan, and I had the extra ECC memory anyway. We also use ECC memory in our three Fiery/RIP boxes and in two dedicated computers that drive scanners. The Fiery/RIP boxes come with ECC memory from the factory. Most of my other computers have plain-Jane memory. ECC memory is often a lot more expensive, and surplus ECC memory varies radically in price.

A friend had his computer's power supply go up in smoke two weeks ago; it ruined all three hard disk drives. Two of the drives were set up in a RAID configuration as his "safe" backup. After this massive data loss, he is duplicating his data on three different HDAs located on three separate computers, and also burning CDs. He works with computers for a living.

OK, I think I got my answer, thanks. Hyun, I call them error-prone because for a user like me the software, operating systems and hardware are somewhat indistinguishable. Generally I am amazed at the things that occur with PCs. It's ironic that on one hand we can accomplish so much with a PC, and on the other there are these errors that occur. I guess I thought investing a couple hundred more might get rid of a few errors. I have two computers in the house. One is rock solid (for the most part; yes, errors occur with it occasionally) and the other has been one problem after another. In fact I am so gun-shy about it that I am inclined not to reuse a single component from it when I upgrade.

I build high-performance computers from scratch for a hobby and for money, and usually the boards I choose support ECC RAM, but I almost always turn it off (at the BIOS level) and install plain-Jane RAM.

I only leave it on if the machine is headed for the server rack / room. Otherwise, as a poster above said, it decreases performance slightly.


  • 10 months later...

It's even more complex than that. Some consumer chipsets will simply ignore the fact that the ECC bytes are available, so you don't really get a performance hit (beyond any more-conservative timings present in the DIMM's SPD ROM or similar), but you don't get any advantage, either.

 

If your chipset *does* support ECC, single-bit errors will be corrected transparently, while ones across more bits -- from, say, a really energetic cosmic ray -- will not be; the best ECC allows there is *detecting* the error and passing that information on to the OS. To my knowledge, Windows and FreeBSD have no mechanisms for recovering gracefully, so you'll bluescreen or panic, probably losing all your work... but "at least" corruption won't sneak in silently (important in the financial case, etc).
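The correct-one, detect-two behaviour described above can be illustrated with a toy SECDED (single error correct, double error detect) code. This is only a sketch: it protects 4 data bits with an extended Hamming (8,4) code, whereas real ECC DIMMs typically protect 64 data bits with 8 check bits, and the function names and bit layout here are my own choices for illustration, not any chipset's actual scheme.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the whole word even parity
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos                # XOR of the positions of all set bits
    parity_odd = sum(word) % 2 == 1
    if parity_odd:
        # Odd number of flips: treat as a single error at `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even parity but a nonzero syndrome: two bits flipped. This can be
        # detected but not repaired, which is the bluescreen/panic case.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))                    # clean word

    one_flip = list(word)
    one_flip[5] ^= 1
    print(decode(one_flip))                # single-bit error

    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(decode(two_flips))               # double-bit error
```

Note that three or more flipped bits can fool a code like this into a wrong "correction", so even ECC is not an absolute guarantee.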

 

Newer Windows, at least in 'Server' incarnations, might do something better; I'm going on information from the NT era. FreeBSD, on the other hand, can happily accrue years of uptime on the cheapest systems, and I've never actually seen an ECC-related panic on machines that (to my knowledge) have had it enabled. Most of today's tower cases and mainboards put the DIMMs perpendicular to the sky, possibly with a few thicknesses of CD-ROM drives and power supplies above, so single-bit errors are probably reduced, while the occasional really lucky particle might have a better chance of strafing through the die and causing multi-bit problems.

 

The Suns appear to log every single-bit error corrected, while the x86 universe doesn't really have a mechanism for that (for the single bit case, either the chipset handles it, and you never know, or it doesn't, and you never know until you notice corruption somewhere). However, since Sun rolls their own, they might have some design issues that make ECC hits more common than just for cosmic rays. ;) (Seriously, modern memory controllers are complicated beasts, and current ones incorporate some fair black magic to ensure the signals get through reliably; the likes of Intel and Via may be ahead there, since Sun sometimes has a flair for the simple. Then there's the question of 'Taiwanese capacitor syndrome,' something to worry about with almost any hardware manufactured from the late '90s through two years ago, which will often trash mainboards' onboard power regulation and cause all sorts of "flaking out.")

 

<A HREF="http://www.memtest86.com/">memtest86</A> is Free and indispensable if you want to check your DIMMs (and mainboard, etc) for reliability. (Of course, sans-ECC, you might have the bizarre luck of seeing a cosmic-ray-induced error while you test; recurring errors at a particular address would point to a definite hardware problem.) It's also a great tool if you want to try more aggressive CAS timings or similar than those in your DIMMs' autoconfiguration ROMs. (Note that stability there can also depend on the memory controller; I could never run my RAM at CAS2 on an old Socket 7 board, but when I upgraded to an Athlon with a more modern chipset, the same stick tested flawlessly at the slightly higher speed.)

 

<A HREF="http://vbulletin.newtek.com/archive/index.php/t-17753.html">This thread elsewhere</A>, and <A HREF="http://groups-beta.google.com/group/sol.lists.freebsd.chat/browse_thread/thread/77afef74db1b8a4c/0501ec78074c32f3?q=ECC+freebsd&_done=%2Fgroups%3Fq%3DECC+freebsd%26hl%3Den%26btnG%3DGoogle+Search%26&_doneTitle=Back+to+Search&&d#0501ec78074c32f3">this one</A> elaborate a little about ECC. One likely example of the failing capacitor syndrome can be found <A HREF="http://www.lemis.com/grog/diary-aug2003.html">here</A>, on his 8 August entry -- note how the originally tightly-regulated "Vc" or "Vcore" voltage has gone to hell. (Changing the power supply wouldn't fix that, and indeed, I think it didn't, though it's been months since I informed *him* of the <A HREF="http://www.geek.com/news/geeknews/2003Feb/bch20030207018535.htm">problem</A>. Anyhow, monitoring the onboard power regulation is a nice methodology to test.)
