Advantage of ECC RAM?

Discussion in 'Digital Darkroom' started by philip_sweeney, Feb 14, 2004.

  1. Should error-correcting RAM be considered? I know the cost goes up, but
    since computers are such error-prone devices, would moving to ECC RAM
    be a great advantage, or would the improvement be marginal?
     
  2. Hi Philip,

    We use ECC RAM for our file servers at work (I build them myself), but think of it this way: the vast majority of computers sold in the world (99.9%) come with non-ECC memory, and they work just fine. You don't normally hear about people losing data or having it corrupted because of bad memory. This isn't to say that it doesn't happen, but I wouldn't sweat it. Having a redundant file system (not necessarily RAID, but at least a nightly backup) and long-term archival is IMHO a much more worthwhile use of your time and effort.

    And I would take exception to the following statement:

    "computers are such error-prone devices"

    I happen to think they are remarkably stable, solid products. Now software, including operating systems, that's another issue entirely...
     
  3. ECC memory can be slower than non-ECC as it has more to do each cycle.
    It isn't really necessary for home or small office use. It is really for when you have large database servers running 24/7 constantly accessing and changing large amounts of memory.
     
  4. The purpose of ECC is to prevent data corruption from cosmic rays. These would cause an error about once a month per 256MB of RAM (see this article). This isn't a big enough problem for individual use, but if you are running a server that runs for months on end and handles money, you want to be absolutely sure, even if you pay a performance penalty for it.
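
    To put that figure in perspective, here is a rough back-of-the-envelope sketch in Python. The rate constant is simply the "about once a month per 256MB" number quoted above, not a measurement, and treating errors as independent random events (a Poisson process) is an assumption made only for illustration. The function names are made up for this example.

```python
import math

# Rate quoted in the post above: roughly one error per month per 256 MB.
# This is an assumed input, not a measured value.
ERRORS_PER_MONTH_PER_256MB = 1.0


def expected_errors(ram_mb, months):
    """Expected number of bit errors, scaling linearly with RAM size and time."""
    return ERRORS_PER_MONTH_PER_256MB * (ram_mb / 256) * months


def prob_at_least_one(ram_mb, months):
    """Chance of one or more errors, assuming independent (Poisson) events."""
    return 1 - math.exp(-expected_errors(ram_mb, months))


if __name__ == "__main__":
    # A 1 GB server left running for a year, at the quoted rate.
    print(expected_errors(1024, 12))
    # A 256 MB desktop over a single month.
    print(round(prob_at_least_one(256, 1), 3))
```

    At the quoted rate, 1 GB running for twelve months works out to an expected 48 errors, which is why the concern grows with memory size and uptime. How many of those flips land in memory that is actually in use is a separate question this sketch ignores.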
     
  5. Here I use ECC memory for our dedicated upload/download box, which is used for processing and uploading the scans we do for a dedicated, valuable customer. I basically will never get the originals back for a re-scan, and I had the extra ECC memory anyway. We also use ECC memory in our three Fiery/RIP boxes and in two dedicated computers that drive scanners. These Fiery/RIP boxes come with ECC memory from the factory. Most of my other computers have plain-Jane memory. ECC memory is often a lot more expensive; surplus ECC memory varies radically in price.
     
  6. As others have said, this is useful in 24/7 servers but not in workstations, particularly because error checking reduces performance.
     
  7. A friend had his computer's power supply go up in smoke two weeks ago; it ruined all three hard disk drives. Two of the drives were set up in a RAID configuration as his "safe" backup. After this massive data loss, he is duplicating his data on three different HDAs, located in three separate computers, and burning CDs too. Computers are his whole job.
     
  8. OK, I think I got my answer, thanks. Hyun, I call them error-prone because for a user like me the software, operating systems and hardware are somewhat indistinguishable. Generally I am amazed at the things that occur with PCs. It's ironic that on one hand we can accomplish so much with a PC, and on the other there are these errors that occur. I guess I thought investing a couple hundred more might get rid of a few errors. I have two computers in the house. One is rock solid (for the most part; yes, errors occur with it occasionally) and the other has been one problem after another. In fact I am so gun-shy about it that I am inclined not to reuse a single component from it when I upgrade.
     
  9. Error correcting / fault tolerant RAM is a solution in search of a problem when installed on a home computer. Unless of course the motherboard requires it.

    The stability of your power supply is a far more critical factor in terms of memory errors.
     
  10. I build high-performance computers from scratch for a hobby and for money, and usually
    the boards I choose support ECC RAM, but I almost always turn it off (at the BIOS level) and
    install plain-Jane RAM.

    I only leave it on if the machine is headed for the server rack / room. Otherwise, as an
    above poster said, it decreases performance slightly.
     
  11. It's overkill and that's an understatement. Keep your money in your pocket...

    :)
     
  12. It's even more complex than that. Some consumer chipsets will simply ignore the fact that the ECC bytes are available, so you don't really get a performance hit (beyond any more-conservative timings present in the DIMM's SPD ROM or similar), but you don't get any advantage, either.

    If your chipset *does* support ECC, single-bit errors will be corrected transparently, while ones across more bits -- from, say, a really energetic cosmic ray -- will not be; the best ECC allows there is *detecting* the error and passing that information on to the OS. To my knowledge, Windows and FreeBSD have no mechanisms for recovering gracefully, so you'll bluescreen or panic, probably losing all your work... but "at least" corruption won't sneak in silently (important in the financial case, etc).
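
    The "correct one bit, only detect two" behaviour can be sketched with a toy code. The Python below is a scaled-down illustration using an extended Hamming code on 4 data bits plus 4 check bits; real ECC DIMMs work on 72-bit words (64 data bits plus 8 check bits) and the exact code a given chipset uses is not something this thread specifies. The `encode` and `decode` names are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED word (list of 0/1 values)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the word is unrecoverable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    parity_broken = sum(code) % 2 == 1

    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # one flipped bit: repaired silently
    print(decode(word))
    word[2] ^= 1                 # a second flipped bit: only detected
    print(decode(word))
```

    The second case is the one that ends in a bluescreen or panic: the hardware knows the word is bad but has no way to tell which bits to fix.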

    Newer Windows, at least in 'Server' incarnations, might do something better; I'm going on information from the NT era. FreeBSD, on the other hand, can happily accrue years of uptime on the cheapest systems, and I've never actually seen an ECC-related panic on machines that (to my knowledge) have had it enabled. Most of today's tower cases and mainboards put the DIMMs perpendicular to the sky, possibly with a few thicknesses of CD-ROM drives and power supplies above, so single-bit errors are probably reduced, while the occasional really lucky particle might have a better chance of strafing through the die and causing multi-bit problems.

    The Suns appear to log every single-bit error corrected, while the x86 universe doesn't really have a mechanism for that (for the single bit case, either the chipset handles it, and you never know, or it doesn't, and you never know until you notice corruption somewhere). However, since Sun rolls their own, they might have some design issues that make ECC hits more common than just for cosmic rays. ;) (Seriously, modern memory controllers are complicated beasts, and current ones incorporate some fair black magic to ensure the signals get through reliably; the likes of Intel and Via may be ahead there, since Sun sometimes has a flair for the simple. Then there's the question of 'Taiwanese capacitor syndrome,' something to worry about with almost any hardware manufactured from the late '90s through two years ago, which will often trash mainboards' onboard power regulation and cause all sorts of "flaking out.")

    memtest86 is Free and indispensable if you want to check your DIMMs (and mainboard, etc) for reliability. (Of course, sans-ECC, you might have the bizarre luck of seeing a cosmic-ray-induced error while you test; recurring errors at a particular address would point to a definite hardware problem.) It's also a great tool if you want to try more aggressive CAS timings or similar than those in your DIMMs' autoconfiguration ROMs. (Note that stability there can also depend on the memory controller; I could never run my RAM at CAS2 on an old Socket 7 board, but when I upgraded to an Athlon with a more modern chipset, the same stick tested flawlessly at the slightly higher speed.)
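
    The "recurring address" rule of thumb can be written down as a small Python sketch. It assumes you have noted by hand which addresses failed on each test pass; the input format, the `classify_failures` name and the two-pass threshold are all hypothetical and are not anything memtest86 itself produces or exposes.

```python
from collections import Counter


def classify_failures(passes, min_repeats=2):
    """Split failing addresses into recurring ones and one-off ones.

    passes: one list of failing addresses per test pass (hand-recorded).
    An address counts once per pass, however often it failed in that pass.
    """
    counts = Counter(addr for failing in passes for addr in set(failing))
    recurring = sorted(a for a, n in counts.items() if n >= min_repeats)
    one_off = sorted(a for a, n in counts.items() if n < min_repeats)
    return recurring, one_off


if __name__ == "__main__":
    # Three passes: 0x1F40 fails every time, 0x8000 fails only once.
    recurring, one_off = classify_failures([[0x1F40], [0x1F40, 0x8000], [0x1F40]])
    print([hex(a) for a in recurring], [hex(a) for a in one_off])
```

    An address that keeps turning up suggests a bad cell, slot or timing setting; a single stray hit that never repeats is more consistent with a transient event.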

    This thread elsewhere, and this one, elaborate a little on ECC. One likely example of the failing capacitor syndrome can be found here, on his 8 August entry; note how the originally tightly-regulated "Vc" or "Vcore" voltage has gone to hell. (Changing the power supply wouldn't fix that, and indeed, I think it didn't, though it's been months since I informed *him* of the problem. Anyhow, monitoring the onboard power regulation is a nice way to test.)
     
