That shouldn’t make sense. It’s not like the ECC info is stored in additional bits separate from the data, it’s built in with the data so you can’t “ignore” it. Hmm, off to read the paper.
That shouldn’t make sense. It’s not like the ECC info is stored in additional bits separate from the data, it’s built in with the data so you can’t “ignore” it. Hmm, off to read the paper.
The ECC information is stored in separate DRAM devices on the DIMM. This is responsible for some of the increased cost of DIMMs with ECC at a given size. When marketed the extra memory for ECC are typically not included in the size for DIMMs so a 32GB DIMM with and without ECC will have differing numbers of total DRAM devices.
There's a pretty good set of diagrams and descriptions of the faults in this paper https://dl.acm.org/doi/10.1145/3725843.3756089.
Also to the parent: there's an updated public paper on DDR4 era fault observations https://ieeexplore.ieee.org/document/10071066
I think you responded to the wrong person, unless you think I was implying that the extra bits needed for ECC didn’t need extra space at all? I wasn’t suggesting that - just that they aren’t like a checksum that is stored elsewhere or something that can be ignored - the whole 72 bits are needed to decode the 64 bits of data and the 64 bits of data cannot be read independently.
If we're talking about standard server RDIMMs with ECC (or the prosumer stuff) the CPU visible ECC (excluding DDR5's on-die ECC) is typically implemented as a sideband value you could ignore if you disabled the correction logic.
I suppose what winds up where is up to the memory controller but (for DDR5) in each BL16 transaction beat you're usually getting 32 bits of data value and 8 bits of ECC (per sub channel). Those ECC bits are usually called check bits CB[7:0] and they accompany the data bits DQ[31:0] .
If you're talking about transactions for LPDDR things are a bit different there, though as that has to be transmitted inband with your data
We are talking about errors happening in user space applications with ECC operating normally and what the application ultimately sees.
My point is that when writing an app you wouldn’t be able to “not use” ECC accidentally or easily if it’s there. It’s just seamless. I’m not talking about special test modes or accessing stuff differently on purpose.
Interesting that DDR5 is different than DDR4. 8 bits for 32 is doubling of 8 for 64 so it must have been warranted.
I fully agree with you ! Neither soft nor hard memory errors, nothing… but but flips ,and reproducible at that.
We scanned all our machines following this ( a few thousand servers ) and found out that ram issues were actually quite common, as said in the paper.