Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especially avoiding them via reverse engineering the channel layout! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.

The hedging technique is a cool demo too, but I’m not sure it’s practical.

At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.

I understand the premise is “data larger than cache” given the clflush, but even then you’re spending 2x the memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.

HFT especially is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It's just far better to work with what you can fit in cache, and to shrink what doesn't fit as much as possible.

Another point about HFT: they're mostly using FPGAs (some use custom silicon), which means they have much tighter control over how DRAM is accessed and how the memory controller is configured. They could implement this in hardware if they really needed to, but it wouldn't be at the OS level.

> At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.

That’s my main hang-up as well. On one hand this is undeniably cool work, but on the other, efficient cache usage is how you maximize throughput.

This optimizes for (narrow) tail latency, but I do wonder at what performance cost. I would be super interested in hearing about real world use cases.

This might be useful in a case where a small lookup table or similar is often pushed out of cache, such that lookups are usually cold, yet the data is small enough not to cause issues with cache pollution, increased bandwidth, or memory consumption.

In this case it’s better to asynchronously bring the data into the cache, which you can do with a prefetch shortly before the read.

Perhaps. Then again, if your target is to reduce DRAM-refresh-induced latency, you might not have time to prefetch either.

It could be massively improved with a special CPU instruction for racing dram reads. That might make it actually useful for real applications. As it is, the threading model she used here would make it incredibly difficult to use this in a real program.

There’s no point racing DRAM reads explicitly. Refreshes are infrequent and the penalty is like 5x on an already fast operation, 1% of the time.

What’s better is to “race” against cache, which is 100x faster than DRAM. CPUs already do this for independent loads via out-of-order execution. While one load is stalled waiting for DRAM, another can hit the cache and compute can proceed in parallel. It’s all already handled at the microarchitectural level.

There are already systems that do this in hardware. Any system with memory-mirroring RAS features can do this, notably IBM zEnterprise hardware. You know, the company this video's presenter claims to be one-upping.

I don't think the memory mirroring features available today allow you to race two DRAM accesses and use whichever result returns first?

The memory controller sends the read to the DIMM that is not refreshing. It is invisible to software, except for the side-effect of having better performance.

Mirroring is more of a reliability feature though, no? From my understanding it’s like RAID, where you keep multiple copies plus parity so uncorrectable errors aren’t catastrophic. Makes sense for mainframes, which need to survive hardware failures.

Refresh avoidance is a tangential thing the memory controller happens to be able to do in a scheme like that, but you’d really have to be looking at it in a vacuum to bill it as a benefit.

Like I said, it’s all about cache. You’re not going to DRAM if you actually care about performance fluctuations at the scale of refresh stalls.

Clearly, hitting a cache would be the better outcome. The technique suggested here could only apply to unavoidably cold reads, some kind of table that's massive and randomly accessed. Assume it exists, for whatever reason. To answer your question, refresh avoidance is an advertised benefit of hardware mirroring. Current IBM techno-advertising that you can Google yourself says this:

"IBM z17 implements an enhanced redundant array of independent memory (RAIM) design with the following features: ... Staggered memory refresh: Uses RAIM to mask memory refresh latency."

I can google, thanks. My point is that nobody is buying mainframes with redundant memory to avoid refresh stalls. It’s a mostly irrelevant freebie on hardware you bought for fault tolerance.

> clear spikes from 70ns to 330ns

Isn't that rather trivial as a source of tail latency, though? There are much worse spikes coming from other sources, e.g. power-management state transitions within the CPU and possibly other hardware. At the end of the day, this is why simple microcontrollers are still preferred for hard-RT workloads. This work doesn't change that in any way.

Yeah exactly, and it’s absolutely dwarfed by the tail latency of going to DRAM in the first place. A cache miss is a 100x tail event vs. an L1 hit. The refresh stall is a further 5x on top of that, which barely registers if you’re already eating the DRAM cost.

On most RAM the refresh interval (tREFI) can be increased a lot from the default, at least if the DIMMs are kept somewhat cool.

It is not only impractical, it is a completely useless technique. I got downvoted to negative infinity for mentioning this, but I guess I am the only person who actually read the benchmark. The reason the technique "works" in the benchmark is that all the threads run free and just record their timestamps; the winner is decided post hoc.

That is utterly pointless for real systems. In a real system you need to decide the winner online, which means the winner has to signal somehow that it has won and the side effects of the losers have to be suppressed. That is a multi-core coordination problem that wipes out most of the tail improvement and, more importantly, also massively worsens the median latency.

Man. You really don't get it, do you?

You got downvoted for being an asshole, and if you continue to be an asshole on HN we are going to ban you. I suppose you don't believe this because we haven't done it yet even after countless warnings:

https://news.ycombinator.com/item?id=43850950 (April 2025)

https://news.ycombinator.com/item?id=43847946 (April 2025)

https://news.ycombinator.com/item?id=42096833 (Nov 2024)

https://news.ycombinator.com/item?id=37275963 (Aug 2023)

https://news.ycombinator.com/item?id=35746140 (April 2023)

https://news.ycombinator.com/item?id=34537078 (Jan 2023)

https://news.ycombinator.com/item?id=33914274 (Dec 2022)

https://news.ycombinator.com/item?id=33311881 (Oct 2022)

https://news.ycombinator.com/item?id=30890360 (April 2022)

https://news.ycombinator.com/item?id=26628758 (March 2021)

https://news.ycombinator.com/item?id=26307811 (March 2021)

https://news.ycombinator.com/item?id=25561372 (Dec 2020)

https://news.ycombinator.com/item?id=24724281 (Oct 2020)

https://news.ycombinator.com/item?id=24458954 (Sept 2020)

https://news.ycombinator.com/item?id=24380545 (Sept 2020)

https://news.ycombinator.com/item?id=23170477 (May 2020)

The reason we haven't banned you yet is because you obviously know a lot of things that are of interest to the community. That's good. But the damage you cause here by routinely poisoning the threads exceeds the goodness that you add by sharing information. This is not going to last, so if you want not to be banned on HN, please fix it.

https://news.ycombinator.com/newsguidelines.html