Hacker News

On some workloads, swapping is a bad idea.

The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no memory locality in time. There is literally anti-locality: when you read a set of weights, you know you will not need them again until you have processed everything else.

This means that many of the old approaches don't work, because time locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast RAM.
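To make the anti-locality point concrete, here is a minimal sketch that simulates an LRU cache over repeated in-order scans. The block counts and the `lru_hit_rate` helper are made up for illustration; real weight streaming is of course not block-for-block identical to this toy.

```python
from collections import OrderedDict

def lru_hit_rate(num_blocks, cache_blocks, passes):
    """Hit rate of an LRU cache under repeated in-order scans.

    num_blocks:   weight blocks read per pass (hypothetical unit)
    cache_blocks: how many blocks fit in fast memory (assumed >= 1)
    passes:       number of full scans, e.g. one per generated token
    """
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for block in range(num_blocks):
            accesses += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)       # mark as most recently used
            else:
                if len(cache) >= cache_blocks:
                    cache.popitem(last=False)  # evict least recently used
                cache[block] = True
    return hits / accesses

# Fast memory holds 99% of the weights: every access still misses,
# because the block about to be needed is always the one just evicted.
print(lru_hit_rate(num_blocks=100, cache_blocks=99, passes=10))

# Fast memory holds everything: only the first pass misses.
print(lru_hit_rate(num_blocks=100, cache_blocks=100, passes=10))
```

The cliff between "fits" and "almost fits" is the reason a recency-based cache buys nothing here. Pinning a fixed subset in fast memory would at least get hits on that subset.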

In the long term, compute is probably going to move towards the memory.

zozbot234 11 hours ago

The main blocker with swapping is not even the limited bandwidth, it's actually the extreme write workload on data elements such as the per-layer model activations - and, to a much lesser extent, the KV-cache. In contrast, there are elements such as inactive experts for highly sparse MoE models, where swapping makes sense since any given expert will probably be unused. You're better off using that VRAM/RAM for something else. So the logic of "reserve VRAM for the highest-value uses, use system RAM as a second tier, finally use storage as a last resort or for read-only data" is still quite valid.
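As a rough sketch of that tiering logic: greedily place the most frequently touched data in the fastest tier that still has room, and never put write-heavy data on storage. Every name, size and probability below is hypothetical, and `place` is an illustrative helper, not how any particular runtime does it.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_gb: float
    use_prob: float   # assumed chance of being touched per token
    read_only: bool   # weights are read-only; activations and KV-cache are not

def place(tensors, tiers):
    """Assign each tensor to a tier.

    tiers: list of (tier_name, capacity_gb), fastest first; the last
    entry is treated as storage and only accepts read-only data.
    """
    free = dict(tiers)
    order = [name for name, _ in tiers]
    placement = {}
    # Highest-value (most frequently used) data gets first pick.
    for t in sorted(tensors, key=lambda t: t.use_prob, reverse=True):
        for tier in order:
            if tier == order[-1] and not t.read_only:
                continue  # keep write-heavy data off storage
            if free[tier] >= t.size_gb:
                free[tier] -= t.size_gb
                placement[t.name] = tier
                break
        else:
            raise ValueError(f"no tier can hold {t.name}")
    return placement

# Hypothetical model footprint and machine.
tensors = [
    Tensor("activations", 2, 1.0, read_only=False),
    Tensor("kv_cache", 6, 1.0, read_only=False),
    Tensor("shared_weights", 10, 1.0, read_only=True),
    Tensor("hot_experts", 8, 0.5, read_only=True),
    Tensor("cold_experts", 60, 0.05, read_only=True),
]
placement = place(tensors, [("vram", 24), ("ram", 64), ("ssd", 1000)])
print(placement)
```

With these made-up numbers the activations, KV-cache and shared weights stay in VRAM, the hot experts spill to system RAM, and the cold experts end up on storage.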

rnrn 4 hours ago

How do you get the weights for the right set of experts for a given batch of tokens into fast memory at the right time?

The set of activated experts is only known after routing, at which point you need the weights immediately and will have very poor performance if they are across PCIe.

zozbot234 4 hours ago

Once your model is large enough you'll have to eat the offload cost for something, and it might as well be something where most of that VRAM footprint isn't even used. For current models, inactive experts arguably fit that description best. Of course, it may be the case that shifting that part of the graph to CPU compute is a better deal than paying the CPU-to-GPU cost for the active weights and computing on GPU; that's how llama.cpp does it.
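A back-of-the-envelope sketch of that trade-off, assuming the work is purely memory-bandwidth-bound (small-batch decoding) and ignoring the cost of moving activations and of the arithmetic itself. All bandwidths and sizes are illustrative assumptions: roughly 32 GB/s is the theoretical peak of a PCIe 4.0 x16 link, and the RAM and VRAM figures are round numbers.

```python
def gpu_path_ms(weight_bytes, pcie_gb_s, vram_gb_s):
    """Copy the active expert weights over PCIe, then read them once from VRAM."""
    copy_ms = weight_bytes / (pcie_gb_s * 1e9) * 1e3
    read_ms = weight_bytes / (vram_gb_s * 1e9) * 1e3
    return copy_ms + read_ms

def cpu_path_ms(weight_bytes, ram_gb_s):
    """Leave the weights in system RAM and read them once from the CPU."""
    return weight_bytes / (ram_gb_s * 1e9) * 1e3

# Hypothetical: 200 MB of routed expert weights needed for one step.
active_bytes = 200e6
gpu_ms = gpu_path_ms(active_bytes, pcie_gb_s=32.0, vram_gb_s=1000.0)
cpu_ms = cpu_path_ms(active_bytes, ram_gb_s=80.0)

print(f"move to GPU then compute: ~{gpu_ms:.2f} ms")
print(f"compute on CPU in place:  ~{cpu_ms:.2f} ms")
```

Under these assumptions the PCIe copy dominates the GPU path, so reading the weights where they already sit wins whenever system RAM bandwidth exceeds the link bandwidth. With large batches the work becomes compute-bound and the comparison can flip.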