Not true for unified systems. And on Strix Halo you need to dedicate a fixed amount of memory to the GPU, which is annoying.
You’re basically stating that swapping is also a bad idea. Taken further, any memory or storage is a bad idea, because there’s always L1 cache/SRAM, which is faster than the rest.
On some workloads, swapping is a bad idea.
The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no memory locality in time. There is literally anti-locality: when you read a set of weights, you know you will not need them again until you have processed everything else.
This means that many of the old approaches don't work, because temporal locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast RAM.
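To make the anti-locality point concrete, here's a toy simulation (illustrative only, not any real inference engine): a strict cyclic read of weights defeats an LRU cache completely whenever the cache is even slightly smaller than the working set, because each item is evicted just before it would be re-read.

```python
from collections import OrderedDict

def lru_hits(accesses, capacity):
    """Count cache hits for an access trace under LRU eviction."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits

# 8 "layers" of weights, read strictly in order, repeated for 4 tokens
trace = [layer for _ in range(4) for layer in range(8)]

# A cache holding 6 of the 8 layers gets zero hits: by the time a
# layer is read again, LRU has always just evicted it.
print(lru_hits(trace, capacity=6))  # -> 0
```

Only a cache big enough to hold everything helps, which is exactly the "very large pool of very fast RAM" conclusion.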
In the long term, compute is probably going to move towards the memory.
The main blocker with swapping is not even the limited bandwidth; it's the extreme write workload on data such as the per-layer model activations and, to a much lesser extent, the KV-cache. In contrast, there are elements, such as inactive experts in highly sparse MoE models, where swapping makes sense, since any given expert will probably go unused and you're better off spending that VRAM/RAM on something else. So the logic of "reserve VRAM for the highest-value uses, use system RAM as a second tier, and fall back to storage as a last resort or for read-only data" is still quite valid.
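That tiering logic can be sketched as a toy placement policy (the tensor names and attributes here are illustrative, not from any real framework):

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    write_heavy: bool  # rewritten every forward pass?
    hot: bool          # read on (almost) every token?

def tier(t: Tensor) -> str:
    """Toy placement policy for the tiering logic above."""
    if t.write_heavy:
        return "vram"      # activations, KV-cache: too costly to swap
    if t.hot:
        return "vram/ram"  # dense weights: keep in fast memory
    return "storage"       # cold, read-only data, e.g. inactive experts

tensors = [
    Tensor("activations", write_heavy=True,  hot=True),
    Tensor("kv_cache",    write_heavy=True,  hot=True),
    Tensor("dense_ffn",   write_heavy=False, hot=True),
    Tensor("rare_expert", write_heavy=False, hot=False),
]
print({t.name: tier(t) for t in tensors})
```

The point is just that the tier is chosen per data element by its access pattern, not globally for the whole model.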
How do you get the weights for the right set of experts for a given batch of tokens into fast memory at the right time?
The set of activated experts is only known after routing, at which point you need the weights immediately, and performance will be very poor if they have to come across PCIe.
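A minimal top-1 MoE forward pass (a sketch with made-up shapes, not any real model's code) shows why: the expert index simply doesn't exist until the router has run, and the very next operation needs that expert's weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, router_w, experts):
    """Top-1 MoE forward pass: expert choice only exists after routing."""
    logits = x @ router_w            # (tokens, n_experts)
    chosen = logits.argmax(axis=-1)  # per-token expert id, known only now
    out = np.empty_like(x)
    for e in np.unique(chosen):
        mask = chosen == e
        # experts[e] must already be in fast memory at this point;
        # fetching it over PCIe here stalls every token routed to it.
        out[mask] = x[mask] @ experts[e]
    return out

d, n_experts, tokens = 4, 2, 3
x = rng.standard_normal((tokens, d))
router_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
print(moe_layer(x, router_w, experts).shape)  # (3, 4)
```

There is no slack between "routing decision" and "weights needed", which is what rules out prefetching the experts ahead of time.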
Once your model is large enough you'll have to eat the offload cost for something, and it might as well be something where most of that VRAM footprint isn't even used. For current models, inactive experts arguably fit that description best. Of course, it may be the case that shifting that part of the graph to CPU compute is a better deal than paying the CPU-to-GPU cost for the active weights and computing on GPU; that's how llama.cpp does it.
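A back-of-envelope comparison makes the trade-off visible. All numbers here are rough assumptions for single-token decode (which is memory-bandwidth bound, so time is roughly bytes moved divided by bandwidth):

```python
# Hypothetical figures: one fp16 expert of ~100 MB, practical PCIe 4.0
# x16 throughput ~25 GB/s, practical dual-channel DDR5 ~50 GB/s.
expert_mb = 100
pcie_gbps = 25
ram_gbps  = 50

# Option A: copy the active expert to the GPU, then compute there
# (the GPU compute itself is negligible at batch size 1).
copy_to_gpu_ms = expert_mb / pcie_gbps

# Option B: leave the weights in RAM and stream them through the CPU.
cpu_stream_ms = expert_mb / ram_gbps

print(f"PCIe copy: {copy_to_gpu_ms:.1f} ms, CPU compute: {cpu_stream_ms:.1f} ms")
```

Under these assumptions, streaming the weights through the CPU beats paying the PCIe transfer just to compute on the GPU, which is consistent with llama.cpp's choice to run offloaded layers on the CPU rather than shuttle them over the bus.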
> You’re basically stating that swapping is also a bad idea.
Is that a crazy thing to say? I can't recall the last time I was grateful for swap; it might've been before 2010.
Try turning swap off and really find out if you’re not grateful for it. Might be fine if you’re never using all your RAM, but if you are, swap off isn’t fun and you might realize you’ve been unconsciously grateful this whole time. ;) Swap might be important for GPU usage even when not using something like greenboost, since display GPUs sometimes use system RAM to back the GPU VRAM.
> Try turning swap off and really find out if you’re not grateful
Er, I did exactly this over a decade ago and never looked back. It's literally one of the first things I do on a new machine.
> Might be fine if you’re never using all your RAM
That's definitely happened occasionally, and no, swap almost always just makes it worse. The thrashing makes the entire machine unusable instead of merely making the allocating app(s) unstable. I've recovered most times by immediately killing the app I'm using. And in fact I have warnings that often fire early enough, before I hit the limit, to avoid such issues in the first place.
If you've ever used unreserved virtual memory, you're grateful for swapping. Somewhat indirectly, but still.
Strix Halo’s unified setup is pretty cool. On systems with 128GB of memory, set the dedicated GPU memory in the BIOS to the smallest permitted value, and the drivers will use the whole main-memory pool appropriately on Linux and Windows.
Does this work on the open source amdgpu drivers?
I've been a bit too busy to turn mine on for a while.
The OSS AMDGPU drivers by default allocate a fixed percentage of system RAM for GTT (up to 75%), they do not automatically use the entire system memory. You can override this with the kernel options I posted in my original comment, but as I mentioned, there are some serious negative consequences. You also may need to disable IOMMU or use PT mode. Personally I have had a lot of crashes as a result of this stuff, so I went back to the defaults and just don't run big models.
The biggest factor in whether AMD GPUs on Linux are a PITA or not is ROCm. Strix Halo is supported in ROCm 6.x, so it should work on most platforms (I haven't tested it tho). ROCm 7.x is supposed to be better, but not all apps support it yet.
AMD, if you're reading this, please hire more SWEs. Nvidia will continue to dominate until you beat them at software.
I’ve had no issues running GPT-OSS 120b with decent performance on the machine (HP Zbook Ultra G1a). Running on Bluefin/Universal Blue and Windows.
It's not true for unified systems, because they have no secondary RAM that could be used to extend the GPU memory.
It's pretty weird to insist on a counterargument that has no implications or consequences to the presented argument.
Yes, swapping is a bad idea.
Your second argument also falls flat, because the standard CUDA hardware setup doesn't use CXL, so cache coherence isn't available; you're left with manual memory synchronization. Pretending that GPUs have a cache for system RAM when they don't is pretty suspect.