You can already do this with some GPU drivers:

  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdttm.pages_limit=5242880 ttm.pages_limit=5242880"
One downside is your kernel isn't going to reserve that memory away from userland. You will still see all the memory at the system level as "free". As the GPU driver starts using it, other apps and the OS will try to use that "free" memory, not knowing how much of it is in use (it may show up as "cache", or not at all). Then the OOM killer starts firing or programs start crashing, and at some point the OS tips over or the GPU driver crashes. You can add loads of swap as a compromise and it works okay, if a bit slow.

In any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request. Just pay a cloud provider $0.01 to do it in 10 seconds.
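For scale, the arithmetic is straightforward (a day is ~86,400 seconds, so 86k tokens at 1 t/s is almost exactly a day):

```python
tokens = 86_000
for tps in (1, 5):
    hours = tokens / tps / 3600
    print(f"{tps} t/s -> {hours:.1f} h")  # 1 t/s -> 23.9 h, 5 t/s -> 4.8 h
```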

The point is not how fast it is now. The point is that this opens new possibilities that can be built on. Potentially models that are trained with slightly different architectures to optimize for this use case. Possibly others come along to improve this path. Possibly HW manufacturers make a few small adjustments that remove bottlenecks. Who knows, the next person may combine CPU compute with this mem sharing to get another token a second. Then the next person does predictive loading into memory to keep that bandwidth 100% maxed and usable. Then the next person does, and the next. Before you know it there is a real thing there that never existed.

This is a great project. I love the possibilities it hints at. Thanks for building it!

It’s architecturally not a good approach. System RAM is much slower so you should put data that doesn’t need to be used often on it. That knowledge is at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.

The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.

Not true for unified systems. And for Strix Halo you need to dedicate the amount up front, which is annoying.

You’re basically stating that swapping is also a bad idea. And to take it further, any memory or storage is a bad idea, because there’s L1 cache/SRAM which is faster than the rest.

On some workloads, swapping is a bad idea.

The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no memory locality in time. There is literally anti-locality: when you read a set of weights, you know you will not need them again until you have processed everything else.

This means that many of the old approaches don't work, because time locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast ram.
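You can see this with a toy LRU simulation: a cyclic read of weights that don't quite fit gets a 0% hit rate, because LRU always evicts exactly the block that will be needed soonest (a sketch, not any real cache implementation):

```python
from collections import OrderedDict

def lru_hits(capacity, accesses):
    """Count cache hits for an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = None
    return hits

# 8 "layers" read in order, 4 times over, with room for only 6 of them:
accesses = list(range(8)) * 4
print(lru_hits(6, accesses))  # 0 -- every access evicts exactly what's needed next
print(lru_hits(8, accesses))  # 24 -- caching only helps once everything fits
```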

In the long term, compute is probably going to move towards the memory.

The main blocker with swapping is not even the limited bandwidth, it's actually the extreme write workload on data elements such as the per-layer model activations - and, to a much lesser extent, the KV-cache. In contrast, there are elements such as inactive experts for highly sparse MoE models, where swapping makes sense since any given expert will probably be unused. You're better off using that VRAM/RAM for something else. So the logic of "reserve VRAM for the highest-value uses, use system RAM as a second tier, finally use storage as a last resort or for read-only data" is still quite valid.
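That tiering logic can be sketched as a simple heuristic. The function name and thresholds below are hypothetical, purely to illustrate the "VRAM first, system RAM second, storage for cold read-only data" ordering:

```python
def choose_tier(write_heavy, reuse_prob, size_gb, vram_free_gb):
    """Pick a memory tier for one tensor. Purely illustrative heuristic."""
    if write_heavy:
        # activations / KV cache: rewritten constantly, keep in VRAM
        return "vram"
    if reuse_prob > 0.5 and size_gb <= vram_free_gb:
        # hot read-only weights that fit: also VRAM
        return "vram"
    if reuse_prob > 0.1:
        # warm data: second tier, paid for over PCIe
        return "system_ram"
    # cold read-only data, e.g. rarely-activated MoE experts
    return "storage"

print(choose_tier(True, 1.0, 2, 8))     # vram (activations)
print(choose_tier(False, 0.3, 40, 8))   # system_ram (warm, doesn't fit)
print(choose_tier(False, 0.05, 40, 8))  # storage (inactive experts)
```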

How do you get the weights for the right set of experts for a given batch of tokens into fast memory at the right time?

The set of activated experts is only known after routing, at which point you need the weights immediately and will see very poor performance if they have to come across PCIe.

Once your model is large enough you'll have to eat the offload cost for something, and it might as well be something where most of that VRAM footprint isn't even used. For current models, inactive experts arguably fit that description best. Of course, it may be the case that shifting that part of the graph to CPU compute is a better deal than paying the CPU-to-GPU cost for the active weights and computing on GPU; that's how llama.cpp does it.

> You’re basically stating that swapping is also a bad idea.

Is that a crazy thing to say? I can't recall the last time I was grateful for swap; it might've been before 2010.

Try turning swap off and really find out if you’re not grateful for it. Might be fine if you’re never using all your RAM, but if you are, swap off isn’t fun and you might realize you’ve been unconsciously grateful this whole time. ;) Swap might be important for GPU usage even when not using something like greenboost, since display GPUs sometimes use system RAM to back the GPU VRAM.

> Try turning swap off and really find out if you’re not grateful

Er, I did exactly this over a decade ago and never looked back. It's literally one of the first things I do on a new machine.

> Might be fine if you’re never using all your RAM

That's definitely happened occasionally, and no, swap almost always just makes it worse. The thrashing makes the entire machine unusable instead of making the allocating app(s) potentially unstable. I've recovered most times by just immediately killing the app I'm using. And in fact I have warnings that sometimes tell me fast enough before I reach the limit to avoid such issues in the first place.

If you've used any unreserved VM ever you're grateful for swapping.

Somewhat indirectly but still.

Strix Halo’s unified setup is pretty cool. On systems with 128GB of memory, set the dedicated GPU memory in BIOS to the smallest permitted value, and the drivers will use the whole main memory pool appropriately in both Linux and Windows.

Does this work on the open source amdgpu drivers ?

I've been a bit too busy to turn mine on for a while.

The OSS AMDGPU drivers by default allocate a fixed percentage of system RAM for GTT (up to 75%), they do not automatically use the entire system memory. You can override this with the kernel options I posted in my original comment, but as I mentioned, there are some serious negative consequences. You also may need to disable IOMMU or use PT mode. Personally I have had a lot of crashes as a result of this stuff, so I went back to the defaults and just don't run big models.

The biggest factor of whether AMD GPUs on Linux are a PITA or not is ROCm. Strix Halo is supported in ROCm 6.x, so it should be supported on most platforms (I haven't tested it tho). ROCm 7.x is supposed to be better but not all apps support it yet.

AMD, if you're reading this, please hire more SWEs. Nvidia will continue to dominate until you beat them at software.

I’ve had no issues running GPT-OSS 120b with decent performance on the machine (HP Zbook Ultra G1a). Running on Bluefin/Universal Blue and Windows.

It's not true for unified systems, because they have no secondary RAM that could be used to extend the GPU memory.

It's pretty weird to insist on a counterargument that has no implications or consequences to the presented argument.

Yes, swapping is a bad idea.

Your second argument also falls flat, because the standard CUDA hardware setup doesn't use CXL so cache coherence isn't available. You're left with manual memory synchronization. Pretending that GPUs have cache for system RAM when they don't is pretty suspect.

> It’s architecturally not a good approach.

Yes, with current LLMs and current hardware and current supporting software this is a true statement. My point wasn't that this approach suddenly changes that, it was that it makes it easier to explore alternatives that might change that. Let's imagine some possibilities:

- Models that use a lot of weight reuse: If you strategically reuse layers 3-4x that could give a lot of time for async loading of future weights.

- Models that select experts for several layers at a time: Same thing, while crunching on the current layer you have teed-up future layers that can be transferring in

- HW makers start improving memory bandwidth: This is already happening, right? AMD and Apple are pushing unified memory architectures with much higher bandwidth, but still not quite there compared to GPUs. This could lead to a hybrid approach that makes those machines much more competitive. Similarly, HW makers could bring back technologies that died on the vine, things like Intel's Optane come to mind. Start making mass storage as fast as system memory is now and the equation may change.

These are quick dart throws that probably have obvious holes in them, but the point is that platforms like this help us explore paths that appeared dead-end until the one change that makes them viable and lets them take over. It may not happen. It may be a dead end. But by that logic we would never go out on a limb and try something new. We need people and tech that challenge assumptions and make it easy to try out ideas, to keep the tech ecosystem evolving. This does that. Even if this particular project doesn't succeed, it is a great thing to do, if for no other reason than it likely just spurred a bunch of people to try their own crazy hacks for LLM inference. Maybe it even enabled a use case with GPUs that nobody realized existed and has nothing to do with LLMs.

Some people are not concerned with having it run the fastest, just having it run at all may be enough.

From my experience, accessing system RAM from the GPU is so slow, it might as well count as "does not work". It's orders of magnitude faster to memcpy large swaths of memory that you are going to use over to the GPU, rather than accessing system mem from a kernel which then takes ages waiting for that small block/page of memory, then waits again for the next small page/block, etc. Latency hiding doesn't work anymore if the latency is that large.

You’re right for some workloads, but not all of them. The same could have been said for disk swap since the beginning, though, and people still found it valuable. Disk swapping with spinning drives used to be multiple orders of magnitude slower than RAM. But it prevented applications or the system from crashing.

Using system memory from the GPU isn’t that bad if your compute is high enough and you don’t transfer that much data. There are commercial applications that support it and only see low 2-digit percentage perf impact and not the multiples you might expect. Plus on Windows on Nvidia hardware, the driver will automatically use system memory if you oversubscribe VRAM, and I believe this was introduced to support running Stable Diffusion on smaller GPUs.

But then you can use CPU/RAM offload, which already allows you to offload without a kernel module.

[dead]

With discrete GPUs, using system RAM is slow not due to mem bandwidth, but due to PCIe bandwidth, which is the bottleneck.

For example, 16x PCIe 4.0: 256 Gb/s, 16x PCIe 5.0: 512 Gb/s, while 2x DDR5-6400 DIMMs: 819 Gb/s. The actual throughput is lower for both PCIe and DDR5, due to communication overhead.
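Those figures fall straight out of the nominal per-lane and per-channel rates (a back-of-the-envelope check, ignoring encoding and protocol overhead):

```python
def pcie_gbit_s(gen, lanes):
    # nominal per-lane throughput in GB/s per direction
    per_lane_gb_s = {3: 1.0, 4: 2.0, 5: 4.0}[gen]
    return per_lane_gb_s * lanes * 8           # bytes -> bits

def ddr5_gbit_s(mt_per_s, channels):
    # 64-bit (8-byte) bus per channel
    return mt_per_s * 8 * channels * 8 / 1000  # MB/s -> Gb/s

print(pcie_gbit_s(4, 16))    # 256.0 Gb/s
print(pcie_gbit_s(5, 16))    # 512.0 Gb/s
print(ddr5_gbit_s(6400, 2))  # 819.2 Gb/s
```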

On server/workstation motherboards which may have 4, 8 or 12 DIMMs instead of 2, the ratio between memory bandwidth and PCIe bandwidth becomes proportionally higher, so the memory throughput achievable by the GPU becomes a very small fraction of the system memory bandwidth.

The difference between DDR4 and 5 is quite substantial. I have a fully loaded Cascade Lake Mac Pro - 6 channels of DDR4-2933 gets me to about 120GB/s or 960Gb/s. PCIe 3.0 is a major Achilles heel of what would be a capable workstation system with modern nvidia GPUs precisely for the reason you document.

Maybe then this is a forward thinking feature for when we (maybe) get improved GPU hardware slots?

edit: Are you sure PCI-E is even that fast? Looking at the chart on Wikipedia (did not research further - so grain of salt here) shows much lower throughput

> slow not due to mem bandwidth, but due to PCIe bandwidth, which is the bottleneck.

> On server/workstation motherboards ... the memory throughput [to system RAM] achievable by the GPU becomes a very small fraction of the system memory bandwidth.

Yes, this is a critical point. It means that this is only realistically useful for prefill, which is compute- and not memory-bandwidth bound.

Sorry, I'm a bit of a noob on llm. What is "prefill"? As opposed to what?

Prefill - the model computes the KV cache over the input tokens, up to the last token in your input (the 'prompt'), at which point it can then begin -

Decode - the model chooses a new token to append to the end of the current token list (i.e. it generates a token), then computes the new token's KVs.

Decode is basically prefill 1 tok -> add 1 tok -> prefill 1 more tok -> ....

but in the initial prefill stage it doesn't need to do generation, since you've provided the toks.

And incidentally, prefill is also how caching, say, a system prompt saves you some $ on API usage with LLM providers. They only compute the KV cache for the new tokens after the system prompt.
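The prefill/decode split described above can be sketched as a toy loop (`attend` here is a deterministic stand-in for real attention, nothing more):

```python
def attend(kv_cache, token):
    # stand-in for attention over the whole cache; toy deterministic output
    return (token + len(kv_cache)) % 50

def prefill(prompt):
    # compute and store K/V for every prompt token in one pass; no generation
    return [("kv", tok) for tok in prompt]

def decode(kv, last_tok, n):
    out = []
    for _ in range(n):
        nxt = attend(kv, last_tok)  # needs the entire cache so far
        kv.append(("kv", nxt))      # then caches the new token's K/V
        out.append(nxt)
        last_tok = nxt
    return out

prompt = [3, 1, 4, 1, 5]
kv = prefill(prompt)                   # parallel over the prompt: compute-bound
generated = decode(kv, prompt[-1], 3)  # one token at a time: bandwidth-bound
assert len(kv) == len(prompt) + len(generated)
```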

> any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request

So don't use it for large requests. Ideal for when you just want to categorise things, for example, "does this task need a shell" or "bucket this email into one of help request, bill due or personal comms".

The best use is actually for a layer that "almost fits" into VRAM, such that automated offloading to system RAM will be rare enough that it doesn't impact performance.

As in, when your secondary memory is fast enough, after the first 10% of the model is processed you can swap its memory for the 50-60% part, and when that is done swap back so the 0-10% is ready in time for the next iteration?

Sounds ambitious, for the small improvement in effective capacity. In particular when I start wondering if real life speed differences would be small enough for that 10% increase, or if it would be even smaller. And that's before factoring in power/cooling cost for saturating another interface.

12 channel ddr5 5600 ECC is around 500gbs which in real world works very well for large MoE

You mean 500 GB/s, not Gb/s (actually 537 GB/s).

Unfortunately that does not matter. Even on a cheap desktop motherboard the memory bandwidth is higher than that of 16-lane PCIe 5.0.

Therefore the memory bandwidth available to a discrete GPU is determined by its PCIe slot, not by the system memory.

If you install multiple GPUs, in many MBs that will halve the bandwidth of the PCIe slots, for an even lower memory throughput.

Talking about dual socket SP5 EPYC with 24 DIMM slots, 128 PCIe 5.0 lanes

It’s fast for hybrid inference, if you get the KV and MoE layers tuned between the Blackwell card(s) and offloading.

We have a prototype unit and it’s very fast with large MoEs

> in many MBs that will halve the bandwidth of the PCIe slots

Not on boards that have 12 channels of DDR5.

But yeah, squeezing an LLM from RAM through the PCIe bus is silly. I would expect it would be faster to just run a portion of the model on the CPU in llama.cpp fashion.

It is much faster, yeah. llama.cpp supports swapping between system memory and GPU, but it’s recommended that you don’t use that feature because it’s rarely the right call vs using the CPU to do inference on the model parts in system CPU memory.

Edit: the setting is "GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"... useful if you have unified memory, very slow if you do not.

Would MoE models work better with this approach?