12-channel DDR5-5600 ECC is around 500gbs, which in the real world works very well for large MoE models

You mean 500 GB/s, not Gb/s (the theoretical peak is actually 537.6 GB/s).
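The 537.6 GB/s figure falls out of the channel count and transfer rate directly; a quick sanity check:

```python
# Peak DDR5 bandwidth = channels * transfers/s * bytes per transfer.
# Each DDR5 channel has a 64-bit (8-byte) data bus, split into 2x32-bit subchannels.
channels = 12
transfers_per_s = 5600e6   # DDR5-5600: 5600 MT/s
bytes_per_transfer = 8

bw_gb_s = channels * transfers_per_s * bytes_per_transfer / 1e9
print(bw_gb_s)  # 537.6
```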

Unfortunately that does not matter. Even on a cheap desktop motherboard, the memory bandwidth is higher than that of a 16-lane PCIe 5.0 link.

Therefore the memory bandwidth available to a discrete GPU is determined by its PCIe slot, not by the system memory.

If you install multiple GPUs, many motherboards will bifurcate the PCIe slots, halving the lanes per slot for an even lower memory throughput.
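To put numbers on the PCIe ceiling: PCIe 5.0 runs 32 GT/s per lane with 128b/130b encoding, so a full x16 slot tops out around 63 GB/s per direction, and an x8 bifurcated slot around half that — well below even desktop dual-channel DDR5:

```python
# Usable PCIe 5.0 bandwidth per direction:
# 32 GT/s per lane, 128b/130b line encoding, 8 bits per byte.
def pcie5_gb_s(lanes):
    return lanes * 32e9 * (128 / 130) / 8 / 1e9

print(round(pcie5_gb_s(16), 1))  # 63.0  -> full x16 slot
print(round(pcie5_gb_s(8), 1))   # 31.5  -> same slot bifurcated to x8
```

(Protocol overhead shaves off a bit more in practice; these are upper bounds.)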

Talking about a dual-socket SP5 EPYC with 24 DIMM slots and 128 PCIe 5.0 lanes.

It’s fast for hybrid inference, if you tune how the KV cache and MoE layers are split between the Blackwell card(s) and CPU offload.

We have a prototype unit and it’s very fast with large MoEs
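For a rough sense of why this setup works for MoE: decode is memory-bound, so the ceiling on generation speed is bandwidth divided by the bytes of *active* parameters read per token. A sketch with hypothetical numbers (a DeepSeek-V3-class model with ~37B active parameters at 8-bit is an assumption, not a measurement from the prototype):

```python
# Memory-bound decode ceiling: tokens/s <= bandwidth / active bytes per token.
bandwidth_gb_s = 537.6    # 12-channel DDR5-5600 theoretical peak
active_params = 37e9      # active (routed) params per token -- assumed, MoE-style
bytes_per_param = 1       # Q8-style quantization -- assumed

tok_s = bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)
print(round(tok_s, 1))  # ~14.5 tokens/s upper bound from RAM alone
```

A dense model of the same total size would read every parameter per token, which is exactly why large MoEs are the case where CPU-side bandwidth pays off.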

> in many MBs that will halve the bandwidth of the PCIe slots

Not on boards that have 12 channels of DDR5.

But yeah, squeezing an LLM from RAM through the PCIe bus is silly. I would expect it to be faster to just run a portion of the model on the CPU, llama.cpp-fashion.

It is much faster, yeah. llama.cpp supports swapping tensors between system memory and the GPU, but it’s recommended that you don’t use that feature: it’s rarely the right call compared with just running inference on the CPU for the model parts that live in system memory.

Edit: the setting is "GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"... useful if you have unified memory, very slow if you do not.
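For the CPU-side approach instead, recent llama.cpp builds let you pin the MoE expert tensors to the CPU with `--override-tensor` while keeping attention and shared weights on the GPU. A sketch of a hybrid invocation (the model path and tensor-name regex are illustrative; check your GGUF's actual tensor names):

```shell
# Offload "all" layers to the GPU, then override the MoE expert weights
# (tensor names containing "exps") to stay in system RAM, computed on CPU.
./llama-cli -m ./model.gguf \
  -ngl 99 \
  --override-tensor "exps=CPU" \
  -p "Hello"
```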