> in many MBs that will halve the bandwidth of the PCIe slots
Not on boards that have 12 channels of DDR5.
But yeah, squeezing an LLM from RAM through the PCIe bus is silly. I would expect it to be faster to just run a portion of the model on the CPU, llama.cpp style.
It is much faster, yeah. llama.cpp supports swapping between system memory and GPU memory, but using that feature is not recommended: it’s rarely the right call compared with having the CPU run inference on the parts of the model that sit in system memory.
Edit: the setting is the environment variable "GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"... useful if you have unified memory, very slow if you do not.
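
For a rough sense of why streaming weights over PCIe loses to running those layers on the CPU, here is a back-of-envelope sketch. It assumes token generation is purely bandwidth-bound and that every weight is read once per token (a dense model). The 40 GB model size is a made-up example, the PCIe 4.0 x16 figure is the approximate per-direction peak, and the DDR5 figure is the theoretical peak for 12 channels of DDR5-4800. Real numbers will be lower on both sides.

```python
GB = 1e9

def peak_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if each token requires reading all weights once."""
    return bandwidth_bytes_per_s / weight_bytes

# Approximate peak bandwidth of a PCIe 4.0 x16 link, one direction.
PCIE4_X16 = 32 * GB

# 12 channels of DDR5-4800: 4800 MT/s * 8 bytes per transfer per channel.
DDR5_4800_12CH = 12 * 4800e6 * 8

# Hypothetical size of the weights that do not fit in VRAM.
MODEL_BYTES = 40 * GB

if __name__ == "__main__":
    over_pcie = peak_tokens_per_second(MODEL_BYTES, PCIE4_X16)
    on_cpu = peak_tokens_per_second(MODEL_BYTES, DDR5_4800_12CH)
    print(f"weights streamed over PCIe: {over_pcie:.1f} tok/s at best")
    print(f"weights read by the CPU:    {on_cpu:.1f} tok/s at best")
```

Even on paper the 12-channel board has over ten times the bandwidth of the PCIe link, so the CPU can get through the RAM-resident layers much faster than the GPU can pull them across the bus.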