Hacker News

It is much faster, yeah. llama.cpp supports swapping between system memory and GPU memory, but the recommendation is not to use that feature: it's rarely the right call compared with letting the CPU run inference on the parts of the model that sit in system memory.

Edit: the setting is the environment variable "GGML_CUDA_ENABLE_UNIFIED_MEMORY=1". Useful if you have unified memory, very slow if you do not.
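For anyone who wants to try both approaches, here is a rough sketch of how the two launches differ. It only builds the command line and environment and does not run anything. The binary name `llama-cli`, the model path `model.gguf`, and the layer counts are assumptions for illustration; `build_launch` is a hypothetical helper, not part of llama.cpp.

```python
import os


def build_launch(model_path, n_gpu_layers=None, unified_memory=False, base_env=None):
    """Return (argv, env) for a llama.cpp run. Nothing is executed here."""
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)

    # Assumed binary name; adjust to whatever your build produced.
    argv = ["llama-cli", "-m", model_path]

    if n_gpu_layers is not None:
        # -ngl sets how many layers are offloaded to the GPU;
        # the remaining layers run on the CPU from system memory.
        argv += ["-ngl", str(n_gpu_layers)]

    if unified_memory:
        # The setting from the comment above: lets CUDA spill into system RAM.
        env["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"

    return argv, env


# Option A: partial offload, CPU handles whatever does not fit in VRAM.
# The layer count 20 is a placeholder, not a recommendation.
split_argv, split_env = build_launch("model.gguf", n_gpu_layers=20, base_env={})

# Option B: offload everything and rely on unified memory for the overflow.
# The layer count 99 is a placeholder meaning "more than the model has".
um_argv, um_env = build_launch(
    "model.gguf", n_gpu_layers=99, unified_memory=True, base_env={}
)
```

To actually launch, you would pass the pair to something like `subprocess.run(argv, env=env)`, with `base_env` left unset so the process inherits your normal environment. Option A is the one to reach for on a discrete GPU; option B only makes sense when the memory really is unified.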