There was this great post the other day [1] showing that with llama-cpp you can offload specific tensors to the CPU and still keep good performance. That's a nice way to run large(ish) models on commodity hardware.
Normally with llama-cpp you specify how many (full) layers you want to put on the GPU (-ngl). But offloading to the CPU specific tensors that don't require heavy computation saves GPU space without hurting speed that much. A rough sketch of what that looks like is below.
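Something like this, assuming a recent llama.cpp build that has the --override-tensor (-ot) flag; the tensor-name regex is just an illustration and depends on the model's actual tensor names:

    # ask for all layers on the GPU, then force the big FFN expert
    # tensors (cheap to compute, expensive to store) back to the CPU
    ./llama-cli -m model.gguf -ngl 99 \
      -ot "ffn_.*_exps.*=CPU" \
      -p "Hello"

The idea is that attention tensors stay on the GPU where they matter most, while the bulky feed-forward weights sit in system RAM.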
I've also read a paper on loading only the "hot" neurons into the CPU [2]. The future of home AI looks so cool!
[1] https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_of...