Wait, a Q4 quantization that's more than 20GB fits on your 16GB GPU? I didn't know that was possible; I was always restricting myself to models smaller than the VRAM I had.
Yep. These Mixture of Experts models are well suited for paging in only the relevant data for a certain task https://huggingface.co/blog/moe
There are also experiments with simply removing or merging experts post-training to shrink models even further: https://bknyaz.github.io/blog/2026/moe/
MoE is not suited for paging because it's essentially a random expert per token. It only improves throughput because it reduces the memory bandwidth needed to generate a token: only 1/n of the weights are accessed per token (but a different 1/n on each pass).
Now, shrinking them, sure. But I've seen nothing that indicates you can just page weights in and out without cratering your performance, same as you would with a non-MoE model.
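To make the "different 1/n per token" point concrete, here's a minimal toy sketch of top-k expert routing (all names and weights are hypothetical, not any particular model's router):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16

# Toy router: a linear layer scoring each expert for a given token.
router_w = rng.standard_normal((d, n_experts))

def route(token_vec):
    """Return the indices of the top_k experts selected for this token."""
    scores = token_vec @ router_w
    return np.argsort(scores)[-top_k:]

# Each token activates its own subset of experts, so the expert weights
# read from memory generally differ from token to token.
tokens = rng.standard_normal((5, d))
for t in tokens:
    print(sorted(route(t).tolist()))
```

Whether those per-token subsets cluster enough to make paging viable is exactly what's being debated here.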
Not entirely true: it's random access within the relevant subset of experts, and since concepts are clustered, you actually have a much higher probability of repeatedly accessing the same subset of experts.
That blog post was super interesting. It's neat that he can select experts and control the routing in the model; not having played with the models in detail, I tended to assume the "mixing" in mixture of experts was more like a blender, haha. The models are still quite lumpy, I guess!
llama.cpp is designed for partial offloading: the most important parts of the model are loaded into GPU VRAM and the rest into system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM.
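For example (model filename and layer counts are illustrative; `-ngl` is llama.cpp's `--n-gpu-layers` flag, and recent builds also have `-ot`/`--override-tensor` for pinning tensors by name):

```shell
# Offload 20 transformer layers to the GPU; the rest stays in system RAM.
./llama-server -m model-Q4_K_M.gguf -ngl 20

# Or put all layers on the GPU but keep the large, sparsely-used expert
# FFN tensors in system RAM via a tensor-name regex.
./llama-server -m model-Q4_K_M.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU"
```

The second pattern is popular for MoE models precisely because the shared/attention weights are hit every token while each expert is only hit occasionally.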
The A3B part in the name stands for `Active 3B`: on each inference step only about 3B parameters are active, with the relevant subpart of the model selected per task (MoE, mixture of experts). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the full 35B params hot in RAM. These models are therefore also sometimes called sparse models.
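Some rough arithmetic (Q4 is ~4 bits, so ~0.5 bytes per parameter; the 35B total / 3B active figures are the ones from this thread) shows why the active count is what matters per token:

```python
BYTES_PER_PARAM_Q4 = 0.5          # Q4 quantization: ~4 bits per weight

total_params  = 35e9              # total parameter count (figure from the thread)
active_params = 3e9               # "A3B": ~3B parameters active per token

total_gb  = total_params  * BYTES_PER_PARAM_Q4 / 1e9
active_gb = active_params * BYTES_PER_PARAM_Q4 / 1e9

print(f"weights stored (disk/RAM): ~{total_gb:.1f} GB")   # ~17.5 GB
print(f"weights read per token:    ~{active_gb:.1f} GB")  # ~1.5 GB
```

So the full quantized model needs to live *somewhere* (VRAM + system RAM combined), but each generated token only touches a small slice of it.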
This is why they say "A3B" meaning only 3B is active at a time, limiting VRAM usage.