I'd suggest buying a better GPU, only because all the models you want need a 24GB card. Nvidia... more or less robbed you.
That said, try Unsloth's version of Qwen3 30B, running via llama.cpp (don't waste your time with any other inference engine), with the following arguments (documented in Unsloth's docs, but sometimes hard to find): `--threads (number of threads your CPU has) --ctx-size 16384 --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --seed 3407 --prio 3 --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20`, along with the other arguments you need (a full invocation is sketched below the link).
Qwen3 30B: https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF (since you have 16GB, grab Q3_K_XL: it fits in VRAM and leaves about 3-4GB for the other apps on your desktop and the other allocations llama.cpp needs to make).
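A minimal sketch of what the full command might look like, assuming the llama-server binary from a recent llama.cpp build and the Q3_K_XL file from the repo above (the exact filename and the `--threads` value are assumptions, adjust to what you actually downloaded and to your CPU):

```bash
# Sketch only: serve Qwen3 30B A3B on a 16GB card with the flags from above.
# Model filename and --threads are assumptions; substitute your own values.
./llama-server \
  -m ./Qwen3-30B-A3B-128K-UD-Q3_K_XL.gguf \
  --threads 16 \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --seed 3407 --prio 3 \
  --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20
```

The `-ot ".ffn_.*_exps.=CPU"` part is what makes this workable on a smaller card: it keeps the MoE expert tensors in system RAM, so the GPU only has to hold the dense layers and the KV cache.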
Also, why 30B and not the full-fat 235B? You don't have 120-240GB of VRAM. The 14B and smaller ones are also not what you want: more parameters are better, and parameter precision is vastly less important (which is why Unsloth has their specially crafted <=2-bit versions that are 85%+ as good, yet ridiculously tiny compared to their originals).
Full Qwen3 writeup here: https://unsloth.ai/blog/qwen3
> only because all the models you want need a 24GB card
???
Just run a q4 quant of the same model and it will fit no problem.
Q4_K_M is the "default" quant for a lot of models on HF, and at that size this model generally needs ~20GB of VRAM to run, so it will not fit entirely on a 16GB card. You want about 3-4GB of VRAM headroom on top of what the model itself requires.
A back-of-the-envelope estimate for unsloth/Qwen3-30B-A3B-128K-GGUF specifically is 18.6GB for Q4_K_M.
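That estimate is easy to reproduce: file size ≈ parameter count × effective bits per weight ÷ 8. The ~4.85 bits/weight figure for Q4_K_M and the 30.5B parameter count are assumptions taken from llama.cpp's quant descriptions and the model card, not from this thread:

```bash
# Rough GGUF size: params (B) * effective bits-per-weight / 8 = size in GB.
# 30.5B params and ~4.85 bpw for Q4_K_M are assumptions; tweak as needed.
echo "30.5 * 4.85 / 8" | bc -l   # ≈ 18.5 GB, before KV cache and runtime buffers
```

Add the KV cache and a few GB of working buffers on top of that and you land right around the ~20GB figure above.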