At 16GB, a Q4 quant of Mistral Small 3.1 or Qwen3-14B at FP8 will probably serve you best. You'd be cutting it a little close on context length, since the weights alone eat most of that VRAM (rough math below). If you want longer context, a Q4 quant of Qwen3-14B will be a bit dumber than FP8 but will leave you more breathing room. Mistral Small can take images as input, and Qwen3 will be a bit better at math/coding; YMMV otherwise.
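For a rough sense of the budget, here's the back-of-envelope math I'm doing (just a sketch; the architecture numbers are from memory, so treat them as approximate):

    # Rough VRAM math; ignores runtime overhead (CUDA context, activations,
    # etc.), which can easily eat another 1-2 GB on top.

    def weights_gb(params_b: float, bits_per_weight: float) -> float:
        # billions of params * bits per weight / 8 bits per byte -> GB
        return params_b * bits_per_weight / 8

    def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                    context_len: int, bytes_per_elem: int = 2) -> float:
        # K and V (hence the 2x), per layer, per token, FP16 cache by default
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    # Qwen3-14B at FP8 (~14.8B params; ~40 layers, 8 KV heads, head_dim 128):
    print(weights_gb(14.8, 8))              # ~14.8 GB of weights
    print(kv_cache_gb(40, 8, 128, 8_192))   # ~1.3 GB of KV cache at just 8k context

    # Mistral Small 3.1 (24B) at Q4 (~4.8 bits/weight once you count the scales):
    print(weights_gb(24, 4.8))              # ~14.4 GB of weights

Either way you're brushing up against 16GB before any runtime overhead, which is why the context ceiling is low.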
Going below Q4 isn't worth it IMO. If you want significantly more context, probably drop down to a Q4 quant of Qwen3-8B rather than continuing to lobotomize the 14B.
Some folks have been recommending Qwen3-30B-A3B, but I think 16GB of VRAM is probably not quite enough for that: at Q4 you'd be looking at ~15GB for the weights alone. Qwen3-14B should be pretty similar in practice despite the lower param count, though, since it's a dense model rather than a sparse one: dense models are generally smarter-per-param than sparse models, but somewhat slower. Your 5060 should be plenty fast enough for the 14B as long as you keep everything on-GPU and stay away from CPU offloading.
Since you're on a Blackwell-generation Nvidia chip, using LLMs quantized to NVFP4 specifically will give you some speed improvement at some quality cost compared to FP8 (and will be faster than Q4 GGUF, although ~equally dumb). Ollama doesn't support NVFP4 yet, so you'd need to use vLLM (which isn't too hard, and will give better token throughput anyway). Finding pre-quantized NVFP4 models will be harder since support is less widespread, but you can use llmcompressor [1] to statically compress any FP16 LLM to NVFP4 locally. You'll probably need accelerate to offload params to CPU during the one-time compression pass, which llmcompressor has documentation for; rough sketch below.
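Something like this (from memory, so treat it as a sketch: the exact scheme name and oneshot arguments may differ between llmcompressor versions, and the model ID and calibration settings are just placeholders):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    MODEL_ID = "Qwen/Qwen3-14B"  # placeholder: whatever model you settled on

    # device_map="auto" lets accelerate spill weights to CPU RAM during the
    # (slow, one-time) compression pass, since the FP16 model won't fit in 16GB.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Quantize all Linear layers to NVFP4, leaving the output head in higher precision.
    recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

    # NVFP4 uses static scales, so it needs a small calibration set.
    oneshot(
        model=model,
        dataset="open_platypus",
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
    )

    save_dir = MODEL_ID.split("/")[-1] + "-NVFP4"
    model.save_pretrained(save_dir, save_compressed=True)
    tokenizer.save_pretrained(save_dir)

After that, pointing vLLM at the output directory (vllm serve ./Qwen3-14B-NVFP4) should pick up the compressed checkpoint.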
I wouldn't reach for this particular power tool until you've already decided on an LLM and just want faster perf: it's a bit more involved than just using ollama, and the initial quantization will be slow due to the CPU offload during compression (though it's only a one-time cost). But once you've landed on a Q4-class favorite, it's not a bad upgrade.