The best place to look is HuggingFace

Qwen is pretty good and comes in a variety of sizes. Given your VRAM I’d suggest Qwen/Qwen3-14B-GGUF at Q4_K_M, and running it with llama-server or LM Studio (there might be alternatives to LM Studio, but these are generally nice UIs for llama-server). It’ll use around 7-8 GB for the weights, leaving room for incidentals.
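
If you’d rather grab the file yourself instead of letting the UI handle it, here’s a minimal sketch using the huggingface_hub library. The exact GGUF filename is an assumption on my part; check the repo’s file listing first.

```python
# Sketch: download one quantized GGUF file from Hugging Face.
# Assumes `pip install huggingface_hub`; the filename below is a guess at how
# the Q4_K_M file is named in the Qwen/Qwen3-14B-GGUF repo, so verify it first.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Qwen/Qwen3-14B-GGUF",
    filename="Qwen3-14B-Q4_K_M.gguf",  # check the repo for the exact name
)
print("Downloaded to:", path)  # point llama-server / LM Studio at this path
```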

Llama 3.3 could work for you

Devstral is too big to run unquantized, but a quantized version could work.

Gemma is good, but it tends to refuse a lot. MedGemma is a nice one to have around just in case.

“Uncensored” Dolphin models from Eric Hartford and “abliterated” models are what you want if you don’t want them refusing requests. It’s mostly not necessary for routine use, but sometimes you ask them to write a joke and they won’t do it, or if you’re doing work that involves defense contracting or security research, that kind of thing, they can be handy.

Generally models ship in bf16, which is two bytes per parameter, so you multiply the number of billions of parameters by two to get the unquantized model size in GB.

Then, to get a model that fits on your rig, you generally want a quantized one. Typically I go for “Q4_K_M”, which is roughly 4 bits per parameter, so you divide the number of billions of parameters by two to estimate the VRAM needed for the weights in GB.

I’m not sure what the overhead for activations is, but it’s a good idea to leave wiggle room and experiment with sizes well below your 16 GB.
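
Putting those rules of thumb into a few lines of Python (this is just the back-of-the-envelope math from above; real GGUF files carry some metadata, Q4_K_M actually lands a little above 4 bits per weight, and activations/context need VRAM on top):

```python
# Back-of-the-envelope weight sizes, per the rules of thumb above.
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the weights in GB."""
    return params_billion * bits_per_param / 8  # bits -> bytes, billions -> GB

for name, params in [("Qwen3-14B", 14), ("an 8B model", 8)]:
    bf16 = weight_gb(params, 16)   # unquantized: 2 bytes per parameter
    q4 = weight_gb(params, 4.5)    # Q4_K_M is a bit over 4 bits per param
    print(f"{name}: bf16 ~{bf16:.0f} GB, Q4_K_M ~{q4:.1f} GB")
```

For the 14B model that comes out to around 28 GB in bf16 and roughly 8 GB at Q4_K_M, which is why it fits your card with room to spare.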

llama-server is a good way to run models locally; it serves a web GUI on the index route and has a -hf flag to download models straight from Hugging Face.
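
Once it’s running you can also hit it programmatically, since llama-server exposes an OpenAI-compatible API. A minimal sketch, assuming the default localhost:8080 address (adjust if you started it with a different host or port):

```python
# Sketch: query a running llama-server over its OpenAI-compatible endpoint.
# Assumes the server is already up on the default port, e.g. started with
# something like: llama-server -hf Qwen/Qwen3-14B-GGUF:Q4_K_M
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # llama-server answers with whatever model it loaded
        "messages": [{"role": "user", "content": "Give me a haiku about GPUs."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```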

LM Studio is a good GUI; it sets up a llama.cpp server for you and helps with managing models.

Whatever you pick, make sure you run some kind of server that loads the model once and keeps it in memory. You definitely don’t want to load many gigabytes of weights into VRAM for every question if you want fast, real-time answers.
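
That’s exactly what the server setup above gives you: the weights go into VRAM once at startup, and each question afterwards is just a cheap HTTP call. A small sketch of that pattern against the same assumed localhost:8080 llama-server:

```python
# Sketch: many questions against one already-loaded model.
# The server keeps the weights resident; each request only pays for inference.
import requests

session = requests.Session()  # reuse one connection for all the questions

def ask(question: str) -> str:
    resp = session.post(
        "http://localhost:8080/v1/chat/completions",
        json={"messages": [{"role": "user", "content": question}]},
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

for q in ["What's a GGUF file?", "What does Q4_K_M mean?"]:
    print(q, "->", ask(q))
```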