I'd also recommend you go with something like 8B, so you can use the other 8GB of VRAM for a decent-sized context window. There are tons of good 8B ones, as mentioned above. If you go for the largest model you can fit, you'll get slower inference (especially as you pass in more tokens) and a smaller context window.

I think your recommendation falls within

> all of them will have some strengths and weaknesses

Sometimes a higher-parameter model with less quantization and a small context will be best; sometimes a lower-parameter model with some quantization and a huge context will be best; sometimes a high parameter count + heavy quantization + a medium context will be best.

It's really hard to say one model is better than another in a general way, since it depends on so many things like your use case, the prompts, the settings, quantization, quantization method and so on.

If you're building (or trying to build) anything that depends on LLMs in any capacity, the first step is coming up with your own custom benchmark/evaluation that puts your specific use cases under test. Don't share it publicly (so it doesn't end up in training data), and run it to figure out which model is best for that specific problem.
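
As a minimal sketch of what that can look like, assuming a local OpenAI-compatible endpoint (e.g. llama.cpp's llama-server on localhost:8080); the test cases and check functions here are made-up placeholders you'd replace with your own:

    import requests

    ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server

    # Each case pairs a prompt with a check encoding *your* definition of correct.
    CASES = [
        {"prompt": "Extract the total from: 'Total due: $431.20'",
         "check": lambda out: "431.20" in out},
        {"prompt": "Reply with valid JSON: a single key 'ok' set to true.",
         "check": lambda out: '"ok"' in out and "true" in out.lower()},
    ]

    def ask(prompt):
        resp = requests.post(ENDPOINT, json={
            "model": "local",  # llama-server serves whatever model it was started with
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,  # keep runs comparable across models/quants
        }, timeout=120)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    passed = 0
    for case in CASES:
        ok = case["check"](ask(case["prompt"]))
        passed += ok
        print(("PASS" if ok else "FAIL"), case["prompt"][:50])
    print(f"{passed}/{len(CASES)} passed")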

8B is the number of parameters (8 billion). The most common quant is 4 bits per parameter, so 8B params is roughly 4GB of VRAM (typically more like 4.5GB in practice).
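
Back-of-the-envelope version of that arithmetic (the bits/weight values are approximations; real ~4-bit quant formats like Q4_K_M store slightly more than 4 bits per weight once block scales and metadata are included):

    def weight_size_gb(params_billions, bits_per_weight):
        # size of the quantized weights alone -- no KV cache, no activations
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    print(weight_size_gb(8, 4.0))  # 4.0 GB at a flat 4 bits/weight
    print(weight_size_gb(8, 4.5))  # 4.5 GB, closer to real ~4-bit files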

The number of quantized bits is a trade off between size and quality. Ideally you should be aiming for a 6-bit or 5-bit model. I've seen some models be unstable at 4-bit (where they will either repeat words or start generating random words).

Anything below 4 bits is usually not worth it unless you want to experiment with running a 70B+ model -- though I don't have any experience doing that, so I don't know how well the increased parameter count offsets the heavier quantization.

See https://github.com/ggml-org/llama.cpp/pull/1684 and https://gist.github.com/Artefact2/b5f810600771265fc1e3944228... for comparisons between quantization levels.

> The number of quantized bits is a trade off between size and quality. Ideally you should be aiming for a 6-bit or 5-bit model. I've seen some models be unstable at 4-bit (where they will either repeat words or start generating random words).

Note that that's a skill issue on the part of whoever quantized the model. In general, quantization even as low as 3-bit can be almost lossless when you do quantization-aware finetuning[1] (and apparently you don't even need that many training tokens). Even if you don't want to do any extra training, you can be smart about which parts of the model you quantize and by how much, to minimize the damage (e.g. in the worst case, over-quantizing even a single weight can have disastrous consequences[2]).

Some time ago I ran an experiment where I finetuned a small model while quantizing parts of it to 2-bits to see which parts are most sensitive (the numbers are the final loss; lower is better):

    1.5275   mlp.downscale
    1.5061   mlp.upscale
    1.4665   mlp.gate
    1.4531   lm_head
    1.3998   attn.out_proj
    1.3962   attn.v_proj
    1.3811   input_embedding
    1.3794   attn.k_proj
    1.3662   attn.q_proj
    1.3397   unquantized baseline

So as you can see, quantizing some parts of the model affects it more strongly than others. The down-projection in the MLP layers is the most sensitive part of the model (which also matches what [2] found), so it makes sense to quantize that part less and quantize other parts more aggressively instead. But if you just do the naive "quantize everything to 4-bit" then, sure, you might get broken models.

[1] - https://arxiv.org/pdf/2502.02631
[2] - https://arxiv.org/pdf/2411.07191
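
A toy version of that kind of sweep, for anyone who wants to poke at it: fake-quantize one group of weights to 2 bits, re-apply the quantization after every optimizer step (a crude stand-in for quantization-aware training), and compare final losses. The model here is a deliberately tiny stand-in rather than a real LLM, and the group names are only loose analogues of mlp.upscale / mlp.downscale / lm_head:

    import copy
    import torch
    import torch.nn as nn

    def fake_quantize_(w, bits=2):
        # symmetric round-to-nearest with a single per-tensor scale, in place
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        w.copy_((w / scale).round().clamp(-qmax - 1, qmax) * scale)

    class TinyMLP(nn.Module):
        # stand-in for one block's MLP plus an output head
        def __init__(self, d=64, hidden=256, classes=10):
            super().__init__()
            self.up = nn.Linear(d, hidden)
            self.down = nn.Linear(hidden, d)
            self.head = nn.Linear(d, classes)
        def forward(self, x):
            return self.head(self.down(torch.relu(self.up(x))))

    def finetune_and_eval(model, requantize=None, steps=300):
        # short finetune on synthetic data; re-quantize after each step if asked
        torch.manual_seed(0)
        x, y = torch.randn(512, 64), torch.randint(0, 10, (512,))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            if requantize is not None:
                requantize()
        with torch.no_grad():
            return loss_fn(model(x), y).item()

    base = TinyMLP()
    print("baseline", finetune_and_eval(copy.deepcopy(base)))
    for group in ["up", "down", "head"]:
        m = copy.deepcopy(base)
        requant = lambda m=m, g=group: fake_quantize_(getattr(m, g).weight.data)
        requant()
        print(group, finetune_and_eval(m, requantize=requant))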

Interesting. I was aware of using an imatrix for the i-quants but didn't know you could use them for k-quants. I've not experimented with using imatrices in my local setup yet.

And it's not a skill issue... it's the default behaviour/logic when using k-quants to quantize a model with llama.cpp.

With a 16GB GPU you can comfortably run something like Qwen3 14B or Mistral Small 24B at Q4 to Q6, still have plenty of room for context, and get much better capabilities than an 8B model.
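
Rough numbers behind that (the effective bits/weight per quant level are approximations; actual GGUF files vary a bit):

    VRAM_GB = 16
    QUANTS = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6}  # approx effective bits/weight

    for params_b in (14, 24):
        for name, bits in QUANTS.items():
            size = params_b * bits / 8  # GB of weights
            print(f"{params_b}B @ {name}: ~{size:.1f} GB weights, "
                  f"~{VRAM_GB - size:.1f} GB left for context/overhead")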

Can system RAM be used for context (albeit at lower parsing speeds)?

Yeah, but it sucks. Even on a GPU, if the memory bandwidth/speeds are bad things will be slow, and system RAM is worse still (the M1/M2/M3 unified-memory machines being the exception).

What if one is okay with walking away and letting it run (crawl)? Could the context then live in 128 or 256GB of system RAM while the model is jammed into what precious VRAM there is?

I’m curious (as someone who knows nothing about this stuff!)—the context window is basically a record of the conversation so far and other info that isn’t part of the model, right?

I’m a bit surprised that 8GB is useful as a context window if that is the case—it just seems like you could fit a ton of research papers, emails, and textbooks in 2GB, for example.

But, I’m commenting from a place of ignorance and curiosity. Do models blow up the info in the context window, maybe do some processing to pre-digest it?

Yes, every token is expanded into a vector that can have many thousands of dimensions, and these vectors are stored for every token at every layer.

You absolutely cannot fit even a single research paper in 2 GB, much less an entire book.
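
To put rough numbers on that, here is a per-token KV-cache calculation. The architecture values are assumptions for a Llama-3-8B-style model (32 layers, head dim 128, fp16 cache); models without grouped-query attention cache far more per token:

    def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_val=2):
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # 2 = one K + one V

    budget = 2 * 1024**3  # 2 GiB of cache
    for label, heads in [("GQA, 8 KV heads", 8), ("full MHA, 32 KV heads", 32)]:
        per_tok = kv_bytes_per_token(n_kv_heads=heads)
        print(f"{label}: {per_tok // 1024} KiB/token, ~{budget // per_tok} tokens in 2 GiB")

So under those assumptions you get on the order of a few thousand to a few tens of thousands of tokens per 2 GiB, which is a long way from "a ton of research papers and textbooks".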