The number of quantized bits is a trade-off between size and quality. Ideally you should be aiming for a 6-bit or 5-bit model. I've seen some models become unstable at 4-bit (where they either repeat words or start generating random ones).

Anything below 4 bits is usually not worth it unless you want to experiment with running a 70B+ model -- though I don't have any experience doing that, so I don't know how well the larger parameter count offsets the heavier quantization.

See https://github.com/ggml-org/llama.cpp/pull/1684 and https://gist.github.com/Artefact2/b5f810600771265fc1e3944228... for comparisons between quantization levels.
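
If you want a feel for how quickly the error grows as you drop bits, here's a toy round-to-nearest sketch in Python (this is not llama.cpp's actual k-quant scheme; the block size and symmetric per-block scaling are just assumptions for illustration):

    # Toy round-to-nearest quantization of a random weight matrix at several
    # bit-widths, to show how reconstruction error grows as bits shrink.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4096, 4096)).astype(np.float32)

    def fake_quantize(w, bits, block=32):
        """Symmetric per-block round-to-nearest quantize/dequantize."""
        w = w.reshape(-1, block)
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max(axis=1, keepdims=True) / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        return (q * scale).reshape(-1)

    for bits in (8, 6, 5, 4, 3, 2):
        rmse = np.sqrt(np.mean((fake_quantize(W, bits) - W.ravel()) ** 2))
        print(f"{bits}-bit  RMSE {rmse:.4f}")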

> The number of quantized bits is a trade-off between size and quality. Ideally you should be aiming for a 6-bit or 5-bit model. I've seen some models become unstable at 4-bit (where they either repeat words or start generating random ones).

Note that that's a skill issue of whoever quantized the model. In general, quantization even as low as 3-bit can be almost lossless when you do quantization-aware finetuning[1] (and apparently you don't even need that many training tokens). Even if you don't want to do any extra training, you can be smart about which parts of the model you quantize and by how much, to minimize the damage (e.g. in the worst case, over-quantizing even a single weight can have disastrous consequences[2]).
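
The core trick in quantization-aware finetuning is a "fake quantize" pass with a straight-through estimator, so gradients still flow through the rounding step. A minimal PyTorch sketch (this is not the exact method from [1]; the per-tensor scale and the 2-bit default are my own simplifications):

    import torch

    class FakeQuantize(torch.autograd.Function):
        @staticmethod
        def forward(ctx, w, bits):
            # Round-to-nearest onto a symmetric low-bit grid.
            qmax = 2 ** (bits - 1) - 1
            scale = w.abs().max().clamp_min(1e-8) / qmax
            return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

        @staticmethod
        def backward(ctx, grad_out):
            # Straight-through: pretend the rounding step was the identity.
            return grad_out, None

    class QuantLinear(torch.nn.Linear):
        """Linear layer trained while its weights are forced through a low-bit grid."""
        def __init__(self, *args, bits=2, **kwargs):
            super().__init__(*args, **kwargs)
            self.bits = bits

        def forward(self, x):
            return torch.nn.functional.linear(
                x, FakeQuantize.apply(self.weight, self.bits), self.bias)

Swap QuantLinear in only for the layers you want to quantize aggressively and train as usual; the optimizer still holds full-precision master weights, which is what lets the model adapt to the low-bit grid.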

Some time ago I ran an experiment where I finetuned a small model while quantizing parts of it to 2-bits to see which parts are most sensitive (the numbers are the final loss; lower is better):

    1.5275   mlp.downscale
    1.5061   mlp.upscale
    1.4665   mlp.gate
    1.4531   lm_head
    1.3998   attn.out_proj
    1.3962   attn.v_proj
    1.3811   input_embedding
    1.3794   attn.k_proj
    1.3662   attn.q_proj
    1.3397   unquantized baseline

So as you can see, quantizing some parts of the model hurts much more than others. The down-projection in the MLP layers is the most sensitive part of the model (which also matches what [2] found), so it makes sense to quantize that part less aggressively and quantize other parts more strongly instead. But if you just do the naive "quantize everything to 4-bit" then sure, you might get broken models.
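
A sketch of what that kind of mixed-precision assignment could look like (the bit-widths and the HF/Llama-style parameter names below are illustrative assumptions, not a recipe):

    import torch

    # More bits for sensitive tensors, fewer for tolerant ones.
    BITS_BY_NAME = {
        "down_proj": 6,      # most sensitive in the table above
        "up_proj": 5,
        "gate_proj": 5,
        "lm_head": 5,
        "self_attn": 4,      # attention projections tolerated low bits best here
        "embed_tokens": 4,
    }
    DEFAULT_BITS = 4

    def quantize_dequantize(w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().amax().clamp_min(1e-8) / qmax
        return (w / scale).round().clamp(-qmax - 1, qmax) * scale

    def quantize_state_dict(state_dict):
        out = {}
        for name, w in state_dict.items():
            bits = next((b for key, b in BITS_BY_NAME.items() if key in name),
                        DEFAULT_BITS)
            # Only quantize weight matrices; leave biases/norms alone.
            out[name] = quantize_dequantize(w, bits) if w.ndim == 2 else w
        return out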

[1] - https://arxiv.org/pdf/2502.02631
[2] - https://arxiv.org/pdf/2411.07191

Interesting. I was aware of using an imatrix for the i-quants but didn't know you could use one for k-quants. I haven't experimented with imatrices in my local setup yet.
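
From what I gather the flow is roughly the two steps below (sketched as a Python wrapper; the binary names and flags are the ones in recent llama.cpp builds, so treat them as assumptions if you're on an older version, and the file names are made up):

    import subprocess

    MODEL_F16 = "model-f16.gguf"    # unquantized source model (assumed name)
    CALIB_TEXT = "calibration.txt"  # any representative text corpus

    # 1. Collect an importance matrix by running the model over calibration text.
    subprocess.run(["./llama-imatrix", "-m", MODEL_F16,
                    "-f", CALIB_TEXT, "-o", "imatrix.dat"], check=True)

    # 2. Feed it to the quantizer -- it can guide k-quants too, not just i-quants.
    subprocess.run(["./llama-quantize", "--imatrix", "imatrix.dat",
                    MODEL_F16, "model-Q4_K_M.gguf", "Q4_K_M"], check=True)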

And it's not a skill issue... it's the default behaviour/logic when using k-quants to quantize a model with llama.cpp.