> The number of quantized bits is a trade-off between size and quality. Ideally you should be aiming for a 6-bit or 5-bit model. I've seen some models be unstable at 4-bit (where they will either repeat words or start generating random words).
Note that that's a skill issue of whoever quantized the model. In general, quantization even as low as 3-bit can be almost lossless when you do quantization-aware finetuning[1] (and apparently you don't even need that many training tokens). But even if you don't want to do any extra training, you can be smart about which parts of the model you quantize and by how much to minimize the damage (e.g. in the worst case, over-quantizing even a single weight can have disastrous consequences[2]).
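To make the "quantize different parts by different amounts" idea concrete, here is a minimal sketch of round-to-nearest fake quantization with a per-layer bit budget, assuming PyTorch and Llama-style module names; the name substrings and bit counts are illustrative placeholders, not a recommendation:

```python
import torch
import torch.nn as nn

def fake_quantize_(weight: torch.Tensor, bits: int) -> None:
    """Symmetric per-tensor round-to-nearest quantization, applied in place."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / qmax
    weight.copy_((weight / scale).round().clamp(-qmax - 1, qmax) * scale)

# Hypothetical bit budget: spend more bits on the sensitive MLP down-projection,
# fewer on the attention projections. Both the substrings and the bit counts
# depend on the architecture and are assumptions, not recommendations.
BITS_PER_MODULE = {
    "down_proj": 6,
    "up_proj": 4,
    "gate_proj": 4,
    "o_proj": 4,
    "v_proj": 4,
    "k_proj": 3,
    "q_proj": 3,
}

def quantize_model_(model: nn.Module) -> None:
    with torch.no_grad():
        for name, module in model.named_modules():
            if not isinstance(module, nn.Linear):
                continue
            for key, bits in BITS_PER_MODULE.items():
                if key in name:
                    fake_quantize_(module.weight, bits)
                    break
```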
Some time ago I ran an experiment where I finetuned a small model while quantizing parts of it to 2 bits to see which parts are the most sensitive (the numbers are the final loss, lower is better; a rough sketch of the setup follows the list):
1.5275 mlp.downscale
1.5061 mlp.upscale
1.4665 mlp.gate
1.4531 lm_head
1.3998 attn.out_proj
1.3962 attn.v_proj
1.3811 input_embedding
1.3794 attn.k_proj
1.3662 attn.q_proj
1.3397 unquantized baseline
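For the curious, the sweep boils down to roughly the following sketch, using the same kind of round-to-nearest quantization as in the earlier snippet. It's simplified in that it quantizes the chosen group once up front rather than re-quantizing on every training step, and `load_model` and `finetune_and_eval` are hypothetical stand-ins for your own loading and training/eval code; the parameter-name substrings again assume Llama-style naming:

```python
import torch

# Parameter-name substrings for each module group (Llama-style names;
# adjust for your architecture).
GROUPS = {
    "mlp.downscale": ["down_proj"],
    "mlp.upscale": ["up_proj"],
    "mlp.gate": ["gate_proj"],
    "attn.out_proj": ["o_proj"],
    "attn.v_proj": ["v_proj"],
    "attn.k_proj": ["k_proj"],
    "attn.q_proj": ["q_proj"],
    "lm_head": ["lm_head"],
    "input_embedding": ["embed_tokens"],
}

def quantize_group_(model, substrings, bits=2):
    """Round-to-nearest fake quantization of every matching parameter, in place."""
    qmax = 2 ** (bits - 1) - 1
    with torch.no_grad():
        for name, param in model.named_parameters():
            if any(s in name for s in substrings):
                scale = param.abs().max() / qmax
                param.copy_((param / scale).round().clamp(-qmax - 1, qmax) * scale)

def sensitivity_sweep(load_model, finetune_and_eval):
    """load_model() returns a fresh model; finetune_and_eval(model) returns a loss."""
    results = {"unquantized baseline": finetune_and_eval(load_model())}
    for group, substrings in GROUPS.items():
        model = load_model()                 # fresh copy for every run
        quantize_group_(model, substrings)   # 2-bit quantize only this group
        results[group] = finetune_and_eval(model)
    for group, loss in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{loss:.4f} {group}")
```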
So as you can see, quantizing some parts of the model hurts it more than others. The down-projection in the MLP layers is the most sensitive part of the model (which also matches what [2] found), so it makes sense to quantize that part less and quantize other parts more aggressively instead. But if you just do the naive "quantize everything to 4-bit" then sure, you might get broken models.

[1] - https://arxiv.org/pdf/2502.02631
[2] - https://arxiv.org/pdf/2411.07191
Interesting. I was aware of using an imatrix for the i-quants but didn't know you could use one for k-quants. I've not experimented with using imatrices in my local setup yet.
And it's not a skill issue... it's the default behaviour/logic when using k-quants to quantize a model with llama.cpp.
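For reference, the rough flow for using an imatrix with a k-quant in llama.cpp looks something like this (wrapped in Python here; binary names and flags have changed across llama.cpp versions, older builds call these `imatrix` and `quantize`, so check `--help` on your build):

```python
import subprocess

MODEL_F16 = "model-f16.gguf"     # hypothetical paths
CALIBRATION = "calibration.txt"  # plain-text calibration data
IMATRIX = "imatrix.dat"

# 1. Compute an importance matrix from the calibration text.
subprocess.run(
    ["llama-imatrix", "-m", MODEL_F16, "-f", CALIBRATION, "-o", IMATRIX],
    check=True,
)

# 2. Quantize to a k-quant, letting the importance matrix weight the rounding.
subprocess.run(
    ["llama-quantize", "--imatrix", IMATRIX, MODEL_F16, "model-Q5_K_M.gguf", "Q5_K_M"],
    check=True,
)
```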