Something I have been wondering about is doing regressive, layer-specific quantization based on large test sets, i.e. aggressively reduce the precision of exactly those layers that don't contribute to general quality.
I've experimented with this on diffusion models using a safetensors-to-GGUF tool I wrote. Even with relatively few sample images (~10k, still enough to keep my 3090 spinning for days straight), the benefits are quite noticeable: a smaller file with overall better results.
This is a thing! For example, https://arxiv.org/abs/2511.06516
that's brilliant, I wonder why we haven't seen much use of it to do very heavy quantization
This is a very well-established idea. It's called dynamic quantization: vary the quantization bit-width (or skip quantization altogether) on a layer-by-layer basis, using a calibration dataset.
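To make that concrete, here's a minimal sketch of the calibration loop, assuming toy uniform symmetric quantization and a single linear "layer" per weight vector; the `pick_bits` helper, the layer names, and the error budget are all hypothetical illustrations, not any particular tool's API:

```python
import random

def quantize(weights, bits):
    # Uniform symmetric quantization of a weight vector to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    return [round(w / scale) * scale for w in weights]

def calibration_error(weights, inputs, bits):
    # Mean absolute output error of a single linear "layer" on calibration inputs.
    qw = quantize(weights, bits)
    total = 0.0
    for x in inputs:
        ref = sum(w * xi for w, xi in zip(weights, x))
        approx = sum(w * xi for w, xi in zip(qw, x))
        total += abs(ref - approx)
    return total / len(inputs)

def pick_bits(weights, inputs, budget, candidates=(2, 3, 4, 6, 8)):
    # Return the smallest candidate bit-width whose calibration error
    # stays under `budget`; fall back to the widest candidate.
    for bits in sorted(candidates):
        if calibration_error(weights, inputs, bits) <= budget:
            return bits
    return max(candidates)

# Toy demo: assign a per-layer bit-width from a shared calibration set.
random.seed(0)
layers = {name: [random.gauss(0, 1) for _ in range(64)]
          for name in ("attn", "mlp", "head")}
calib = [[random.gauss(0, 1) for _ in range(64)] for _ in range(256)]
for name, w in layers.items():
    print(name, pick_bits(w, calib, budget=0.5))
```

Real schemes measure end-to-end model quality rather than per-layer output error, but the structure is the same: score each layer at each bit-width against calibration data, then keep the cheapest setting that stays within a quality budget.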
EvoPress is the first thing that comes to mind when I think of dynamic quantization.
https://arxiv.org/abs/2410.14649