Generally speaking, how can you tell how much VRAM a model will take? It seems like a valuable bit of data that is missing from downloadable model (GGUF) files.

Very roughly, you can read the B in a model's name as GB of memory, then scale it by the quantization level (a quick sketch of the arithmetic follows the list). Say for an 8B model:

- FP16: 2x 8GB = 16GB

- Q8: 1x 8GB = 8GB

- Q4: 0.5x 8GB = 4GB
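To make that arithmetic concrete, here's a minimal Python sketch (the helper name is mine; it counts weight-only memory in decimal GB and ignores context/runtime overhead):

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only memory: parameter count x bits per weight, in (decimal) GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"8B @ {name}: ~{weight_vram_gb(8, bits):.0f} GB")
# 8B @ FP16: ~16 GB
# 8B @ Q8: ~8 GB
# 8B @ Q4: ~4 GB
```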

It doesn't 100% neatly map like this, but it gives you a rough measure. On top of this you need some more memory depending on the context length and a few other things.
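The context-length part is mostly the KV cache, which you can also estimate from the model's shape. A rough sketch, assuming a standard transformer layout; the Llama-3-8B-ish numbers below (32 layers, 8 KV heads via GQA, head dim 128) are just illustrative, so check your model's actual config:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """One K and one V vector per layer per token, FP16 (2 bytes) by default."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1e9

# Llama-3-8B-ish shape at 8k context:
print(f"{kv_cache_gb(32, 8, 128, 8192):.1f} GB")  # ~1.1 GB on top of the weights
```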

Rationale for the calculation above: a model is basically billions of variables, each holding a floating-point value. So the size of a model roughly maps to the number of variables (weights) x the precision of each variable (4, 8, 16 bits...).

You don't have to quantize all layers to the same precision, which is why you sometimes see fractional bits-per-weight figures like 1.58 bits.
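For mixed-precision quants, the effective bits per weight is just the size-weighted average across the layers. A toy sketch (the layer split and bit widths below are made up for illustration):

```python
def average_bpw(layer_specs: list[tuple[int, float]]) -> float:
    """layer_specs: (number of weights, bits per weight) for each group of layers."""
    total_bits = sum(n * bits for n, bits in layer_specs)
    total_weights = sum(n for n, _ in layer_specs)
    return total_bits / total_weights

# e.g. a quant that keeps 2B attention weights at 6 bits but squeezes 6B FFN weights to 4 bits:
print(average_bpw([(2_000_000_000, 6.0), (6_000_000_000, 4.0)]))  # 4.5 bits/weight
```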

The 1.58-bit quantization uses three values: -1, 0, 1. The bit count comes from log_2(3) = 1.58...

At that level you can pack 4 weights into a byte using 2 bits per weight. However, one of the four possible bit patterns in each 2-bit slot goes unused.
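A minimal sketch of that 2-bits-per-weight packing, mapping -1, 0, 1 to the codes 0, 1, 2 and never using code 3:

```python
def pack_ternary_2bit(weights: list[int]) -> bytes:
    """Pack ternary weights (-1, 0, 1) four to a byte, 2 bits each; code 3 is unused."""
    assert len(weights) % 4 == 0
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # -1 -> 0, 0 -> 1, 1 -> 2
        out.append(byte)
    return bytes(out)

def unpack_ternary_2bit(data: bytes) -> list[int]:
    return [((b >> (2 * j)) & 0b11) - 1 for b in data for j in range(4)]

ws = [-1, 0, 1, 1, 0, -1, -1, 0]
assert unpack_ternary_2bit(pack_ternary_2bit(ws)) == ws  # 2 bits/weight vs the 1.58 ideal
```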

More complex packing arrangements group weights together (e.g. groups of 3) and assign a bit pattern to each combination of values via a lookup table. This allows greater compression, closer to the 1.58-bit value.
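A sketch of that grouped encoding: a group of 3 ternary weights has 27 possible combinations, which fits in 5 bits, i.e. about 1.67 bits per weight; larger groups get closer to the log_2(3) ≈ 1.58 floor (e.g. 5 weights per byte is 1.6).

```python
import math
from itertools import product

# Enumerate every combination of 3 ternary weights and give each one an index (the lookup table).
groups = list(product((-1, 0, 1), repeat=3))        # 27 combinations
encode = {g: i for i, g in enumerate(groups)}       # group -> code
decode = {i: g for g, i in encode.items()}          # code -> group

bits_per_group = math.ceil(math.log2(len(groups)))  # 5 bits are enough for 27 codes
print(bits_per_group / 3)                           # ~1.67 bits per weight
print(math.log2(3))                                 # ~1.58, the theoretical floor

code = encode[(-1, 1, 0)]
assert decode[code] == (-1, 1, 0)
```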

It depends on quantization etc., but there are good calculators that will work out the KV cache and so on for you as well: https://apxml.com/tools/vram-calculator.