Probably about one byte per weight (parameter), plus some extra for the key-value cache, which scales with the size of the context window.
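For a rough sense of the numbers, here's a back-of-the-envelope sketch in Python. The model shape used (7B parameters, 32 layers, 32 KV heads, head dim 128) is an illustrative assumption, roughly a Llama-7B-style model, and the KV cache is assumed to be kept in fp16:

```python
# Back-of-the-envelope memory estimate: weights + KV cache.
# The shapes below are assumptions (roughly Llama-7B-like), not exact.

def weight_memory_gb(n_params, bytes_per_param=1.0):
    """Weight memory at a given precision (1.0 = 8-bit, 0.5 = 4-bit)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache: 2 tensors (K and V) per layer, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

params = 7e9
print(f"weights @ 8-bit:  {weight_memory_gb(params, 1.0):.1f} GB")  # ~7.0 GB
print(f"weights @ 4-bit:  {weight_memory_gb(params, 0.5):.1f} GB")  # ~3.5 GB
# 32 layers, 32 KV heads, head_dim 128, 4096-token context, fp16 cache
print(f"KV cache @ 4k ctx: {kv_cache_gb(32, 32, 128, 4096):.1f} GB")  # ~2.1 GB
```

So the "bit extra" for the cache is on the order of a couple of gigabytes at a 4k context for a model this size, and it grows linearly with context length.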
You can go below one byte per parameter. 4-bit quantization is fairly popular. It does affect quality - for some models more than others - but, generally speaking, a 4-bit quantized model is still going to do significantly better than an 8-bit model with half as many parameters.
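To make the 4-bit part concrete, here's a minimal sketch of block-wise symmetric 4-bit quantization in NumPy. This isn't any particular library's scheme (real formats like GPTQ or llama.cpp's Q4 variants are more elaborate, and actually pack two 4-bit values per byte); the block size of 32 and the [-7, 7] integer range are illustrative assumptions:

```python
import numpy as np

def quantize_4bit(w, block_size=32):
    """Block-wise symmetric quantization: one float scale per block,
    weights rounded to integers in [-7, 7] (15 of the 16 int4 levels)."""
    w = w.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero blocks
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale  # storage packing (two int4 per byte) omitted here

def dequantize_4bit(q, scale):
    """Recover approximate float weights from quantized values + scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```

Each weight costs 4 bits plus the block's amortized scale, so in practice you land a little above 0.5 bytes per parameter - the quality hit comes from the rounding error the last line prints.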