Q4 is usually around 4.5 bits per parameter, but can be more since some layers are quantised at higher precision. That would suggest 30 billion * 4.5 bits ≈ 15.7GB, though the quant the GP is using is 17.3GB, and the one in the article is 19.7GB. Add roughly 20-50% overhead for various things, plus a bit more for each 1k tokens of context, and you're probably looking at no more than 32GB. If you're using something like llama.cpp, which can offload part of the model to the GPU, you'll still get decent performance even on a GPU with 16GB of VRAM.
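
Roughly, written out (the 30% overhead and the per-1k-token context cost below are just placeholder assumptions to illustrate the estimate, not measurements):

    # Back-of-envelope memory estimate for a ~Q4 quant of a 30B model
    params = 30e9                    # 30B parameters
    bits_per_param = 4.5             # Q4-ish, some layers at higher precision
    weights_gib = params * bits_per_param / 8 / 2**30
    print(f"weights: {weights_gib:.1f} GiB")        # ~15.7 GiB

    overhead = 1.3                   # assume ~30% (somewhere in the 20-50% range)
    ctx_gib_per_1k = 0.1             # assumed KV-cache cost per 1k tokens; model dependent
    ctx_tokens = 8000
    total = weights_gib * overhead + ctx_gib_per_1k * ctx_tokens / 1000
    print(f"rough total: {total:.1f} GiB")          # comfortably under 32 GiB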

Sounds close! top says my llama is using 17.7G virt, 16.6G resident with:

    ./build/bin/llama-cli -m /discs/fast/ai/Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf --jinja -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --presence-penalty 1.0 -t 32 -dev none