Hacker News

Not much. The capable models won't fit unless you go with very low quantization, but that leads to a lot of quality loss.

You generally want to run q8, or at the very least some kind of "6-bit" quantization.

40GB of VRAM is the entry point in my experience: you can run qwen 3.6 35b a3b with full context, or qwen 27b with about 92k of context.
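For a rough sense of where numbers like that come from, here is a back-of-the-envelope sketch of the memory taken by the weights alone. It assumes size is roughly parameter count times bits per weight; the function name is made up for illustration, and real quantization formats carry some extra overhead for scales, so treat the output as a lower bound. Whatever is left over on the card has to hold the context (KV cache) and runtime buffers.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Assumption: size = parameters * bits / 8. Real quantized files are
    somewhat larger because of per-block scales and unquantized tensors.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Parameter counts taken from the model names mentioned above.
    for label, params in [("35b", 35), ("27b", 27)]:
        for bits in (8, 6, 4):
            print(f"{label} at {bits}-bit: ~{weight_gib(params, bits):.1f} GiB of weights")
```

By this estimate a 35b model at 8-bit already takes most of a 40GB budget before any context is allocated, which fits with 40GB being about the floor.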

Before you get fully discouraged: you don't need one GPU with 40GB. You can use multiple cards, with minimal impact on performance.
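One common way to use several cards is to put different layers of the model on different GPUs. As a minimal sketch of that idea (not any particular tool's actual algorithm), the snippet below divides a layer count across cards in proportion to their VRAM. The function name, the 64-layer figure, and the 24GB + 16GB pairing are all hypothetical values chosen for illustration.

```python
def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign layers to cards in proportion to each card's VRAM.

    Illustrative only: assumes every layer costs about the same memory
    and ignores context (KV cache) and per-card runtime overhead.
    """
    total = sum(vram_gb)
    shares = [n_layers * v / total for v in vram_gb]
    counts = [int(s) for s in shares]  # round each share down first

    # Hand the leftover layers to the cards with the largest fractional part.
    leftover = n_layers - sum(counts)
    by_fraction = sorted(
        range(len(vram_gb)),
        key=lambda i: shares[i] - counts[i],
        reverse=True,
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical setup: a 64-layer model on a 24GB card plus a 16GB card.
    print(split_layers(64, [24, 16]))
```

In practice you would pass a ratio like this to whatever split option your inference runtime provides, rather than computing placements yourself.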