> Qwen 3.6:27b uses 29/32gb of vram
What context size are you using for that?
Btw, are you using flash attention in Ollama for this model? I think it's required for this model to operate ok.
I squeezed it into 24 GiB of VRAM (I have an RX 7900 XTX):
-- Q5_K_M Unsloth quantization on Linux llama.cpp
-- 81k context, flash attention on, 8-bit K/V caches
-- 625 t/s prompt processing, 30 t/s token generation
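A setup like the one above can be sketched as a llama.cpp server invocation. This is a hypothetical command, not the poster's exact one: the model filename is an assumption, and flag spellings can vary between llama.cpp versions.

```shell
# Hypothetical example; "qwen-27b-Q5_K_M.gguf" is a placeholder filename.
# -c    sets the context window (~81k tokens here)
# -fa   enables flash attention
# -ctk / -ctv quantize the K and V caches to 8-bit (q8_0)
# -ngl 99 offloads all layers to the GPU
llama-server -m qwen-27b-Q5_K_M.gguf \
  -c 81920 -fa \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99
```

Quantizing the K/V caches roughly halves the cache's VRAM footprint versus fp16, which is what makes the large context fit alongside the weights.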
Depends entirely on the quantization. Q6_K at the maximum context length (262144) takes ~40 GB of VRAM. Q8 at that same context wouldn't fit in 48 GB of VRAM; it did fit with 128k of context.
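The reason context length dominates these numbers is the KV cache, which grows linearly with the context window. A minimal sketch of the arithmetic, where the layer count, KV head count, and head dimension are assumed illustrative values (not the real Qwen 27B config):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: int) -> int:
    """Total KV cache size: K and V each store
    n_layers * n_kv_heads * head_dim values per token of context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Assumed hypothetical config: 48 layers, 8 KV heads, head_dim 128.
fp16_cache = kv_cache_bytes(48, 8, 128, 262144, 2)  # fp16 K/V
q8_cache   = kv_cache_bytes(48, 8, 128, 262144, 1)  # 8-bit K/V

print(fp16_cache / 2**30, "GiB fp16")  # 48.0 GiB
print(q8_cache / 2**30, "GiB q8")      # 24.0 GiB
```

Under these assumed parameters the cache alone at 262k context would rival the weights in size, which is why dropping to 128k context (or to 8-bit caches) makes the difference between fitting and not fitting.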