I run Qwen3-32B with Unsloth Dynamic Quants 2.0, quantized to 4+ bits, with the key-value cache reduced to 8-bit. It's my favorite configuration so far - imho it has the best quality/speed ratio right now.
It's pretty magical - it often feels like I'm talking to GPT-4o or o1, until it makes a silly mistake once in a while. It supports reasoning out of the box, which improves results considerably.
With the settings above, I get 60 tokens per second on an RTX 5090, since the whole model fits in GPU memory. It feels faster than GPT-4o. A 32k context with 2 parallel generations* consumes 28 GB of VRAM (with llama.cpp), so you still have 4 GB left for something else.
* I use 2 parallel generations because there are a few of us sharing the same GPU. If you use only 1 parallel generation, you can increase the context to 64k.
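For reference, a llama-server command along these lines should roughly reproduce that setup. Treat it as a sketch: the GGUF filename is just a placeholder for whichever Unsloth dynamic quant you downloaded, and flag spellings can vary a bit between llama.cpp builds.

    # -c is the total context shared across parallel slots,
    # so 65536 with -np 2 gives each request ~32k
    # (quantized V cache needs flash attention enabled)
    llama-server \
      -m Qwen3-32B-UD-Q4_K_XL.gguf \
      -c 65536 -np 2 \
      -ngl 99 \
      --flash-attn \
      --cache-type-k q8_0 --cache-type-v q8_0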
Comes with 32GB of VRAM, right?
Speaking of which, would a 12-core Ryzen 9 be a good pairing for a 5090 setup?
Or should one really go dual 5090s?