Hacker News

This is my experience too. Qwen optimizes for a lot of scenarios, which masks its weaker generalization compared to US frontier models.

Never go below an fp16 KV cache unless you've already tested the lower precision with your model on a verified task that you know it can complete. Run both configurations with the exact same seed so you can see where the tokens diverge. If you're memory constrained, you can sometimes keep the fp16 KV cache and use storage as an agentic buffer, working the task with mixed abstractions rather than holding everything in memory.
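For the same-seed comparison, something like this rough sketch is enough to see where two runs split. It assumes you have already dumped the generated token IDs from each run (one with the fp16 KV cache, one with the quantized cache) into plain lists; the function names and the sample IDs are made up for illustration and are not tied to any particular inference engine.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the runs match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One run is a strict prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def shared_prefix_ratio(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of the longer run that is covered by the common prefix."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    idx = first_divergence(a, b)
    return 1.0 if idx is None else idx / longest


if __name__ == "__main__":
    # Hypothetical token IDs from two runs with the same prompt and seed.
    fp16_run = [101, 7, 42, 42, 9, 300]
    quantized_run = [101, 7, 42, 13, 9, 300]
    print("first divergence at token:", first_divergence(fp16_run, quantized_run))
    print("shared prefix ratio:", shared_prefix_ratio(fp16_run, quantized_run))
```

An early divergence doesn't prove the quantized cache is worse on its own, so still check whether the run finishes the verified task, but it tells you where to start reading the two outputs side by side.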

For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.