FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.

QAT, MTP, 128k context.

I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.

My experience also aligns with this. I'm running gemma4 31B on a 4090 through llama.cpp with unsloth models. I also run Qwen 3.6. Qwen is good for thinking and planning as it is faster, but Gemma4's generated code is much higher quality on the first try (Rust, C++ and C#), so it needs fewer revisions to reach a level I'm comfortable merging.

I second unsloth models. I'm using them over Blackwell-oriented NVFP4 models as they are (empirically) top in quality and performance.

NVFP4 will be better if the model provider actually post-trained properly after quantizing.

Which basically only Nvidia does, because it’s very expensive.

Though I’m currently working on QADing the smaller Qwen 3.5 models from FP16 teacher to NVFP4 student, to hopefully eventually apply it to 3.6 27B… harder to get right than I expected though!

I can't get Gemma4 to actually finish a turn properly, it's always ending abruptly or making malformed tool calls. It's probably something I've misconfigured in oMLX or OpenCode.

Same problem with Gemma 4 + oMLX + OpenCode. The thinking and tool calling seem to be parsed fine in other clients such as Open WebUI. This really shouldn’t even matter because the client isn’t responsible for parsing the output, but it’s happening anyway.

possibly a problem with the chat template

https://huggingface.co/google/gemma-4-31B-it/discussions/118

Huh. Same problem, and I run with llama.cpp. In my case, Gemma4-31B (4-bit quant though) will just stop sometimes.

Nice. I flip-flop between Qwen 3.5 9B Q6_M and Gemma4 12B Q4_K_M on a 4080 Super. They run at about the same speed and I can have them review each other's plans or diffs. For smaller projects I find them very capable, and I can step up to a better quant for slightly more challenging work.

You can probably also run Gemma4 26B on your card at 4-bit. World of difference compared with 12B.

Where does “big model highly quantized” start getting worse than “smaller model less quantized”? Is there a general formula or is it just trial and error?

The paper is a bit old, but it matches current empirical recommendations: a good starting point is the biggest model you can fit at 4-bit.

https://arxiv.org/abs/2212.09720
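
A rough way to sanity-check the "biggest model you can fit at 4 bit" rule of thumb is to estimate the weight footprint as parameters × bits per weight / 8. The sketch below does only that. Everything in it is an assumption for illustration: the helper names, the ~4.5 effective bits per weight for a 4-bit quant (real files vary by quant type), the 2 GiB headroom for KV cache and runtime overhead (long contexts can need much more), and treating a 16 GB card as 16 GiB.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         vram_gib: float, headroom_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed headroom fit in VRAM.

    headroom_gib is a hypothetical allowance for KV cache and runtime
    overhead; it is not measured from any particular runtime.
    """
    return weight_gib(params_billion, bits_per_weight) + headroom_gib <= vram_gib


def largest_fitting(sizes_billion, bits_per_weight, vram_gib):
    """Largest candidate size (in billions of params) that fits, or None."""
    ok = [s for s in sizes_billion if fits(s, bits_per_weight, vram_gib)]
    return max(ok) if ok else None


if __name__ == "__main__":
    # Assumed ~4.5 effective bits per weight for a 4-bit quant.
    for size in (12, 26, 27, 31):
        print(f"{size}B @ ~4.5 bpw: {weight_gib(size, 4.5):.1f} GiB, "
              f"fits in 16 GiB: {fits(size, 4.5, 16)}")
```

Under those assumptions a 26B model at 4-bit squeezes under 16 GiB while a 27B one lands slightly over, so this is only a first filter. The actual file size and the context length you want decide whether it really fits.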

Have you tried Qwen 27B Q4_K_XL? It's a little bigger than the 4080's VRAM, but not by much.