Hacker News

FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.

QAT, MTP, 128k context.

I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.

My experience also aligns with this. I'm running gemma4 31B on a 4090 through llama.cpp with unsloth models. I also run Qwen 3.6. Qwen is good for thinking and planning as it is faster, but Gemma4's generated code is much higher quality on the first try (Rust, C++ and C#), so it needs fewer revisions to reach a level I'm comfortable merging.

beastman82 a day ago

I second the unsloth models. I'm using them over Blackwell-oriented NVFP4 models, as they are (empirically) top in both quality and performance.

kroaton 21 hours ago

NVFP4 will be better if the model provider actually post-trained properly after quantizing.

girvo 13 hours ago

Which basically only Nvidia does, because it’s very expensive.

Though I’m currently working on QADing the smaller Qwen 3.5 models from FP16 teacher to NVFP4 student, to hopefully eventually apply it to 3.6 27B… harder to get right than I expected though!

nozzlegear a day ago

I can't get Gemma4 to actually finish a turn properly; it's always ending abruptly or making malformed tool calls. It's probably something I've misconfigured in oMLX or Opencode.

anon373839 8 hours ago

Same problem with Gemma 4 + oMLX + OpenCode. The thinking and tool calling seem to be parsed fine in other clients such as Open WebUI. This really shouldn’t even matter because the client isn’t responsible for parsing the output, but it’s happening anyway.

acrispino 18 hours ago

possibly a problem with the chat template

https://huggingface.co/google/gemma-4-31B-it/discussions/118

clusterhacks a day ago

Huh. Same problem, and I run with llama.cpp. In my case, Gemma4-31B (4-bit quant though) will just stop sometimes.
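One way to narrow down "stops abruptly" versus "malformed tool call" is to look at the raw response before the client touches it. A minimal sketch, assuming the server returns an OpenAI-style chat completion object (`choices[0].finish_reason`, `message.tool_calls[].function.arguments` as a JSON string); the `diagnose` helper and the sample response are hypothetical, not taken from oMLX, OpenCode or llama.cpp:

```python
import json


def diagnose(response: dict) -> list[str]:
    """Return a list of human-readable problems found in one chat completion."""
    problems = []
    choice = response["choices"][0]
    finish = choice.get("finish_reason")
    message = choice.get("message") or {}
    tool_calls = message.get("tool_calls") or []

    # "length" means generation was cut off by the token limit,
    # which looks like an abrupt stop from the client's side.
    if finish == "length":
        problems.append("truncated: generation hit the token limit")

    # Tool-call arguments arrive as a JSON string; a cut-off or badly
    # templated call usually fails to parse.
    for call in tool_calls:
        fn = call.get("function") or {}
        name = fn.get("name")
        try:
            args = json.loads(fn.get("arguments"))
        except (json.JSONDecodeError, TypeError):
            problems.append(f"tool call {name!r}: arguments are not valid JSON")
            continue
        if not isinstance(args, dict):
            problems.append(f"tool call {name!r}: arguments are not a JSON object")

    # A normal stop with neither text nor tool calls is an empty turn.
    if finish == "stop" and not message.get("content") and not tool_calls:
        problems.append("empty turn: model stopped without content or tool calls")

    return problems


# Hypothetical response: cut off in the middle of a tool call.
sample = {
    "choices": [{
        "finish_reason": "length",
        "message": {
            "content": None,
            "tool_calls": [{
                "function": {"name": "read_file", "arguments": '{"path": "src/main.rs"'}
            }],
        },
    }]
}

for problem in diagnose(sample):
    print(problem)
```

If the raw response looks clean and the client still misbehaves, that points at client-side parsing; if it is already broken here, the chat template or the server's tool-call parser is the more likely suspect.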

accrual a day ago

Nice. I flip-flop between Qwen 3.5 9B Q6_M and Gemma4 12B Q4_K_M on a 4080 Super. They run at about the same speed and I can have them review each other's plans or diffs. For smaller projects I find them very capable, and I can step up to a better quant for slightly more challenging work.

nok22kon a day ago

You can probably also run Gemma4 26B on your card at 4-bit. It's a world of difference compared with 12B.

zingar a day ago

Where does “big model highly quantized” start getting worse than “smaller model less quantized”? Is there a general formula or is it just trial and error?

nok22kon 21 hours ago

The paper is a bit old, but it matches the current empirical recommendation: a good starting point is the biggest model you can fit at 4-bit.

https://arxiv.org/abs/2212.09720
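To apply the "biggest model that fits at 4-bit" starting point, a back-of-the-envelope check is weight size = parameters × bits per weight / 8. A minimal sketch; the bits-per-weight figures, the 2 GiB allowance for KV cache and runtime overhead, and the function names are assumptions for illustration, not measured values:

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         vram_gib: float, overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM.

    overhead_gib is a placeholder for KV cache, activations and runtime
    buffers; the real figure grows with context length.
    """
    return weight_gib(params_billion, bits_per_weight) + overhead_gib <= vram_gib


# Assumed effective bits per weight: about 4.5 for a 4-bit K-quant,
# about 6.5 for a 6-bit one. Real quant files vary.
candidates = [
    ("9B at ~6.5 bpw", 9, 6.5),
    ("12B at ~4.5 bpw", 12, 4.5),
    ("27B at ~4.5 bpw", 27, 4.5),
    ("31B at ~4.5 bpw", 31, 4.5),
]

for vram in (16, 24, 32):
    print(f"--- {vram} GiB card ---")
    for label, params, bpw in candidates:
        size = weight_gib(params, bpw)
        verdict = "fits" if fits(params, bpw, vram) else "does not fit"
        print(f"{label}: {size:.1f} GiB weights, {verdict}")
```

This only answers whether a model fits, not which one is better; for that it is still trial and error on your own tasks, with the largest model that fits at about 4 bits as the first thing to try.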

boppo1 10 hours ago

Have you tried qwen 27b q4_K_XL? It's a little bigger than the 4080's VRAM, but not by much.