Hacker News

Google is singlehandedly carrying western open source models. Gemma 4 31B is fantastic.

However, it is a little painful to try to fit the best possible version into 24GB of VRAM alongside vision and, soon, this drafter. My build doesn't support any more GPUs, and I believe I would want another 4090 (overpriced) for best performance, or otherwise just replace the build altogether.
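For a rough sense of the squeeze, here is a back-of-the-envelope sketch of the memory taken by the weights alone. The bits-per-weight values are illustrative assumptions rather than figures for any specific quantization format, and the estimate ignores the KV cache, the vision projector, and the drafter, all of which need room on top.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in decimal GB: params * bits / 8 bytes."""
    return params * bits_per_weight / 8 / 1e9

# 31B parameters, as in the comment above; bit widths are assumed examples.
PARAMS = 31e9
VRAM_GB = 24

for bits in (4, 6, 8):
    size = weight_gb(PARAMS, bits)
    headroom = VRAM_GB - size
    print(f"{bits}-bit: ~{size:.1f} GB weights, {headroom:+.1f} GB left of {VRAM_GB} GB")
```

Under these assumptions only the lowest bit widths leave meaningful headroom on a 24GB card, which is why adding a projector and a drafter gets tight.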

srigi 18 hours ago

You could keep the multimodal projector (understanding of audio, images & PDFs) in system RAM with `--no-mmproj-offload` in llama.cpp. Of course, it is then not GPU-accelerated, but you save its VRAM.
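A minimal sketch of what such an invocation might look like, built as an argument list in Python. Only `--no-mmproj-offload` comes from the comment above; the `llama-server` binary name and the `-m`, `--mmproj`, and `-ngl` options are my assumption of the usual llama.cpp options, and the file names are placeholders. Check `llama-server --help` on your build before relying on it.

```python
import shlex

def build_llama_server_cmd(model_path, mmproj_path, gpu_layers=99, offload_mmproj=False):
    """Build a llama-server command line (assumed flags, see note above)."""
    cmd = [
        "llama-server",
        "-m", model_path,            # main model weights (GGUF)
        "--mmproj", mmproj_path,     # multimodal projector file
        "-ngl", str(gpu_layers),     # number of model layers to offload to the GPU
    ]
    if not offload_mmproj:
        # Keep the projector in system RAM instead of VRAM.
        cmd.append("--no-mmproj-offload")
    return cmd

# Placeholder file names, purely for illustration.
cmd = build_llama_server_cmd("model.gguf", "mmproj.gguf")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` as is.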

msp26 6 hours ago

Interesting, I might try that, thanks!

ActorNightly 20 hours ago

Qwen is still better than Gemma though. Also, you can tune it more for different tasks, which means that you can prioritize thinking and accuracy versus inference speed.

SwellJoe 18 hours ago

Qwen is better at some things (code, in particular), but Gemma has better prose and better vision. At least, it feels that way to me.

zobzu 18 hours ago

gemma is also just way faster. i don't wanna wait 10min to get a 5-10% better answer (and sometimes an actually worse answer).

best is to use your own model router atm, depending on the task
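A minimal sketch of what a task-based router could look like, assuming a plain lookup table is enough. The task labels and model names are hypothetical placeholders, and the mapping only mirrors opinions voiced in this thread (Qwen for code, Gemma for prose and vision); it is not a benchmark result.

```python
# Hypothetical task -> model table; adjust to whatever you actually run locally.
ROUTES = {
    "code": "qwen",
    "prose": "gemma",
    "vision": "gemma",
}
DEFAULT_MODEL = "gemma"  # assumed fallback when the task is unknown

def route(task: str) -> str:
    """Return the model name to use for a task label, falling back to the default."""
    return ROUTES.get(task.strip().lower(), DEFAULT_MODEL)

for task in ("code", "Vision", "summarize"):
    print(task, "->", route(task))
```

A real router would more likely classify the prompt itself rather than take a label, but the dispatch step stays the same.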

SwellJoe 18 hours ago

I'm pretty sure Qwen is faster? The MoE version of Qwen is 3B active, while Gemma 4 is 4B active. Similarly, the dense Qwen is 27B while Gemma is 31B. All else being equal (though I know all else isn't equal), Qwen should be faster in both cases. I haven't actually measured with any precision, but on my AMD hardware (Strix Halo or dual Radeon Pro V620) they seem quite similar in both cases. Both MoE models are fast enough for interactive use; both dense models are notably smarter but much slower, with a long time to first response and single-digit tokens per second once they start talking.
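The "all else being equal" argument can be sketched with a crude memory-bandwidth bound: if decoding is limited by reading the active weights once per token, tokens per second is at most bandwidth divided by the bytes of active weights. The bandwidth figure and the 4-bit weights below are assumptions for illustration only; real throughput also depends on the KV cache, kernels, and prompt processing.

```python
def decode_tps_bound(active_params: float, bytes_per_param: float, bandwidth_bytes_s: float) -> float:
    """Upper bound on decode tokens/s if every active weight is read once per token."""
    return bandwidth_bytes_s / (active_params * bytes_per_param)

BANDWIDTH = 256e9      # assumed memory bandwidth in bytes/s, not a measured value
BYTES_PER_PARAM = 0.5  # assumed 4-bit weights

# Active parameter counts as quoted in the comment above.
models = {
    "Qwen MoE (3B active)": 3e9,
    "Gemma MoE (4B active)": 4e9,
    "Qwen dense (27B)": 27e9,
    "Gemma dense (31B)": 31e9,
}

for name, active in models.items():
    print(f"{name}: <= {decode_tps_bound(active, BYTES_PER_PARAM, BANDWIDTH):.0f} tok/s")

# The ratio between two models is independent of the assumed bandwidth.
moe_ratio = decode_tps_bound(3e9, BYTES_PER_PARAM, BANDWIDTH) / decode_tps_bound(4e9, BYTES_PER_PARAM, BANDWIDTH)
dense_ratio = decode_tps_bound(27e9, BYTES_PER_PARAM, BANDWIDTH) / decode_tps_bound(31e9, BYTES_PER_PARAM, BANDWIDTH)
print(f"MoE ratio ~{moe_ratio:.2f}x, dense ratio ~{dense_ratio:.2f}x")
```

Under this model the expected gap is only the ratio of active parameters (4/3 for the MoE pair, 31/27 for the dense pair), which is small enough that the two could easily feel similar in practice.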

vparseval 13 hours ago

qwen-3.6 is really interesting. The dense 27B model is pretty slow for me, whereas the sparse 31B is blazingly fast, but it also needs to be since it's so chatty. It produces pages and pages of stream-of-consciousness stuff. 27B does this to a lesser extent, but slowly enough that I can actually read it, whereas 31B just blasts by.

I haven't yet compared either to Gemma 4. I tried it the day after it came out, with the patched llama.cpp that added support for it, but I couldn't make tool calling work, so it was kind of useless. I should try again to see if things have changed, but judging by what people say, qwen-3.6 seems stronger for coding anyway.

ctbellmar a few seconds ago

I had the same experience with 31B. Runs well on 4090 too!

Craighead 14 hours ago

I'm using both incessantly and having a great time.

MikeTheGreat 17 hours ago

Genuine question: how do you tune it?

I thought "fine-tuning" meant training it on additional data to add additional facts / knowledge? I might be mistaking your use of the word "tune", though :)

dr_kiszonka 8 hours ago

You can fine-tune relatively easily in Unsloth Studio.

redman25 18 hours ago

It’s a heck of a lot faster too.

2ndorderthought 19 hours ago

Yes I would just go with qwen.