> I have come to consider Gemma 4 31b the best model I can self-host

I'm confused. Your own results show that Gemma 4 26B A4B and Qwen3.6-27B did better in these tests?

I really like Gemma 4 31B, especially with how exceptionally good its MTP drafter is, but it is absurdly weak at tool calling and instruction following in my testing, and its smaller siblings are even worse at this. If the system prompt says to do something, Gemma 4 31B will very often ignore that entirely. It will also make fewer tool calls than were needed to solve a problem, so then it fails. The Qwen3.6 series is much, much more reliable for carrying out instructions and doing agentic tasks in my testing, although they can get stuck in loops.

There is a lot of potential in the Gemma 4 series, but I think Google needs to release a Gemma 4.1 update to polish the rough edges. Unfortunately, if Gemma 3's lifecycle is any indication, Google won't release a true revision of the Gemma 4 models, even if they release a bunch of specialized research models based on Gemma 4 over the next year.

I have done replication tests of Qwen and the Gemma models. The Qwen benchmarks are published: https://swelljoe.com/post/qwen-quantization-degradation/ . (Though, I still want to add the other three cases to that one. I was mostly testing quantization effects in that test, but it also served as a replication test of Qwen in finding bugs.)

The Gemma 4 replication tests are not published, yet, but Gemma 4 31B consistently performs the best of all of them. Note Gemma 4 31b has two "partials" on the big benchmark, which means it found a bug in the right place but the judge didn't think it understood the bug, those are probably unfairly judged "wrong bug" by Opus. It consistently finds four of nine, and sometimes finds two others, making Gemma 4 31b the best model I've tested. But, I suspect the big models would do even better if giving multiple attempts, as I did for Gemma 4. You can see the report of that here, note 31b finds six(!) of nine bugs if given a couple of attempts (MoE does much worse than the dense model, it may degrade more due to quantization, I'm still experimenting): https://swelljoe.com/html/gemma-promptlab-report.html

The "partial" score thing is kinda tricky, but it's actually quite rare for a model to find the right place but describe the bug in a way that Opus considers it to be the wrong bug. So, I'm inclined to give Gemma 4 full credit for those finds. When I read its bug report, it's clear that you'd fix the problem Gemma describes the same way as you would if given Opus' description of the problem, even if the mechanism of exploit is different. That, to me, is a hit. Opus called it the wrong bug.

And, yeah, a more powerful Gemma would be great. I'd love a double-sized Gemma 4 MoE (something like 70B A8B maybe, or even 122B A12B). I think that'd make self-hosted models feasible for a lot of tasks. It'd run comfortably on a 128GB machine, and if it's some reasonable amount smarter than the 31B, it'd be a real beast.