Okay so if this model is half a year behind, so let’s say January opus pre-nerf, this is it.

Inference is actually quite cheap for token costs, the frontier labs burn most of their money on training new models, priced into their token costs ontop of some margins and paying record salaries. So if this goes open, distills are tried out, independent providers around the world host it with actual price competition, the house of cards for anthropic collapses pre-ipo. The floor is opus (open models caught up), the current ceiling is Mythos (self inflicted ban due to the safety bullshit theater), and no way out.

It’s really comical I think it’s even the same guy that warned about gpt2 being too dangerous to release, well that mindset seems to now doing existential harm to anthropic, while the rest of the world essentially laughs and progresses anyway.

Quit my Claude pro subscription last week and purchased credits for an API inference provider. I think I might even end up saving money, since I really don’t use AI that much, and I actually found that gemma4:31b is fine for most of my non-coding inquiries.

Gemma is amazing with tools for anything that is not crazy complex. I think a lot of people have a wrong perception of it because Google's new prompt format broke implementations like llama.cpp and it took quite a while to get everything sorted. But even the tiny variants running on edge devices are surprisingly capable when used right.

The frontier will probably keep moving for a while, but it will be increasingly disconnected from normal human use. In the future, if you're not trying to solve a research level math problem, you'll probably do it locally and fully privately. Which also means the payday when they will fundamentally no longer be able to reach a billion users with frontier models will come soon for the labs. Even if they do get their IPO out, it will probably crash and burn at current valuations.

Do you guys actually work with these models?

I have to use GPT 5.4 Mini at work. It benchmarks higher than that Gemma 4 model.

In my experience it's next to useless. It cannot even move 20 existing lines of code from A to B without breaking them half of the time.

If you tell it to look something up in your dependencies, it's 50/50 on whether the answer is correct, incorrect, or it simply didn't perform the search at all.

I find it next to useless, and I'm mostly better off doing the work manually.

It's a night and day difference to even Sonnet, not to mention the SOTA.

“Moving lines of code” is a very peculiar eval tbh. I’ve never used Gemma for agentic tasks, but did have it write code, including multi-turn, and I was very positively surprised how well it performed.

>It benchmarks higher than that Gemma 4 model.

Depends on what you look at. Gemma 4 31B without reasoning benchmarks significantly higher than GPT-5.4 without reasoning on artificial analysis. Even the new Gemma 4 12B beats it. And while GPT-5.4 with xhigh reasoning beats the reasoning version of Gemma 4 31B, the question is why you would throw such a complicated task that needs so much reasoning at such a small model to begin with. So if you do coding, you'll probably not have much success with either model. But for actual simple tasks that these models were made for, they are extremely capable. E.g. hook it up to the Atlassian MCP and have it do all the stuff that is supplemental to coding in big enterprises.

Counter: I use 5.4 mini all time for coding. No trouble letting it implement features. Entire new screens, APIs and various components.

It ain’t the best for sure, but if you have trouble letting it move 20 lines I don’t know what’s the cause but that’s not my experience at all. I do make pretty extensive use of guardrails and proper instructions in my AGENTS.md.

I also value super boring code bases with an as much as possible uniform shape. I guess that’s also helping out.

Like I said in my original comment, it’s fine for non-coding tasks, meaning I primarily use it to answer questions

Cursor 2.5 is essentially kimi and I find it eminently usable.

i use for tasks like object recognition in my family photos and cooking videos . seems to be fine

[dead]

Got a link to that API inference provider?

Just look up OpenRouter, OpenCode Go/Zen, Together, Fireworks, Cerebras, etc.

DeepSeek Platform API is worth checking out too, due to their insanely good caching and token costs.

I use DeepSeek via OpenRouter, the caching seems to work there too, you just need to force it to use DeepSeek as a provider otherwise it picks a random one every time. (You can pass a provider option in the call, or better, create a preset in your account.)

I'm Ollama Cloud which has a coding plan style model but without restrictions on the harness or direct API calls from your code.

I use novita ai

Gpt2 was too dangerous to release. We just don't see it yet.

Sure, the model itself was harmless, but it lit the fuse

Actually many of us do see that, and have been saying so for some time now.

I worked in this field since long before LLMs. Nobody outside of the field really cared about GPT2, and even insiders knew the "too dangerous" part was a PR gag at best and the first dig of the moat at worst. After all, they released smaller versions of it along with detailed instructions on training it in the paper, so anyone with a lot of compute and a bunch of internet scrapers could try to recreate it. But basically noone did, even though it would have only cost ~50k back then (and less than 3k today). A few normal users started to take notice with GPT 3, but even then it was super limited. Even instructGPT didn't cause real shockwaves, despite being very close to the final product. Only ChatGPT/3.5 finally lit the fuse and people suddenly cared about having this too.

Since we’re doing anecdotes I definitely agree GPT2 lit the fuse. It woke up a sizable chunk of people paying attention. GPT3 is when I and many others got into a full blown existential crisis - it was the bang after the fuse. Then we got a long tail of laggards and people without vision. Even today you can find a significant chunk of folks in denial still.

fair point

Is it going to actually be open source or just open weights? I'm looking forward to trying this with opencode regardless!