GLM 5.2 is a great model, but if you only want to use the best model available, it isn't there yet. Every lab releases models that memorize benchmark answers, both intentionally and unintentionally. But we consistently find that models from Chinese labs have a wider gap between public benchmarks and our evaluations, which we designed to be less vulnerable to benchmaxxing.
In multi-agent coding environments, GLM 5.2 is just shy of Opus 4.6 on average. Data at https://gertlabs.com/rankings
But when factoring in performance/cost, GLM 5.2 is the frontier model.
> but if you only want to use the best model available, it isn't there yet
I'm trying to wrap my head around exactly why so many people seem to want the best model available when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price. And the frontier models get nerfed constantly, so with open weights you can get something slightly less performant but way more stable. Almost like buying a Ferrari for your daily commute instead of a Toyota or even a Mercedes.
I think there are several factors. Certainly marketing making us think we need the shiny thing, which is rampant online and which very smart people think they aren't susceptible to. There's a lot of really odd 'I trust Anthropic/OpenAI more than Deepseek' which tends to ignore, for starters, that you can choose your provider and still save a ton. I also think there's some amount of addiction and brand loyalty, where a Ferrari is one hell of a drive so you turn your nose up at that sensible Toyota. Oh, the other one I see used is like 'oh, only Fable can oneshot updating my embedded systems thing from 1975 to Rust', which is great, but let's recognize how niche that is.
And it ends up just coming across as people are getting SO reliant on the tools so fast. Maybe it's ok to think and like read a few lines of code and work with these agents to convert your thing to rust or center your div. Even if coding is over which in some sense it certainly is, don't turn your mind into the wall-e people yet. I found myself guilty of this so often. It takes way more time and effort to do things via prompt and I wouldn't just open the editor and fix it because that dopamine hit of the magic the abstraction provided was so strong.
So I'm pretty much done using the 'best' (on benchmarks, if money isn't an object, etc etc) models available. After a year on Sonnet/Opus/GPT5x I'm having way better results with open weights models that don't get lobotomized weekly. I'm finding ways to do the crafting part of building software by focusing on honing my harness and workflow. I'm enjoying changing the oil on my Toyota after a year of almost flying off cliffs in my Ferrari and if I can check my ego it's a purely positive thing.
> I'm trying to wrap my head around exactly why so many people seem to want the best model available
This is the logical end point of the fear-based way LLMs are marketed. You must want the best, because everyone who has the best can work faster than you, generate more — if you don't have the best, you are behind! Why would you want to use anything other than the best?
The thing is, once everyone has the best, the question is: how much can you spend? If you can't spend more, you are behind! If spending the most will get you ahead, why would you not want to spend the most, if you can afford it?
There is only one way through this, in the long run: work out a way forward that doesn't make you dependent on this cycle. If you can compete at all, without the spend, what happens is: they burn money and you don't.
FWIW so far I don't think the benchmarks prove very much about the actual experience, and you can discover this just as easily without spending any money. And we know this about benchmarks! Once a benchmark seems useful as a measurement, it becomes a target and it stops being as useful.
I think your strategy is right. It requires bravery, and as you say, it requires ego balance. But I believe it is obvious that the world will either come around to a more sensible, stable pattern or it doesn't matter either way because we're fucked. So opting out of this mad early cycle and choosing to be calmer and happier is a choice you can just make.
> most halfway decent models can write damn good code for a fraction of the price.
The difference is how the model is used.
With Opus you can give it a long-horizon task (eg build an entire feature) and it will plan it out and implement it and almost always stay on task. This is what people mean when they say "agentic tasks"
With the lesser models the code is fine, but they need something else to plan what needs to be done.
GLM-5.2 is the third model (after Opus 4.6+ and GPT-5.5) that can do this agentic style work.
Notably Gemini 3.1 Pro is notoriously bad at this style work - the code is good, but it drifts off task most of the time. 3.5 Flash is supposed to address this, but I haven't had a good reason to try it.
My whole point is that I don't want it to build an entire feature from one prompt. At most, I want to work with an agent to nail down the spec and then work with an agent that orchestrates the implementation via other agents, same for testing, etc. None of that requires frontier capabilities, it requires a little bit of work on a harness, a little bit more of my input, a little more of my brainpower. I _want_ to build tools that make it work better and don't change when the CC team gins up some default for their harness and foists it on me. I don't see that as a tradeoff at all and I think engaging in my work process more than fire and forget (and literally always in my experience fix stuff later) is more fun and rewarding once the 'holy shit this is now possible' high wears off. Doubly so once the frontier model gets nerfed mid-cycle and now I have to undo the mess because they released v*.x++ and I fell for it again by trusting it to do these agentic tasks without my involvement.
> My whole point is that I don't want it to build an entire feature from one prompt
You are free to do you. But you were asking about why others want the best model.
The answer is, clearly, agentic coding (ie multiple agents each cranking through tasks independently) lets you ship A LOT more business value if used correctly.
Yep. I've tried to use the models to build large things for me. You can't trust the code it produces. Even if it works there are parts that are hot garbage, and will bite you later on. I've found out that having an editor open, asking it to implement things until a certain point, manually fixing some of the worst things it generates, then asking it to expand from there is much better than just prompting a thing and pushing to production.
And hey, don't get me wrong, you can get pretty far with just prompting. But the subtle misses and (I'm looking at you GPT) the overengineered 20k line PRs to do a simple thing are going to cost you a lot if you're not vigilant.
> My whole point is that I don't want it to build an entire feature from one prompt. At most, I want to work with an agent to nail down the spec and then work with an agent that orchestrates the implementation via other agents, same for testing, etc. None of that requires frontier capabilities
I don't think anyone is stopping you. This is an entirely valid way of working.
I for one am glad to leave that behind me. The sooner I never have to write another line of code the better (professional software engineer for nearly 30 years here, for context).
I don't know about you guys, but half of the time I give Opus something actually complicated, it spends 50+ minutes trying to understand the problem, running lots of searches and tool calls, and then gives up and just writes a brief summary of what it thought about. Biggest waste of tokens you can imagine.
I would say 3.5 Flash is great if you use a good open harness. I use omp for that. The thing with Google is that they announce they have a great model, and that they have been testing it internally for half a year. I guess they don't care too much about who uses it or how.
I am still struggling with how to deal with sub agents and different roles for each model. I still think Claude or Codex are overall better models, but everything around them gives off such weird vibes, including, and this one kills me, that at certain times they feel dumbed down.
I keep changing these things often, but I have a basic subscription to Codex ($20 plan), which I use with GLM 5.2 to do some high-level planning of what I intend to do, and then let DeepSeek do the coding. Or something along those lines.
Point is, GLM 5.2 is now at a point where I cannot tell you if it's better or worse. I can tell you however one thing: no matter when I use it, it's consistent in what it does and how it works.
Then there is the Fable thing, but as with many things, I think the past has distorted the reality. It lasted two days, but Anthropic said it clearly for plan users it would only be there for two weeks. It was great for doing what you can already do with other tools: doing all the planning, and reviews, and launching a million subagents talking to each other. I sometimes wonder if it was really a new model, or just Opus 4.9 wrapped with some fancy model driven harness.
Big fan of Amp but pretty sure it only uses Flash for search: https://ampcode.com/models
As for Fable: I used it as much as I could while we had it.
It was a step change over Opus with my work.
Or maybe it was supposed to be the OMP (OhMy Pi) harness. Pi can do just about anything for you. Use most models in most ways possible. You just tell Pi what you want, and it builds an extension for itself.
> With Opus you can give it a long-horizon task (eg build an entire feature) and it will plan it out and implement it and almost always stay on task. This is what people mean when they say "agentic tasks"
I've had no trouble getting the current generation of smaller models to do the same thing. Maybe it's more of a harness issue than a model issue?
Recently I've used both MiniMax M3 and DeepSeek V4 Flash to one-shot moderately complex applications from a written spec, and neither one got lost along the way
> 3.5 Flash is supposed to address this, but I haven't had a good reason to try it.
Price and speed, for me. GLM5.2 is "good enough" for some tasks, but rather slow (on their coding plan). In the time it takes GLM to "read files to figure out...", gemini flash is usually finished. It's not SotA for coding, but it's fast and often "good enough" for normal tasks.
> Price and speed, for me.
For Flash 3.5?
I'm a big fan of Gemini 3.1 Flash Lite Preview (yes that is the name..).
I keep an agentic SQL benchmark up to date to test new models. It's more-or-less saturated above 23/25, but below that it is still useful, and even at that level it is good for comparing speed, cost and token efficiency.
3.1 Flash Lite Preview scores 22/25 in 142 seconds for $0.02. That's a great result if you care about cost for performance.
3.5 Flash scores 20/25 in 367 seconds for $0.76. The slow speed is because it takes a lot of tokens to generate its results, so even if tokens are produced quickly it takes too many to get a positive result.
There's nothing I've seen or heard that indicates 3.5 Flash is better than this indicates.
https://sql-benchmark.nicklothian.com/?highlight=google_gemi.... vs https://sql-benchmark.nicklothian.com/?highlight=google_gemi... (click the cells to see the traces)
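For anyone who wants the cost-for-performance comparison spelled out, here is a rough sketch that uses only the two results quoted above. The per-point metrics and the `Run` class are illustrative choices made for this example, not something the benchmark itself reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    score: int       # tasks solved out of 25, as quoted in the comment
    seconds: float   # wall-clock time for the whole run
    dollars: float   # total cost for the whole run

    @property
    def dollars_per_point(self) -> float:
        return self.dollars / self.score

    @property
    def seconds_per_point(self) -> float:
        return self.seconds / self.score


flash_lite = Run("3.1 Flash Lite Preview", score=22, seconds=142, dollars=0.02)
flash = Run("3.5 Flash", score=20, seconds=367, dollars=0.76)

for run in (flash_lite, flash):
    print(f"{run.name}: ${run.dollars_per_point:.4f}/point, "
          f"{run.seconds_per_point:.1f}s/point")

# How much more each solved task costs (in money and time) on 3.5 Flash.
cost_ratio = flash.dollars_per_point / flash_lite.dollars_per_point
time_ratio = flash.seconds_per_point / flash_lite.seconds_per_point
print(f"cost ratio: {cost_ratio:.1f}x, time ratio: {time_ratio:.1f}x")
```

On these two data points alone, each solved task comes out roughly 40 times more expensive and a bit under 3 times slower on 3.5 Flash. Two runs are a thin basis, so treat the ratios as indicative only.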
Yeah, the funniest thing about everyone freaking out about Fable's capabilities recently was that for most of the stuff they were amazed by, you could get roughly the same result from DeepSeek Flash.
I used to be obsessed with what's the best model. Then a while back when the new best model came out, I tested it on a task. I also tested its little brother (much smaller model from same company).
They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...
"Best model" discourse always reminds me of my days in Monster Hunter with people who refused to consider playing with anything other than the meta set for their weapon and then proceeded to immediately cart right at the beginning of the hunt :)
With the wealth of models available (open source vs closed, api vs local), I find optimizing the cost-efficiency of your token consumption an important part of business-oriented AI engineering. You don't need "the best" for every task.
A lot of the monetization strategies for LLMs depend on the need to use them via SaaS subscriptions. If companies start to realize that local AI is cheaper, provides good enough results and makes them independent, then that monetization strategy falls apart and a whole industry collapses.
> They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...
Same for me, I certainly don't have the same definition of success and failure either.
A more expensive model has *less* room for wandering around than a cheaper model.
If Claude wanders around during 10min until finding the most obvious solution, then I count it as a failure.
I'm using DeepSeek v4 Flash through OpenCode and OpenRouter, and it works just fine. It's not the bottleneck, I am, for what I'm building. That involves understanding the problem I'm solving and checking correctness.
Meanwhile, it's such a cheap model that I've spent not even $25 over 3 weeks.
I would say one thing I've enjoyed about the latest frontier models from US labs is that you just work at a higher level of abstraction. You can talk about the end goal and it'll just rip. You'll add scaffolding to constrain the patterns etc, but I do way less baby sitting than I expected on 5.6 vs 5.4 vs Deepseek v4 Pro.
Reason people want the best: people want to believe their project is so advanced that they need the most clever LLM possible. To say otherwise is to admit that it's not really frontier or novel in any way. And people don't like that.
I’m writing a lot of React code and find that the cheaper models are pretty terrible. Maybe I’m holding it wrong, but the claim that the cheaper model is usually enough just doesn’t track with my experience. Worse, I find predicting the difficulty of tasks exceedingly difficult. More often than not, starting with the cheaper models requires me to reroll with a more expensive one or waste a lot of time and tokens cleaning up the subpar results. With OpenAI and Anthropic still subsidizing tokens, not using the best models still seems like a tough ask.
What happens when you find the models are terrible? The claimed results don't match? My dev cycle tends to be write a test for blah blah, add feature to satisfy test, make sure tests pass.
For math, even the frontier has shortcomings, and there is a steep drop from GPT 5.5 xhigh to anything else. The time wasted by less-than-SotA just isn't worth it.
I've landed in a similar place by reducing effort and cutting up tasks. I find that more exacting specifications for the models yield significantly less need for "effort". Combining this with multiple git worktrees and an integration branch for the current worktrees themselves has yielded incredible results.
This also allows me to play with, and mix codex, claude cli, and others. This is my happy spot for the last two months.
Yeah, this sounds close to my workflow and it's good to hear you've found a nice flow too! It frees me up to spend that effort on doing more things in parallel and focusing way more on the specs, which is usually a good idea anyway.
Because not every problem is a coding problem or not entirely solvable through code. Other tasks include legal, philosophical, financial, investigative, and combinations of these and others.
It doesn't look like that's where the conversation was going, though.
> I'm trying to wrap my head around exactly why so many people seem to want the best model available
To me this is a "more expectations mean more disappointment" situation.
Some people have higher expectations than others, and even the best model available is not good enough for what those people really want it to do once you start digging. In that light, the goal is not using the best model, but rather using the least insidiously deficient model.
Many people chase the edge because it's the least disappointing.
> when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price.
The fatuousness of this statement pretty quickly becomes apparent if you spend more time looking at it, IMO, because the venn diagram of "damn good" and "not nearly good enough" strongly overlaps. Even the best model writing excellent lines of code still has noticeably deficient ability to decide which excellent lines of code to write. The goal is to improve the separation between them, not save a few dollars, because the emotional effort is worth more to us than the money.
> And the frontier models get nerfed constantly so you with open weight you can get something slightly less performant but way more stable.
Your minimization of performance differences and maximization of stability differences is exposing your biases.
Side note: I think you should know that to me at least some of what you said reads like self-rationalized moralizing. I couldn't help but imagine Principal Skinner saying "Am I so out of touch? No, it's the children who are wrong." People don't only want different things than you do because they don't know what they're doing.
I don't drive the best car available on the market. I don't own the fastest and best PC/Laptop/Smartphones available. I don't live in the best house in my city. I made reasonable choices that balance my needs and my available budget.
I think people are grouping into two flows.
One group is trying to get the LLM to basically one shot everything and not properly reviewing the output.
Others are using the LLM to assist their human intelligence in a tight loop.
If you’re doing the former you really do need the best model available because that’s still right on the edge of what LLMs can do at best, and at worst you’re just shipping pure unmaintainable slop.
If you’re doing the latter then you can get away with a slightly less powerful model without it making a material difference because your human intelligence is filling in gaps
The latter takes too many mental resources, and the same goes for truly reviewing the code generated by the former.
I generally started by reviewing but after a while (maximum in hours), I just can't keep up and resort to LLMs as sole reviewers.
not many want to admit this
Well put. I belong to the latter group as I feed small, granular tasks that I describe thoroughly to the LLM. I tried, however, to just give it a bigger scope task. Even best models produce sloppy code.
While the single functions/classes/structs/... can be well thought out, the code tends to lack cohesion, and especially maintainability. For instance, it never thinks: "I could put this logic in an interface/trait so that if the requirements change I can simply add a concrete implementation that satisfies the new requirements (and potentially use one of these for testing)".
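To illustrate the kind of design move being described, here is a minimal sketch using Python's `typing.Protocol` as the "interface/trait". Every name in it (`PriceRule`, `FlatDiscount`, `NoDiscount`, `checkout_total`) is hypothetical and invented for this example.

```python
from typing import Protocol


class PriceRule(Protocol):
    """Anything that can turn a base price into a final price."""

    def apply(self, base: float) -> float: ...


class FlatDiscount:
    """One concrete rule: subtract a fixed amount, never going below zero."""

    def __init__(self, amount: float) -> None:
        self.amount = amount

    def apply(self, base: float) -> float:
        return max(0.0, base - self.amount)


class NoDiscount:
    """Trivial implementation, handy as a stand-in during tests."""

    def apply(self, base: float) -> float:
        return base


def checkout_total(prices: list[float], rule: PriceRule) -> float:
    # Depends only on the interface, so a new rule needs no changes here.
    return sum(rule.apply(p) for p in prices)


print(checkout_total([10.0, 5.0], FlatDiscount(2.0)))
print(checkout_total([10.0, 5.0], NoDiscount()))
```

If the requirements change, a new class with an `apply` method slots in without touching `checkout_total`, which is the maintainability property the comment says generated code tends to miss.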
Yes that's also my experience.
SoTA models can do reasonably good jobs on each ticket, but over time the architecture of the application starts degrading without a human in the loop.
The entropy increases slower with better models but the trend is always towards slop
>why so many people seem to want the best model available
In my case, I rarely ever go over the Claude/ChatGPT subscription limits, so might as well use those considered-best models. If I had to generate millions of lines of code, maybe I would've used the open models more.
I agree, but there are use cases for the 'best model' other than converting your 1975 stuff to rust: for use cases where LLMs are just getting started to be useful I really want to use the current 'best' model: e.g. CAD, PCB design etc. In particular anything which requires spatial reasoning. The short time I had access to Fable 5 - it was just way better than any other model.
Except that there is no application for AI in CAD that is better, more appropriate, more robust or more sensible than learning how to use a CAD package and doing it yourself.
It's not fast-changing, it's not abstract, it's just not that difficult, and where it is difficult, the AI cannot help you, because it is not capable of things you are capable of.
Learn CAD yourself. Honestly; I was sure I would never manage to learn CAD but it turns out to be interesting, rewarding, valuable and actually quite quick to learn.
An LLM certainly is not going to be able to do it better than you once you have a tiny bit of experience. (PCB design, perhaps, has a language to it that an LLM can make a bit more headway into, but as a non-PCB-designer I would still bet that it's more like CAD than code)
This is a refreshing perspective because recently I feel like I’m surrounded by people who think they can effectively implement complex software, just by hammering the best models.
It has been hard to explain that they are in fact just creating toy versions and there is no way they can do it without learning the underlying architecture. But they just keep going, wasting 100s of dollars, lost in a sea of bugs.
Until a few years ago I'd have been the person who thought you could make a text-to-CAD system scale up to all of it. And then I tried to make stuff I wanted.
Dabbled with OpenSCAD, as we all do. I decided to learn FreeCAD and what I discovered is that, even putting aside FreeCAD's many documented issues, parametric GUI CAD is not an imprecise, clumsy or fiddly way to work.
It is expressive, precise, generally capable of all the things that code-CAD can do and much more, and it's much, much quicker to work in, once you've learned a few core principles.
As you say, there is an underlying architecture; it's not just a sort of 3D paint package.
The problems the text-as-whatever crowd have are all Dunning-Kruger things in the truest sense.
People who are unaware they are unskilled in a particular technology are unlikely to successfully replace it with another. Particularly one that requires describing the problem domain in precise language.
Quite often when you see text-to-CAD discussions, especially here, there's evidence of profound misunderstandings from the people who think they are going to automate it. They assume their frustrations with the tools stem from limitations of the tools, not from the limits of their understanding.
As a person with decades of experience of code I have found learning how to use LLMs effectively to be much, much harder than learning CAD.
For me, the €20/month subscriptions were always sufficient, and it's nice if that subscription gives the latest and greatest results.
It depends. Claude’s $20 plan is kind of a mess.
It's also geeks and engineers using these models and being the most vocal. We always think we're special and need the extra horsepower. Ever been on one of those home lab subreddits? Same story.
> I'm trying to wrap my head around exactly why so many people seem to want the best model available
I've been programming since I was a kid. I enjoy it a lot, I like knowing how things work, I get excited about new compiler features, I stayed up every night for a week when I discovered Lean 4, etc etc etc.
At the same time I realized a few years ago that I just don't want to write any code ever. Or read any code. Coding is addictive and fun, but I'd rather talk to the computer and have things magically get done. (FWIW learning how to use LLMs feels more.. fulfilling, too)
Anyway. GLM 5.2 is nice and all, but I might have to spend half an hour guiding it to come up with a plan I'm happy with. And with Opus it could be 15 minutes. I'm still going to spend an hour talking to LLMs one way or the other, but with Opus it will be a less frustrating hour. If Fable gives me a frustration-free hour, I'll switch to Fable.
>> I'm trying to wrap my head around exactly why so many people seem to want the best model available when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price.
The reason is pretty simple and has to do with statistics: on long-horizon tasks, small errors and deviations from the "good path" compound.
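A back-of-the-envelope version of that argument, as a sketch: assume each step of a task succeeds independently with the same probability, so the chance of a fully clean run is that probability raised to the number of steps. Both the independence assumption and the example probabilities (0.99 and 0.95) are illustrative, not measurements of any model.

```python
def chance_of_clean_run(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Compare a slightly better and a slightly worse per-step success rate
# as the task gets longer.
for p in (0.99, 0.95):
    for n in (10, 50, 200):
        print(f"p={p:.2f}, steps={n:>3}: "
              f"{chance_of_clean_run(p, n):.4f} chance of no mistakes")
```

Under these assumptions, a gap of four percentage points per step is barely visible on a 10-step task but becomes the difference between "usually finishes" and "almost never finishes" at 50 steps and beyond. Real agents can notice and repair mistakes, so this overstates the effect, but it shows why small per-step differences matter more as the horizon grows.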
What is your favorite harness for the open weights?
We built our own and aren't done open sourcing it but before that I got to a really good place with opencode plus some custom agents, pi family is good too although I haven't used it as much. We made an agent to design a spec, one to implement by dispatching subagents, one to validate against the plan, things like that. All of this helps claude/gpt too IME. For open models it has helped them stay out of loops (e.g. Kimi's but WAIT) and for frontier it helps them stay on task and not invent bloated patterns
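For readers wondering what a spec / implement / validate split might look like in code, here is a very rough sketch. It is not the commenter's harness: the function names, the "one subtask per spec line" rule, the PASS/FAIL convention and the stub agents are all assumptions made for illustration, and a real harness would call a model API where the stubs are.

```python
from dataclasses import dataclass
from typing import Callable

# An "agent" here is just text in, text out (hypothetical simplification).
Agent = Callable[[str], str]


@dataclass
class Result:
    spec: str
    outputs: list[str]
    approved: bool


def run_pipeline(request: str, spec_agent: Agent, worker_agent: Agent,
                 validator_agent: Agent, max_rounds: int = 2) -> Result:
    spec = spec_agent(request)
    # Assumption: each non-empty spec line is one subtask for a subagent.
    tasks = [line.strip() for line in spec.splitlines() if line.strip()]
    outputs: list[str] = []
    approved = False
    # Bounded retries so a confused model cannot loop forever.
    for _ in range(max_rounds):
        outputs = [worker_agent(task) for task in tasks]
        verdict = validator_agent(spec + "\n---\n" + "\n".join(outputs))
        approved = verdict.strip().upper().startswith("PASS")
        if approved:
            break
    return Result(spec, outputs, approved)


# Stub agents standing in for real model calls.
def fake_spec(request: str) -> str:
    return "add endpoint\nadd test"


def fake_worker(task: str) -> str:
    return f"done: {task}"


def fake_validator(report: str) -> str:
    return "PASS" if report.count("done:") == 2 else "FAIL"


result = run_pipeline("build a health check", fake_spec, fake_worker, fake_validator)
print(result.approved, result.outputs)
```

The point of the structure is that each role sees a narrow input, and the retry cap lives in ordinary code rather than in a prompt.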
pi is great for learning, oh-my-pi has all the nice things included that I've built for pi previously.
pi-mono
What is pi-mono? (I heard about pi)
Of course people want the best model available, even at 10x costs, if they are not paying for it. If the company is paying, why wouldn't you want a 2% better model?
That changes as soon as the developer is the one paying for a model. Then it's a classical engineering trade-off between money and quality, and that's where open models are clear winners.
> most halfway decent models can write damn good code for a fraction of the price
The problem isn't what they do in a blank state. It is how they get there and the edge cases. Some models also take longer (use more steps), i.e. end up costing more despite being "cheaper".
I've seen models:
- Back out plans non-stop. Tried the obvious path. Invents X/Y/Z excuse (without verifying) that it can't be done. Notes that down and moves on. It could be as simple as site A being down and to download from site B but that's it.
- Hacks the test to make it work. Code is wrong? Nah, let's update the test.
- Keep saying useless things like YAGNI and infinite excuses like too risky to never do the work.
- Claims they are done but there's 100 edge cases not covered. When you try to use it it fails in ways you as a human assume it should work. You can write a spec to cover it all but then what's the point?
- Be trigger happy and never investigate. Tries to do it. 5 minutes. Oh it failed. Back out. Repeat. Better models definitely spend more time analyzing and actually "think". I've had models spend hours trying to do a change due to this method when an actual investigation (code walkthrough) might have solved it.
- Know and use the right tools. A lot of lesser models have infinite fear e.g. oh docker might not be available (it is) or this and that (even if you nudge it in any way) and spend a lot of extra time "working around" it.
The list goes on. Better models definitely help.
Only thing to agree on is no you don't need Fable but saying Sonnet can do the job instead of Opus is a different story. It's so obvious when Sonnet touches the code that I can't give it more than 5 minutes. It lies. Doesn't check. Forgets things and then messes up.
In your box plots, 4.6 sonnet wins over all (even opus 4.6, the 4.8’s and fable).
That’s not super surprising to me, but, given the apparent randomness of the stack ranking, is GLM actually worse than any of the Anthropic models? This looks like a 10-way tie to me.
We've spent some time trying to understand this anomaly, even re-running Sonnet 4.6 through our evaluations to see if that would bring down its scores... and it didn't. I don't know what they did differently, but it's basically Opus 4.6 with more temperature variability (some great responses, some less great, with an approximately frontier median response in agentic work specifically). It is smart, methodical and excellent at tool calling in our custom environments.
We now use Sonnet 4.6 for a number of internal use cases we wouldn't have considered otherwise.
That tracks with my experience.
4.7 was so bad, I locked a bunch of my machines to 4.6.
I haven’t bothered locking the 4.8 machines to 4.6. There was a HN thread a while back where they run swe bench a few times a day and measure success rate and latency. It showed opus getting significantly dumber for the week before a recent launch.
It wouldn’t surprise me if they’re quantizing to improve margins or to hype models in comparative testing in order to defraud investors at IPO.
Or, maybe QA is hard. Anyway, I think they hit a performance wall sometime at or before 4.6.
Doesn't track with mine. I've been stuck with Sonnet 4.6 with one of the clients I work for. It writes code fine, but it's not nearly as good as the more recent models for everything else. It's fairly common for it to suddenly go off the rails for no good reason, so I can't really trust it with agentic loops. It's also not very good at diagnosing non-trivial issues. It's not uncommon for it to suggest whole lists of irrelevant / nonsensical reasons for something not working. Then I copy/paste the code and some context into ChatGPT and it homes in on the correct issue right away, even with inferior tooling.
> In multi-agent coding environments, GLM 5.2 is just shy of Opus 4.6 on average.
Just want to express how amazing that is. Opus 4.6 is an amazing model. That an open weight model like GLM 5.2 competes with it is nothing short of outstanding.
What is the methodology of your benchmark?
On the contrary, I personally think these broader benchmarks are meaningless. I think personalized benchmarks are the way to go. They should answer "How does this model perform for MY use-case?" rather than trying to answer "How does this model perform across all coding environments?"
Case in point: I use Elixir, which is not as popular as Python and is always hit or miss with most SOTA models at the top of these benchmarks. Whereas the ones in the middle of the benchmarks (like GLM) almost always outperform even SOTA models from Google / Anthropic. However, this is relevant only for my use case and I wouldn't just advocate a model for everyone based off my use case alone.
We use a rotating pool of ~100 games for the coding parts of the benchmark, and submissions are scored objectively based on ratings similar to Elo. Models write code submissions to interact with the environment, then are evaluated in large batches against other submissions.
We test 11 popular/interesting languages (you can see the Languages chart to filter), but not Elixir -- although other evaluations have found that many LLMs solve more problems when working with Elixir [0]. Why models write code well in some languages over others seems to go beyond pre-training data (Python scores quite low for most models) and we don't fully understand it.
[0] https://elixirforum.com/t/llm-coding-benchmark-by-language/7...
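Since the scoring is described only as "ratings similar to Elo", here is the textbook Elo update for a single head-to-head result, as a reference point. This is standard Elo, not the benchmark's actual method, which is not documented in the thread. The K-factor of 32 and the starting ratings are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float,
           k: float = 32.0) -> tuple[float, float]:
    """Return new (A, B) ratings after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated submissions; A wins the game.
new_a, new_b = update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

The useful property for a rotating pool of games is that a rating only depends on results relative to opponents, so submissions can be compared even if they never faced the same set of games.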
An expressive and well designed language (elixir) is objectively better than a less well designed language like python. Python probably needs more LoC than elixir for the same task. Python is also untyped by default.
Elixir is not just expressive, it's highly conventional. I've found best practice code usually converges on the same idiomatic patterns, and well written codebases look very similar to each other in style
Thanks!
Opus 4.6 is still my preferred model for work, so this is great to hear.
I can't wait for open models to take over in all categories.
Sounds like this is the year for coding.
It looks possible open models will. I never expected the reason would be political/legal rather than technical.
The CEOs spent so much time talking about putting everyone out of work and how "unsafe" their models were that the government stepped in with export controls.
They did this to themselves.
Opus 4.6 was better than the current 4.8 in my subjective opinion using it. I have no real reference since in Europe mythos and its sister models aren't available...
So having a model of 4.6 quality is still extremely awesome. That is currently more or less the frontier reference outside the US :(
If a good SWE is $150/hour, does the model cost actually matter? Surely you'd be willing to spend $10/hour to make that SWE 20% more productive? The model cost is still much less than the salary.
With Claude Code Ultrathink, I used 3 million tokens in 20 minutes. At API prices, that would be around $30. So $90/h. Model cost is not that much lower.
x40hrs/week * 50 weeks = $180k
Congrats, now you’re paying an engineer’s salary to make your engineer at best 20% more productive.
Better to hire another engineer, or two jrs, and build up your in house talent.
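The arithmetic in the two comments above checks out, and it is easy to redo with your own usage. A rough sketch, using only the figures quoted in the thread (about $30 for a 20-minute session, 40 hours a week, 50 weeks a year); the function names are made up for illustration and the numbers are the commenters' estimates, not measured prices.

```python
# Back-of-the-envelope model spend, using the figures quoted in the thread.
# Assumption: usage continues at the same rate for every working hour.

def hourly_cost(session_cost_usd: float, session_minutes: float) -> float:
    """Scale the cost of one session up to a full hour at the same burn rate."""
    return session_cost_usd * 60.0 / session_minutes


def annual_cost(hourly_usd: float, hours_per_week: float = 40.0,
                weeks_per_year: float = 50.0) -> float:
    """Yearly spend if that hourly rate is sustained full time."""
    return hourly_usd * hours_per_week * weeks_per_year


if __name__ == "__main__":
    per_hour = hourly_cost(30.0, 20.0)   # ~$30 for 3M tokens in 20 minutes
    per_year = annual_cost(per_hour)
    print(f"${per_hour:.0f}/hour, ${per_year:,.0f}/year")
```

The result is very sensitive to the "same burn rate all day" assumption. Someone who only hits that rate for a couple of hours a day lands at a fraction of the yearly figure.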
Only you get things done a lot faster and don't need to pay an entire year's salary?
How so?
If you’re doing a one-off project, sure. But if you’re coding like this full time, you’re paying the year’s salary anyway.
And faster? Not so sure about that. Sure they can write code faster, but writing code is a small part of building something.
except this is way more than an engineering salary. At least in Europe.
I’m sure there are engineers making $180k USD / year in the EU. Maybe it’s unusual, but hey, now you can cancel your Claude subscription and hire a really good engineer
I don’t think any engineers who cost $150/hr are having their productivity moved by 20% depending on a $10/hr gap between models on or near the frontier.
Most of the gains right now come from tooling and process and any big post 2025 language model. The specific model isn’t that important right now.
Exactly. And being able to choose your own tools is much more valuable than having a tiny bit better model.
But SOTA models used liberally at API pricing cost a lot more than $10/hour. You can probably burn $100+/hour with just a single agent, and probably thousands when running agents programmatically, e.g. workflows.
Man, there is exactly zero information on your site about how your benchmarks work. Why should one trust your numbers when there is no way to verify them?
Scroll to the bottom for the methodology (sorry, this should be linkable)
I find it hard to trust a ranking system that gives Sonnet a higher capability score than Fable.
It would have made things easier for us if Sonnet 4.6 scored lower, but it's a great model and the data is real.
It doesn't have a higher capability score than Fable, though. We break our coding evaluations into 2 parts, and "one-shot coding" makes up part of the index, where Fable significantly outperforms every other model, which is why it's ranked at the top despite Sonnet 4.6 having a slightly higher median (and lower average) in long-horizon agentic workloads. One-shot coding tends to be the most correlated with other companies' model cards, whereas agentic coding is partly about how well a model can adapt to a custom harness. Fable also refused some tasks.
Data at https://gertlabs.com/rankings?ow=1&mode=oneshot_coding
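On the "slightly higher median (and lower average)" point above: the two can diverge when a model is usually strong but occasionally collapses on a run. A small sketch with made-up scores (these numbers are purely hypothetical and are not taken from the rankings):

```python
# Illustration of a higher median alongside a lower mean.
# Assumption: every score below is invented for the example.
from statistics import mean, median

# Usually strong, but with two runs that went badly.
model_a_scores = [70, 72, 74, 20, 18]
# Slightly weaker on a typical run, but very consistent.
model_b_scores = [66, 67, 68, 69, 70]

if __name__ == "__main__":
    for name, scores in (("A", model_a_scores), ("B", model_b_scores)):
        print(f"model {name}: median={median(scores)}, mean={mean(scores):.1f}")
```

The median describes the typical run, while the mean gets dragged down by the failures, so which one a ranking leans on changes the order.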
Why is Sonnet 4.6 ranked higher than Opus 4.6?
Sonnet 4.6 is ahead of Opus 4.7? Hm.
After having used GLM 5.2 and Opus 4.8 for enough time, I'm very unconvinced of the benchmark maxxing claims - if anything, GLM 5.2's rather lackluster performance on benchmarks compared to Opus 4.8 paints the opposite picture when compared to the subjective experience.
When I first used Opus 4.8, I threw several different workloads I had at it - I have Claude doing a lot of misc projects whose primary purpose is pretty much just studying what AI agents can do for my own curiosity and no other reason. Opus 4.8 was one of the first models I ever snuck in there that basically ran out of control. No previous Opus or Sonnet model I had used ever did this. Within hours every agent I had running was writing nonsense tool calls that echoed pretend commands that didn't exist, like 10 in a row, and talking about the "tool channel" being dirty. I switched back to Opus 4.7 and assumed Opus 4.8 was legitimately just broken.
I did come back to Opus 4.8 and found that it was indeed, pretty powerful. But that initial experience has stuck with me on just how narrow of a perspective any given test or benchmark is guaranteed to have. LLMs are too broad, it really doesn't matter what you try to do in your benchmark, you will necessarily get a limited view of what the model is capable of and its shortcomings. This will remain true for at least as long as models are susceptible to massive swings in performance based on randomness and minor differences in prompts and other environmental factors.
I'm not saying benchmarks are useless or that your benchmarks are not possibly closer to the truth either. All evidence at least points to the idea that Chinese models perform very well in coding but often have more mixed results on other tasks. I'm just saying that at this point, benchmarks feel like they have limited connection to my actual real experiences. GLM 5.2 actually scored kinda meh on a lot of benchmarks (compared to closed frontier models) but my actual experience using it does not match this.
And I'm definitely not saying GLM 5.2 is better than the frontier LLMs here, just that the race is close. I still prefer GPT 5.5 right now for code review, I think, and Opus clearly has some advantages depending on the task. It's just no longer a given that Opus 4.8 will perform better than GLM 5.2 on any given task, so to me the calculus behind "using the best model available" is getting complex and you might need to get a feel for what models have what strengths to really figure it out.
I do feel like the "use the best model available" mentality is not going to die any time soon, but if it does die, it will be gradual and start soon for programming. Modern LLMs are still not a full superset of what human programmers can do, but still larger models are definitely starting to hit diminishing returns for tasks at the lower end of complexity, and that is a big deal. It's a weird world where some tasks you can feel kinda confident just throwing Gemma 4 at it and not sweating whether you should use a better model; I've certainly done it for some quick Python scripts or getting an overview of some code I'm unfamiliar with.
I really dislike Opus 4.8. It rarely completes things and prefers to waste tokens making lists of things that are missing. When it is stuck or needs input, it words the challenge at length without conveying anything useful for decision making, and quite often its solution to problems is to excise features or just try/catch errors and proceed silently with faulty data.
Why is DeepSeek V4 Flash better than Pro in your benchmarks?
It's 100% due to tool use -- Flash adapts much better to our custom harness with tool names that are not identical to what models were likely trained on. DeepSeek V4 Pro performs much worse in that aspect than almost all other recent releases, for whatever reason.
I have also found DeepSeek Flash beating Pro in some of my own internal evals for tasklet.ai. It’s really surprising and I don’t understand it.
Same. It's rare, but I have observed it twice to date.
Some blog post I read a few weeks back said that DSV4Flash at xHigh effort beats even the Pro model at xHigh effort.
The rumour is that it's trained on Opus, but who knows
Oh, of course all the DeepSeek and GLM models are. Multiple people have seen GLM self-report that it is Claude, which makes it super obvious.
I think the surprising thing is that I'd expect Flash to be a pure distillation and strictly worse in quality, but clearly it’s more nuanced than that.
Claude claims to be DeepSeek under some circumstances:
https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...
Don't ask Western LLMs in Chinese what model they are...
Maybe they distilled Claude for the Flash version and not for the other, hence the better tool use and programming benchmarks.
This was a preview release. They haven't finished training. Pro contains more knowledge, but it probably takes longer to train than Flash for the smarts to kick in.
Notice the website URL is the same name as the commenter.
Notice he's using "trust me bro" benchmarks.
Can we just remove all the motivated speech on HN? This is just not trustworthy information at all and obviously is incentivized.
Everyone is grinding and marketing; nobody is actually discussing anything for real.
What does this even mean?
It means people have self-inflicted AI psychosis
It's always been ok for people to talk about their projects here. In fact it's encouraged.