> but if you only want to use the best model available, it isn't there yet

I'm trying to wrap my head around exactly why so many people seem to want the best model available when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price. And the frontier models get nerfed constantly, so with open weights you can get something slightly less performant but way more stable. It's almost like buying a Ferrari for your daily commute instead of a Toyota or even a Mercedes.

I think there are several factors. Certainly marketing, which is rampant online, makes us think we need the shiny thing, and very smart people think they aren't susceptible to it. There's a lot of really odd 'I trust Anthropic/OpenAI more than Deepseek' which tends to ignore, for starters, that you can choose your provider and still save a ton. I also think there's some amount of addiction and brand loyalty, where a Ferrari is one hell of a drive so you turn your nose up at that sensible Toyota. Oh, the other one I see used is like 'oh, only Fable can one-shot updating my embedded systems thing from 1975 to Rust', which is great, but let's recognize how niche that is.

And it ends up just coming across as people getting SO reliant on the tools so fast. Maybe it's ok to think and, like, read a few lines of code and work with these agents to convert your thing to Rust or center your div. Even if coding is over, which in some sense it certainly is, don't turn your mind into the WALL-E people yet. I found myself guilty of this so often. It took way more time and effort to do things via prompt, and I still wouldn't just open the editor and fix it, because the dopamine hit of the magic the abstraction provided was so strong.

So I'm pretty much done using the 'best' (on benchmarks, if money isn't an object, etc etc) models available. After a year on Sonnet/Opus/GPT5x I'm having way better results with open weights models that don't get lobotomized weekly. I'm finding ways to do the crafting part of building software by focusing on honing my harness and workflow. I'm enjoying changing the oil on my Toyota after a year of almost flying off cliffs in my Ferrari and if I can check my ego it's a purely positive thing.

> I'm trying to wrap my head around exactly why so many people seem to want the best model available

This is the logical end point of the fear-based way LLMs are marketed. You must want the best, because everyone who has the best can work faster than you, generate more — if you don't have the best, you are behind! Why would you want to use anything other than the best?

The thing is, once everyone has the best, the question is: how much can you spend? If you can't spend more, you are behind! If spending the most will get you ahead, why would you not want to spend the most, if you can afford it?

There is only one way through this, in the long run: work out a way forward that doesn't make you dependent on this cycle. If you can compete at all, without the spend, what happens is: they burn money and you don't.

FWIW so far I don't think the benchmarks prove very much about the actual experience, and you can discover this just as easily without spending any money. And we know this about benchmarks! Once a benchmark seems useful as a measurement, it becomes a target and it stops being as useful.

I think your strategy is right. It requires bravery, and as you say, it requires ego balance. But I believe it is obvious that the world will either come around to a more sensible, stable pattern or it doesn't matter either way because we're fucked. So opting out of this mad early cycle and choosing to be calmer and happier is a choice you can just make.

> most halfway decent models can write damn good code for a fraction of the price.

The difference is how the model is used.

With Opus you can give it a long-horizon task (eg build an entire feature) and it will plan it out and implement it and almost always stay on task. This is what people mean when they say "agentic tasks"

With the lesser models the code is fine, but they need something else to plan what needs to be done.

GLM-5.2 is the third model (after Opus 4.6+ and GPT-5.5) that can do this agentic style of work.

Notably, Gemini 3.1 Pro is notoriously bad at this style of work: the code is good, but it drifts off task most of the time. 3.5 Flash is supposed to address this, but I haven't had a good reason to try it.

My whole point is that I don't want it to build an entire feature from one prompt. At most, I want to work with an agent to nail down the spec and then work with an agent that orchestrates the implementation via other agents, same for testing, etc. None of that requires frontier capabilities, it requires a little bit of work on a harness, a little bit more of my input, a little more of my brainpower. I _want_ to build tools that make it work better and don't change when the CC team gins up some default for their harness and foists it on me. I don't see that as a tradeoff at all and I think engaging in my work process more than fire and forget (and literally always in my experience fix stuff later) is more fun and rewarding once the 'holy shit this is now possible' high wears off. Doubly so once the frontier model gets nerfed mid-cycle and now I have to undo the mess because they released v*.x++ and I fell for it again by trusting it to do these agentic tasks without my involvement.

> My whole point is that I don't want it to build an entire feature from one prompt

You are free to do you. But you were asking about why others want the best model.

The answer is, clearly, agentic coding (ie multiple agents each cranking through tasks independently) lets you ship A LOT more business value if used correctly.

Yep. I've tried to use the models to build large things for me. You can't trust the code it produces. Even if it works there are parts that are hot garbage, and will bite you later on. I've found out that having an editor open, asking it to implement things until a certain point, manually fixing some of the worst things it generates, then asking it to expand from there is much better than just prompting a thing and pushing to production.

And hey, don't get me wrong, you can get pretty far with just prompting. But the subtle misses and (I'm looking at you GPT) the overengineered 20k line PRs to do a simple thing are going to cost you a lot if you're not vigilant.

> My whole point is that I don't want it to build an entire feature from one prompt. At most, I want to work with an agent to nail down the spec and then work with an agent that orchestrates the implementation via other agents, same for testing, etc. None of that requires frontier capabilities

I don't think anyone is stopping you. This is an entirely valid way of working.

I for one am glad to leave that behind me. The sooner I never have to write another line of code the better (professional software engineer for nearly 30 years here, for context).

I don't know about you guys, but half of the time I give Opus something actually complicated, it spends 50+ minutes trying to understand the problem, running lots of searches and tool calls, and then gives up and just writes a brief summary of what it thought about. Biggest waste of tokens you can imagine.

I would say 3.5 Flash is great if you use a good open harness. I use omp for that. The thing with Google is that they announce they have a great model, and that they have been testing it internally for half a year. I guess they don't care too much about who uses it or how.

I am still struggling with how to deal with sub agents and different roles for each model. I still think Claude or Codex are overall better models, but everything around them gives off such weird vibes, including, and this one kills me, that at certain times they feel dumbed down.

I keep changing these things often, but I have a basic subscription to Codex ($20 plan), which I use together with GLM 5.2 to do some high-level planning of what I intend to do, and then I let DeepSeek do the coding. Or something along those lines.

Point is, GLM 5.2 is now at a point where I cannot tell you if it's better or worse. I can tell you however one thing: no matter when I use it, it's consistent in what it does and how it works.

Then there is the Fable thing, but as with many things, I think the past has distorted the reality. It lasted two days, but Anthropic said clearly that for plan users it would only be there for two weeks. It was great for doing what you can already do with other tools: doing all the planning, and reviews, and launching a million subagents talking to each other. I sometimes wonder if it was really a new model, or just Opus 4.9 wrapped with some fancy model-driven harness.

Big fan of Amp but pretty sure it only uses Flash for search: https://ampcode.com/models

As for Fable: I used it as much as I could while we had it.

It was a step change over Opus with my work.

Or maybe it was supposed to be the OMP (OhMy Pi) harness. Pi can do just about anything for you. Use most models in most ways possible. You just tell Pi what you want, and it builds an extension for itself.

> With Opus you can give it a long-horizon task (eg build an entire feature) and it will plan it out and implement it and almost always stay on task. This is what people mean when they say "agentic tasks"

I've had no trouble getting the current generation of smaller models to do the same thing. Maybe it's more of a harness issue than a model issue?

Recently I've used both MiniMax M3 and DeepSeek V4 Flash to one-shot moderately complex applications from a written spec, and neither one got lost along the way

> 3.5 Flash is supposed to address this, but I haven't had a good reason to try it.

Price and speed, for me. GLM5.2 is "good enough" for some tasks, but rather slow (on their coding plan). In the time it takes GLM to "read files to figure out...", gemini flash is usually finished. It's not SotA for coding, but it's fast and often "good enough" for normal tasks.

> Price and speed, for me.

For Flash 3.5?

I'm a big fan of Gemini 3.1 Flash Lite Preview (yes that is the name..).

I keep an agentic SQL benchmark up to date to test new models. It's more or less saturated above 23/25, but below that it is still useful, and even at that level it is good for comparing speed, cost and token efficiency.

3.1 Flash Lite Preview scores 22/25 in 142 seconds for $0.02. That's a great result if you care about cost for performance.

3.5 Flash scores 20/25 in 367 seconds for $0.76. The slow speed is because it takes a lot of tokens to generate its results, so even if tokens are produced quickly it takes too many to get a positive result.

Nothing I've seen or heard suggests 3.5 Flash is better than this result indicates.

https://sql-benchmark.nicklothian.com/?highlight=google_gemi.... vs https://sql-benchmark.nicklothian.com/?highlight=google_gemi... (click the cells to see the traces)
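To put those two results on the same footing, here is a rough sketch that turns the figures quoted above into cost and time per correct answer. The `Run` class and its field names are made up for illustration and are not the benchmark's own schema; only the three numbers per model come from the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark run. Field names are illustrative, not the benchmark's schema."""
    model: str
    score: int        # correct answers out of 25
    seconds: float    # wall-clock time for the whole run
    dollars: float    # total cost of the run

    @property
    def dollars_per_point(self) -> float:
        return self.dollars / self.score

    @property
    def seconds_per_point(self) -> float:
        return self.seconds / self.score


# Figures as quoted in the comment above.
flash_lite = Run("Gemini 3.1 Flash Lite Preview", score=22, seconds=142, dollars=0.02)
flash = Run("Gemini 3.5 Flash", score=20, seconds=367, dollars=0.76)

for run in (flash_lite, flash):
    print(f"{run.model}: ${run.dollars_per_point:.4f}/point, "
          f"{run.seconds_per_point:.1f} s/point")

# How many times more 3.5 Flash costs per correct answer.
cost_ratio = flash.dollars_per_point / flash_lite.dollars_per_point
print(f"cost ratio: {cost_ratio:.1f}x")
```

On these numbers 3.5 Flash comes out at roughly 42 times the cost per correct answer, which is the "cost for performance" point in compact form.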

Yeah, the funniest thing about everyone freaking out about Fable's capabilities recently was that for most of the stuff they were amazed by, you could get roughly the same result from DeepSeek Flash.

I used to be obsessed with what's the best model. Then a while back when the new best model came out, I tested it on a task. I also tested its little brother (much smaller model from same company).

They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...

"Best model" discourse always reminds me of my days in Monster Hunter, with people who refused to consider playing with anything other than the meta set for their weapon and then proceeded to immediately cart right at the beginning of the hunt :)

With the wealth of models available (open source vs closed, api vs local), I find optimizing the cost-efficiency of your token consumption an important part of business-oriented AI engineering. You don't need "the best" for every task.

A lot of the monetization strategies for LLMs depend on the need to use them via SaaS subscriptions. If companies start to realize that local AI is cheaper, provides good enough results and makes them independent, then that monetization strategy falls apart and a whole industry collapses.

> They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...

Same for me, I certainly don't have the same definition of success and failure either.

A more expensive model has *less* room for wandering around than a cheaper model.

If Claude wanders around for 10 minutes before finding the most obvious solution, then I count it as a failure.

I would say one thing I've enjoyed about the latest frontier models from US labs is that you just work at a higher level of abstraction. You can talk about the end goal and it'll just rip. You'll add scaffolding to constrain the patterns etc, but I do way less baby sitting than I expected on 5.6 vs 5.4 vs Deepseek v4 Pro.

I'm using DeepSeek v4 Flash through OpenCode and OpenRouter, and it works just fine. For what I'm building, it's not the bottleneck, I am. My part involves understanding the problem I'm solving and checking correctness.

Meanwhile, it's such a cheap model that I've spent not even $25 over 3 weeks.

Reason people want the best: people want to believe their project is so advanced that they need the most clever LLM possible. To say otherwise is to admit that it's not really frontier or novel in any way. And people don't like that.

I’m writing a lot of React code and find that the cheaper models are pretty terrible. Maybe I’m holding it wrong, but the claim that the cheaper model is usually enough just doesn't track with my experience. Worse, I find predicting the difficulty of tasks exceedingly difficult. More often than not, starting with the cheaper model means I have to reroll with a more expensive one or waste a lot of time and tokens cleaning up the subpar results. With OpenAI and Anthropic still subsidizing tokens, not using the best models still seems like a tough ask.

What happens when you find the models are terrible? The claimed results don't match? My dev cycle tends to be: write a test for blah blah, add the feature to satisfy the test, make sure the tests pass.
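For anyone who hasn't worked that way, a tiny sketch of the loop. The `slugify` function is a made-up example, not something from the thread; the point is only the order of the three steps.

```python
import re


def slugify(title: str) -> str:
    """Step 2: the smallest implementation that satisfies the test below."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


def test_slugify() -> None:
    """Step 1: written first, before slugify existed."""
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  many   spaces ") == "many-spaces"
    assert slugify("") == ""


# Step 3: run the tests (a runner such as pytest would collect this too).
test_slugify()
print("tests pass")
```

The test gives the model (or you) a concrete target, so "terrible" output shows up as a failing assertion rather than a vague impression.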

For math, even the frontier has shortcomings, and there is a steep drop from GPT 5.5 xhigh to anything else. The time wasted by less-than-SotA just isn't worth it.

I've landed in a similar place by reducing effort and cutting up tasks. I find that more exacting specifications to the models yield significantly less need for "effort". Combining that with multiple git worktrees and an integration branch for the current worktrees themselves has yielded incredible results.

This also allows me to play with, and mix codex, claude cli, and others. This is my happy spot for the last two months.
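A minimal sketch of what a worktree-per-task layout could look like. This is one reading of the setup described above, not necessarily the commenter's; the branch names (`integration`, `task/...`) and the `../wt-...` paths are invented. The function only builds the git commands and does not run them.

```python
from typing import List


def worktree_plan(tasks: List[str], base: str = "main",
                  integration: str = "integration") -> List[List[str]]:
    """Build (but do not run) git commands for one worktree per task."""
    commands: List[List[str]] = [
        # Integration branch that the per-task branches are merged into.
        ["git", "branch", integration, base],
    ]
    for task in tasks:
        branch = f"task/{task}"
        path = f"../wt-{task}"
        # `git worktree add -b <branch> <path> <start-point>` creates the
        # branch and checks it out in its own directory.
        commands.append(["git", "worktree", "add", "-b", branch, path, integration])
    for task in tasks:
        # Meant to be run from a checkout that has the integration branch checked out.
        commands.append(["git", "merge", "--no-ff", f"task/{task}"])
    return commands


for cmd in worktree_plan(["api", "ui"]):
    print(" ".join(cmd))
```

Each agent (Codex, Claude CLI, whatever) then gets its own directory and branch, so they cannot trample each other's working tree, and the integration branch is where the pieces meet.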

Yeah, this sounds close to my workflow, and it's good to hear you've found a nice flow too! It frees me up to spend that effort on doing more things in parallel and focusing way more on the specs, which is usually a good idea anyway.

Because not every problem is a coding problem or not entirely solvable through code. Other tasks include legal, philosophical, financial, investigative, and combinations of these and others.

It doesn't look like that's where the conversation was going, though.

> I'm trying to wrap my head around exactly why so many people seem to want the best model available

To me this is a "more expectations mean more disappointment" situation.

Some people have higher expectations than others, and even the best model available is not good enough for what those people really want it to do once you start digging. In that light, the goal is not using the best model, but rather using the least insidiously deficient model.

Many people chase the edge because it's the least disappointing.

> when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price.

The fatuousness of this statement pretty quickly becomes apparent if you spend more time looking at it, IMO, because the venn diagram of "damn good" and "not nearly good enough" strongly overlaps. Even the best model writing excellent lines of code still has noticeably deficient ability to decide which excellent lines of code to write. The goal is to improve the separation between them, not save a few dollars, because the emotional effort is worth more to us than the money.

> And the frontier models get nerfed constantly, so with open weights you can get something slightly less performant but way more stable.

Your minimization of performance differences and maximization of stability differences is exposing your biases.

Side note: I think you should know that to me at least some of what you said reads like self-rationalized moralizing. I couldn't help but imagine Principal Skinner saying "Am I so out of touch? No, it's the children who are wrong." People don't only want different things than you do because they don't know what they're doing.

I don't drive the best car available on the market. I don't own the fastest and best PC/Laptop/Smartphones available. I don't live in the best house in my city. I made reasonable choices that balance my needs and my available budget.

I think people are grouping into two flows.

One group is trying to get the LLM to basically one shot everything and not properly reviewing the output.

Others are using the LLM to assist their human intelligence in a tight loop.

If you’re doing the former you really do need the best model available because that’s still right on the edge of what LLMs can do at best, and at worst you’re just shipping pure unmaintainable slop.

If you’re doing the latter then you can get away with a slightly less powerful model without it making a material difference because your human intelligence is filling in gaps

The latter takes too much mental resources, and the same goes for truly reviewing the code generated by the former.

I generally start out reviewing, but after a while (a few hours at most) I just can't keep up and resort to LLMs as sole reviewers.

not many want to admit this

Well put. I belong to the latter group, as I feed small, granular tasks that I describe thoroughly to the LLM. I did try, however, to just give it a bigger-scope task. Even the best models produce sloppy code.

While the single functions/classes/structs/... can be well thought out, the code tends to lack cohesion, and especially maintainability. For instance, it never thinks: "I could put this logic in an interface/trait so that if the requirements change I can simply add a concrete implementation that satisfies the new requirements (and potentially use one of these for testing)".
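A small sketch of the kind of seam being described, using Python's `typing.Protocol` as the closest thing to an interface/trait. The names (`RateSource`, `FixedRates`, `convert`) are hypothetical and only there to show the shape.

```python
from typing import Dict, Protocol


class RateSource(Protocol):
    """Anything that can report an exchange rate. Hypothetical interface."""
    def rate(self, currency: str) -> float: ...


class FixedRates:
    """Concrete implementation backed by a dict; also handy in tests."""
    def __init__(self, rates: Dict[str, float]) -> None:
        self._rates = rates

    def rate(self, currency: str) -> float:
        return self._rates[currency]


def convert(amount: float, currency: str, source: RateSource) -> float:
    """Business logic depends only on the interface, not on where rates come from."""
    return round(amount * source.rate(currency), 2)


print(convert(10.0, "EUR", FixedRates({"EUR": 1.5})))
```

If the requirements change (say, rates now come from an HTTP service), you add another class with a `rate` method and `convert` stays untouched, which is exactly the step models tend not to take unprompted.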

Yes that's also my experience.

SoTA models can do reasonably good jobs on each ticket, but over time the architecture of the application starts degrading without a human in the loop.

The entropy increases slower with better models but the trend is always towards slop

>why so many people seem to want the best model available

In my case, I rarely ever go over the Claude/ChatGPT subscription limits, so might as well use those considered-best models. If I had to generate millions of lines of code, maybe I would've used the open models more.

I agree, but there are use cases for the 'best model' other than converting your 1975 stuff to Rust. For use cases where LLMs are just starting to be useful, I really want to use the current 'best' model: e.g. CAD, PCB design etc., and in particular anything which requires spatial reasoning. In the short time I had access to Fable 5, it was just way better than any other model.

Except that there is no application for AI in CAD that is better, more appropriate, more robust or more sensible than learning how to use a CAD package and doing it yourself.

It's not fast-changing, it's not abstract, it's just not that difficult, and where it is difficult, the AI cannot help you, because it is not capable of things you are capable of.

Learn CAD yourself. Honestly; I was sure I would never manage to learn CAD but it turns out to be interesting, rewarding, valuable and actually quite quick to learn.

An LLM certainly is not going to be able to do it better than you once you have a tiny bit of experience. (PCB design, perhaps, has a language to it that an LLM can make a bit more headway into, but as a non-PCB-designer I would still bet that it's more like CAD than code)

This is a refreshing perspective because recently I feel like I’m surrounded by people who think they can effectively implement complex software, just by hammering the best models.

It has been hard to explain that they are in fact just creating toy versions and that there is no way they can do it without learning the underlying architecture. But they just keep going, wasting hundreds of dollars, lost in a sea of bugs.

Until a few years ago I'd have been the person who thought you could make a text-to-CAD system scale up to all of it. And then I tried to make stuff I wanted.

Dabbled with OpenSCAD as we will. I decided to learn FreeCAD and what I discovered is that, even putting aside FreeCAD's many documented issues, parametric GUI CAD is not an imprecise, clumsy or fiddly way to work.

It is expressive, precise, generally capable of all the things that code-CAD can do and much more, and it's much, much quicker to work in, once you've learned a few core principles.

As you say, there is an underlying architecture; it's not just a sort of 3D paint package.

The problems the text-as-whatever crowd have are all Dunning-Kruger things in the truest sense.

People who are unaware they are unskilled in a particular technology are unlikely to successfully replace it with another. Particularly one that requires describing the problem domain in precise language.

Quite often when you see text-to-CAD discussions, especially here, there's evidence of profound misunderstandings from the people who think they are going to automate it. They assume their frustrations with the tools stem from limitations of the tools, not from the limits of their understanding.

As a person with decades of experience of code I have found learning how to use LLMs effectively to be much, much harder than learning CAD.

For me, the 20€/month subscriptions were always sufficient, and it's nice if that subscription gives the latest and greatest results.

It depends. Claude’s $20 plan is kind of a mess.

It's also geeks and engineers using these models and being the most vocal. We always think we're special and need the extra horsepower. Ever been on one of those home lab subreddits? Same story.

> I'm trying to wrap my head around exactly why so many people seem to want the best model available

I've been programming since I was a kid. I enjoy it a lot, I like knowing how things work, I get excited about new compiler features, I stayed up every night for a week when I discovered Lean 4, etc etc etc.

At the same time I realized a few years ago that I just don't want to write any code ever. Or read any code. Coding is addictive and fun, but I'd rather talk to the computer and have things magically get done. (FWIW learning how to use LLMs feels more.. fulfilling, too)

Anyway. GLM 5.2 is nice and all, but I might have to spend half an hour guiding it to come up with a plan I'm happy with. And with Opus it could be 15 minutes. I'm still going to spend an hour talking to LLMs one way or the other, but with Opus it will be a less frustrating hour. If Fable gives me a frustration-free hour, I'll switch to Fable.

>> I'm trying to wrap my head around exactly why so many people seem to want the best model available when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price.

The reason is pretty simple and has to do with statistics: on long-horizon tasks, small errors and deviations from the "good path" compound.
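A back-of-the-envelope version of that argument, under the simplifying assumption (mine, not the parent's) that each step succeeds independently with the same probability. Real agent runs can notice and recover from mistakes, so treat this as an illustration of the shape of the curve rather than a model of any particular LLM. The per-step rates below are made up.

```python
def chance_of_clean_run(per_step_success: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


for p in (0.99, 0.999):
    for n in (10, 100, 500):
        print(f"p={p}, steps={n}: {chance_of_clean_run(p, n):.3f}")
```

With these rates, a model that is right 99% of the time per step gets through a 100-step task cleanly about 37% of the time, while one that is right 99.9% of the time does so about 90% of the time. A gap that looks negligible on a single step becomes large over a long horizon.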

What is your favorite harness for the open weights?

We built our own and aren't done open sourcing it, but before that I got to a really good place with opencode plus some custom agents. The pi family is good too, although I haven't used it as much. We made an agent to design a spec, one to implement by dispatching subagents, one to validate against the plan, things like that. All of this helps Claude/GPT too IME. For open models it has helped them stay out of loops (e.g. Kimi's "but WAIT"), and for frontier models it helps them stay on task and not invent bloated patterns.
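A toy sketch of that spec / implement / validate split, to make the shape concrete. This is not the commenter's harness: the `Agent` type, `run_pipeline`, and the three toy agents are all invented here, and a real harness would wrap model calls where these stand-ins return canned strings.

```python
from typing import Callable, List

# An "agent" here is just a function from a prompt to a text reply.
# In a real harness this would wrap a model call; these are stand-ins.
Agent = Callable[[str], str]


def run_pipeline(task: str, spec_agent: Agent, worker_agent: Agent,
                 validator_agent: Agent, max_rounds: int = 3) -> List[str]:
    """Spec -> implement each step with a sub-agent -> validate each result."""
    spec = spec_agent(task)
    # Assumption: the spec agent returns one step per non-empty line.
    steps = [line.strip() for line in spec.splitlines() if line.strip()]

    results: List[str] = []
    for step in steps:
        for _ in range(max_rounds):
            output = worker_agent(step)
            verdict = validator_agent(f"STEP: {step}\nOUTPUT: {output}")
            if verdict.strip().upper().startswith("PASS"):
                results.append(output)
                break
        else:
            # Bounded retries keep a looping model from burning tokens forever.
            raise RuntimeError(f"step never passed validation: {step!r}")
    return results


# Toy agents so the sketch runs without any model behind it.
def toy_spec(task: str) -> str:
    return f"design {task}\nimplement {task}\ntest {task}"


def toy_worker(step: str) -> str:
    return f"done: {step}"


def toy_validator(report: str) -> str:
    return "PASS" if "done:" in report else "FAIL"


print(run_pipeline("login form", toy_spec, toy_worker, toy_validator))
```

The retry cap is the part that speaks to the "stay out of loops" point: the harness, not the model, decides when to stop trying a step.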

pi is great for learning; oh-my-pi has all the nice things included that I'd previously built for pi myself.

pi-mono

What is pi-mono? (I've heard about pi)

Of course people want the best model available, even at 10x costs, if they are not paying for it. If the company is paying, why wouldn't you want a 2% better model?

That changes as soon as the developer is the one paying for a model. Then it's a classical engineering trade-off between money and quality, and that's where open models are clear winners.

> most halfway decent models can write damn good code for a fraction of the price

The problem isn't what they do in a blank state. It is how they get there, and the edge cases. Some models also take longer (use more steps), i.e. they end up costing more despite being "cheaper".

I've seen models:

- Back out of plans non-stop. Try the obvious path, invent an X/Y/Z excuse (without verifying) for why it can't be done, note that down and move on. It could be as simple as site A being down and needing to download from site B, but that's it.

- Hack the test to make it work. Code is wrong? Nah, let's update the test.

- Keep saying useless things like YAGNI and making endless excuses like "too risky" to never do the work.

- Claim they are done when there are 100 edge cases not covered. When you try to use it, it fails in ways that you as a human would assume should work. You can write a spec to cover it all, but then what's the point?

- Be trigger happy and never investigate. Try to do it. 5 minutes. Oh, it failed. Back out. Repeat. Better models definitely spend more time analyzing and actually "think". I've had models spend hours trying to make a change this way when an actual investigation (code walkthrough) might have solved it.

- Not know or use the right tools. A lot of lesser models have infinite fear, e.g. oh, docker might not be available (it is) or this and that (even if you nudge it in any way), and spend a lot of extra time "working around" it.

The list goes on. Better models definitely help.

The only thing to agree on is that no, you don't need Fable, but saying Sonnet can do the job instead of Opus is a different story. It's so obvious when Sonnet touches the code that I can't give it more than 5 minutes. It lies. Doesn't check. Forgets things and then messes up.
