I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.

Once we hit that point, I am curious how much of Anthropic's current business model falls apart? So far it's always been clear that you just pay for the most intelligent model you can get because it is worth it. It now seems clear to me that there is limited runway on that concept. It is just a question of how long that runway is. I honestly wonder how much of their frantic push to broaden out into enterprise / productivity is because they see this writing on the wall already.

> At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing.

Is that true? I find the smarter models can just be effective when smaller models can't. It isn't a matter of just waiting longer.

it's almost certainly not true yet, but at some point there might be an equilibrium reached of speed vs quality (and let's not forget cost) where it's true for most of what you do.

Perhaps you'd still turn to hosted models for the hardest tasks, but most tasks go local. It does seem like that would make demand go down significantly.

Of course that's all predicated on model advances plateauing, or at least getting increasingly expensive for incremental improvements, such that local open source models can catch up on that speed/quality/cost curve. But there is a fair amount of evidence that's happening. The models are still getting noticeably better, but relative improvement does seem to be slowing, and cost is seemingly only going up.

Why is this presumed to be inevitable? Consider:

* local compute isn’t scaling the way it used to, so algorithmic improvements are the only way models get meaningfully faster and smarter

* all those same algorithmic improvements would also be true for larger models

* hardware manufacturers have an incentive against local LLMs because cloud LLMs are so much more lucrative (+ corps would buy desktop variants if they were good enough)

So no, it’s not clear quality will ever be comparable. It may be good enough for what you want but there will always be a harder problem that you need to throw more compute and more memory at.

> It may be good enough for what you want but there will always be a harder problem that you need to throw more compute and more memory at.

Sure, but if the “good enough for what you want” covers the vast majority of cases, data-center AI becomes just for the very extreme edge cases. Like how I can render a 4K video game at 60fps on my home PC, but if Pixar wants to render their next movie they use data-center compute.

> all those same algorithmic improvements would also be true for larger models

Smaller models run faster. If ten runs of a small model gets me the same quality result as one run of the big model, and the small model runs 10x faster, then they are functionally the same.
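
As a back-of-the-envelope check, that equivalence is just multiplication. A toy comparison in Python, where every price and speed is an invented assumption for illustration (nothing here reflects real model pricing or benchmarks):

```python
# Toy comparison: many runs of a small model vs. one run of a big model.
# Every number below is an invented assumption, not a real benchmark or price.

small_cost_cents = 5    # assumed cost per attempt with the small model
small_secs = 30         # assumed wall-clock time per attempt
big_cost_cents = 50     # assumed cost of a single big-model run
big_secs = 300          # assumed wall-clock time of that run

attempts = 10           # assume 10 small-model attempts match one big-model result

# If the small model is 10x faster and 10x cheaper per run, ten attempts
# land on exactly the same cost and latency as one big-model run.
print(small_cost_cents * attempts, big_cost_cents)  # 50 50
print(small_secs * attempts, big_secs)              # 300 300
```

The interesting regime is when the ratios diverge: if the small model needs fewer retries than its price ratio implies, it wins outright.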

> Like how I can render a 4K video game at 60fps on my home PC, but if Pixar wants to render their next movie they use data-center compute.

This is a very nice analogy actually and it impacts the whole story about US vs. Chinese leadership in "frontier AI".

> I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.

It's always going to come down to cost: developer time vs. developer cost vs. AI cost vs. developer productivity.

With 4.6 it's looking like we're at the upper limit of appetite for cost (for "regular" businesses), so the other levers will probably need to change.

Kilo (the open source coding agent) tested DeepSeek V4 Pro and Flash vs Opus 4.7 and Kimi K2[1].

It did OK, but scored substantially lower than Opus. It also cost nearly as much, even with the current launch promo pricing for DeepSeek.

That cost point is interesting - I've seen similar things with Sonnet vs Opus, and in my own benchmarking there are models that benchmark well and seem to have a good price, but use so many tokens that they cost just as much as "more expensive" models.
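
That effect is easy to put numbers on: effective cost per task is tokens consumed times price per token, so a model with a lower list price can still cost the same once verbosity is factored in. A sketch with made-up prices and token counts (not the actual figures from the benchmark):

```python
# Effective cost per task = tokens consumed x price per token.
# Prices and token counts are invented for illustration, not real model rates.

def cost_per_task(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a per-million-token list price into the actual spend for one task."""
    return tokens * usd_per_million_tokens / 1_000_000

# A "cheap" but verbose model vs. an "expensive" but terse one.
verbose_model = cost_per_task(tokens=400_000, usd_per_million_tokens=2.0)
terse_model = cost_per_task(tokens=80_000, usd_per_million_tokens=10.0)

print(verbose_model, terse_model)  # 0.8 0.8 -- list price alone is misleading
```

Benchmarks that only report score and list price miss this; tokens-per-solved-task is the number that actually matters.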

[1] https://blog.kilo.ai/p/we-tested-deepseek-v4-pro-and-flash

The pricing they show is without the discount.

> With DeepSeek’s 75% promo applied to current rates, the same run would have cost closer to $0.55, putting it below Kimi K2.6 in absolute cost while scoring 9 points higher.

I will be sad when the discount ends.

Oh misread that sorry!

I imagine we'll get to "good enough" for hobbyist programmers fairly quickly, but businesses will still be willing to pay more for faster and smarter. Why make your programmers wait?

> Why make your programmers wait?

That depends on where the methodology goes, but more and more it's hands-off. If the trajectory continues it won't matter, because nobody is sitting there waiting / watching the LLM code anyway. It is all happening in the background. We might see hybrid approaches where the weaker / cheaper agent tries to solve it and just "asks for help" from the more expensive agent when it needs to, etc.
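
That hybrid approach can be sketched as a plain escalation loop. Everything here is a hypothetical placeholder: the solver and checker functions are stand-ins for illustration, not any real agent API:

```python
# Escalation sketch: let the cheap agent retry, fall back to the expensive one.
# cheap_solve, expensive_solve, and check are hypothetical placeholders.

def solve_with_escalation(task, cheap_solve, expensive_solve, check,
                          max_cheap_tries=3):
    """Try the cheap solver a few times; escalate only if every attempt fails."""
    for _ in range(max_cheap_tries):
        result = cheap_solve(task)
        if check(result):
            return result          # cheap agent got there on its own
    return expensive_solve(task)   # "ask for help" from the stronger model
```

The knob is `max_cheap_tries`: raising it trades wall-clock time for spend, which is exactly the speed/quality/cost equilibrium discussed upthread.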

> nobody is sitting there waiting / watching the LLM code anyway

My personal experience is that for production-grade code you need to steer the agent more often than not... so yes, at least some of us are watching the LLM code.