It was never that great, it seems. For all of 2025 there was virtually no improvement in the rate at which models produced quality code. They only got better at passing automated tests.
It's not true that there was no improvement in the rate at which models produced quality code.
In Jan 2025 the frontier was Claude 3.5 Sonnet, Gemini 1.5 Pro, and OpenAI's GPT-4o.
As someone who used all of those models, as well as today's frontier models, I can say today's models are a significant step up from those.
This is likely true. I think model quality has stagnated, and it's likely a non-trivial task to find a new improvement vector. Scaling the width of the model (which has been the driving force behind the pace of improvement thus far) seems to have reached its limit.
It will be interesting to see the implications of this. Tooling can only do so much in the long term.
How do you know that width scaling has been the driving force of improvement?
I am no insider and have never even tried to build an LLM, so I can only guess. But the general sentiment seems to be that this is the case. If you are interested, I would recommend you read the MIT paper "Superposition Yields Robust Neural Scaling" [0]. It confirms an interesting trend: models represent more features/concepts than they have clean independent dimensions, so features overlap. Increasing model dimension reduces this geometric interference, which lowers loss in a predictable way, but with diminishing returns.
This has, in my opinion, likely been the primary vector for getting better models thus far, but the paper shows mathematically that each added dimension yields diminishing returns. Each step will get more and more expensive, and the cost-return tradeoff has probably already made further width scaling infeasible.
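To make the diminishing-returns point concrete, here is a toy sketch (my own illustration, not code from the paper) assuming the roughly inverse power-law loss-vs-width relationship the paper describes; the constant c, the exponent alpha = 1, and the widths in the loop are all arbitrary choices of mine:

    # Toy model: loss falls off as a power law in model width m,
    # loss(m) ~ c * m**(-alpha). The paper's strong-superposition
    # regime has alpha near 1, which is what I assume here.
    def loss(m: float, c: float = 1.0, alpha: float = 1.0) -> float:
        """Hypothetical power-law loss as a function of model width m."""
        return c * m ** (-alpha)

    for m in (1_000, 10_000, 100_000):
        # Absolute loss reduction bought by a 10x increase in width.
        gain = loss(m) - loss(10 * m)
        print(f"width {m:>7,}: loss {loss(m):.1e}, gain from 10x widening {gain:.1e}")

Each 10x widening buys a 10x smaller absolute loss reduction, while the wider model costs far more to train and serve. That is the cost-return squeeze I mean.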
Ilya appears to support this sentiment as well. [1]
[0] - https://openreview.net/forum?id=knPz7gtjPW
[1] - https://www.businessinsider.com/openai-cofounder-ilya-sutske...
I mean, it's not exactly a PhD-level question. One can infer from the extreme demand for GPUs and DRAM, plus new data center construction, that all the providers are banking on width.
No? That could just be FOMO, actual adoption, or a number of other things.
But that's an enormous source of coding productivity, and it's why Anthropic is worth billions. The reason SWE-bench has been so successful and useful for coding is that software engineering has a ton of tradition and infrastructure for making and using automated tests.
Maybe this is why these companies' pricing plans are getting more limited and expensive.