> The improvements to programming (IME) haven’t come from improved models, they’ve come from agents, tooling, and environment integrations.

I disagree. This is almost entirely down to model capability increases. I've stated this elsewhere: https://news.ycombinator.com/item?id=46362342

Improved tooling, agent scaffolds, whatever, are symptoms of improved model capabilities, not the cause of better capabilities. Put a 2023-era model such as GPT-4, or even a 2024-era model such as Sonnet 3.5, in today's tooling and it would crash and burn.

The scaffolding and tooling for these models have been tried in different forms and prototypes ever since GPT-3 came out in 2020. The only reason they're taking off in 2025 is that models are finally capable enough to use them.

Yet when you compare the same model in 2 different agents you can easily see capability differences. But cross (same tier) model in the same agent is much less stark.

My personal opinion is that there was a threshold earlier this year where the models got basically competent enough to be used for serious programming work. But all the major on-the-ground improvements since then have come from the agents, and not all agents are equal, while all SOTA models effectively are.

> Yet when you compare the same model in 2 different agents you can easily see capability differences.

Yes, definitely. But this is to be expected. Heck, take the same person and put them in two different environments and they'll have very different performance!

> But cross (same tier) model in the same agent is much less stark.

Unclear what you mean by this. I do agree that the big three companies (OpenAI, Anthropic, Google DeepMind) are all more or less neck and neck in SOTA models, but every new generation has been a leap. They just keep leaping over each other.

If you compare e.g. Opus 4.1 and Opus 4.5 in the same agent harness, Opus 4.5 is way better. If you compare Gemini 3 Pro and Gemini 2.5 Pro in the same agent harness, Gemini 3 is way better. I don't do much coding or benchmarking with OpenAI's family of models, but anecdotally I have heard the same thing about going from GPT-5 to GPT-5.2.

The on-the-ground improvements have been coming primarily from model improvements, not harness improvements (the latter are unlocked by the former). Again, it's not that there were breakthroughs in agent frameworks; the ideas we're seeing now have all been tried before. Models simply weren't capable enough to actually use them. It's just that more and more (pre-tried!) frameworks are starting to make sense now. Indeed, there are certain frameworks and workflows that simply did not make sense with Q2-Q3 2025 models that now make sense with Q4 2025 models.

I actually have spent a lot of time doing comparisons between the 4.1 and 4.5 Claude models (and lately the 5.1->5.2 ChatGPT models), and for many, many tasks there is no significant improvement.

All things being equal, I agree that the models are improving, but for many of the tasks I’m testing, what shows the most improvement is the agent. Agents choosing the appropriate model for the task, for instance, has been huge.

I do believe there is a beneficial symbiosis, but in my results the agents provide much bigger variance than the models.
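
One way to make the "agents vs. models" variance comparison concrete, as a rough sketch and not how anyone in this thread actually measured it: run every model in every agent on the same fixed task set, then compare the spread across agents (model held fixed) with the spread across models (agent held fixed). The function name `variance_by_factor` and the `scores` layout are made up for illustration, and the sketch assumes a full grid, i.e. a score for every (model, agent) pair.

```python
from statistics import mean, pvariance


def variance_by_factor(scores):
    """Compare spread across agents vs. spread across models.

    `scores` is a hypothetical mapping of (model, agent) -> pass rate
    on one fixed task set. Every (model, agent) pair must be present.
    """
    models = sorted({m for m, _ in scores})
    agents = sorted({a for _, a in scores})

    # Hold the model fixed, vary the agent, then average over models.
    across_agents = mean(
        pvariance([scores[(m, a)] for a in agents]) for m in models
    )
    # Hold the agent fixed, vary the model, then average over agents.
    across_models = mean(
        pvariance([scores[(m, a)] for m in models]) for a in agents
    )
    return {"across_agents": across_agents, "across_models": across_models}
```

If `across_agents` comes out much larger than `across_models`, the harness is the bigger lever on that task set; the reverse would point at the model. Either way the result only says something about the specific tasks, models, and agents in the grid.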