I have actually spent a lot of time comparing the 4.1 and 4.5 Claude models (and lately the 5.1->5.2 ChatGPT models), and for many tasks there is no significant improvement.
All things being equal, I agree that the models are improving, but for many of the tasks I’m testing, what has improved the most is the agent. Agents choosing the appropriate model for the task, for instance, has been huge.
I do believe there is a beneficial symbiosis, but in my results the agents account for much more of the variance than the model does.