I’ve been living this experience and using the latest models at work throughout this time. The failure modes of LLMs have not fundamentally changed. The makers are not awfully transparent about what exactly changes in each model release, the way you know what changed in, e.g., a new Django version. But there has not been a paradigm shift. My belief/guess (from the outside) is that the big change you think you’re experiencing could be the result of many things: better post-training processes (RLHF) that get models to run a predefined set of commands, like always running tests, or other marginal improvements to the models and a sharper focus on programming tasks. To be clear, these improvements are welcome and useful, just not the groundbreaking change some claim.