For me it's simple: even the best models are "lazy" and will confidently declare they're finished when they're obviously not. And the immensely increased training effort required to get GPT-5's mild improvements on benchmarks suggests that quality won't go away anytime soon.
Sounds like it's partly a nuanced trade-off. A model can just as easily be too eager and add changes I didn't ask for. Being lazy is better than continuing down a bad path.
There's a long distance between "nuanced behavior" and what it actually does now, which is "complete 6 items of an explicit 10-item task list and then ask the user again if they want to continue".
gpt-5 is extremely cheap; what makes you think they couldn't produce a larger, smarter, more expensive model?
gpt-5 was created to be able to service 200m daily active users.
> what makes you think they couldn't produce a larger, smarter, more expensive model?
Because they already tried making a much larger, more expensive model: GPT-4.5. It failed — it wasn't actually that much smarter despite being insanely expensive — and they retired it after a few months.
That was not a reasoning model, though.
None of them are reasoning models. Some of them have a loop of word-outputting.