Out of curiosity, did you give a test for them to validate the code?
I had a test failing because I introduced a silly comparison bug (> instead of <), and claude 4.6 opus figured out it wasn't the test the problem, but the code and fixed the bug (which I had missed).
There was a test and a very useful golang error that literally explain what was wrong. The model tried implementing a solution, failed and when I pointed out the error most of them just rolled back the "solution"
What exact models were you using? And with what settings? 4.6 / 5.3 codex both with thinking / high modes?
Ok, thanks for the info