I can tell you for a fact that Claude 4.7 was NOT doing what I told it to do on a pretty simple architectural refactor (in fact, it repeatedly did the clear and complete opposite), and that Codex did better and DeepSeek much better.
It was given very simple ways to verify success. It simply didn't use them and said it was at a good stopping point, despite moving in the WRONG direction, not even doing 1% of the task, and having been told to see the task through to completion.
Meanwhile, Codex broke it down into 3 steps and just got it done...
No "I'm going to give it to you straight, this is a large risky commit that could go sideways, so I'm just not going to do anything instead."
Claude worked on it for almost 200 commits over 2 weeks, and I typically needed to prompt it 3x before it would even TRY to make any progress instead of just wasting tokens ignoring me and telling me how big and risky it was.
Maybe Claude is just particularly terrible at this type of refactor. I'm not sure why that would be.