I also find the subjective nature of these models interesting, but in this case the difference in my experience between Sonnet 4.5 and GPT-5 Codex, and especially GPT-5 Pro, for code review is pretty stark. GPT-5 is consistently much better at the kind of hard logic problems that code review often involves.
I have had GPT-5 point out dozens of complex bugs to me. Often in these cases I will try to see if other models can spot the same problems; Gemini occasionally has, but the Claude models never have (using Opus 4, 4.1, and Sonnet 4.5). These are bugs like subtle race conditions or deadlocks that arise from interactions between different parts of the codebase. GPT-5 and Gemini can spot these types of bugs with decent accuracy, while I've never had Claude point out a bug like this.
If you haven't tried it, I would try the Codex /review feature and compare its results to asking Sonnet to do a review. For me, the difference is very clear. For actual coding tasks, results from both models are much more variable, but for code review I've never had an instance where Claude pointed out a serious bug that GPT-5 missed. And I use these tools for code review all the time.
I've noticed something similar. I've been working on some concurrency libraries for Elixir, and Claude constantly gets things wrong, but GPT-5 can recognize the techniques I'm using and the tradeoffs involved.
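To make that concrete, here's a hypothetical, stripped-down sketch (not from my actual library) of the kind of cross-module deadlock being described above: two GenServers that call each other synchronously, so each blocks waiting for the other and the bug only surfaces as sporadic timeouts.

```elixir
defmodule Cache do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  def fetch(key), do: GenServer.call(__MODULE__, {:fetch, key})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:fetch, key}, _from, state) do
    # On a miss, synchronously ask the Loader to produce the value.
    # While this call is in flight, the Cache process is blocked here.
    value = Map.get_lazy(state, key, fn -> Loader.load(key) end)
    {:reply, value, Map.put(state, key, value)}
  end
end

defmodule Loader do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  def load(key), do: GenServer.call(__MODULE__, {:load, key})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:load, key}, _from, state) do
    # Bug: on a cache miss this calls back into Cache, whose process is
    # still blocked waiting on this very call. Neither handle_call can
    # proceed, and both GenServer.call invocations eventually time out.
    cached = Cache.fetch(key)
    {:reply, cached || expensive_load(key), state}
  end

  defp expensive_load(key), do: {:loaded, key}
end
```

Neither module looks wrong in isolation; the deadlock only exists in how they interact, which is exactly the kind of cross-file reasoning where I see the models diverge.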