Every plan Opus creates in Planning mode gets run through ChatGPT 5.2. It catches at least 3 or 4 serious issues that Claude didn’t think of. It typically takes 2 or 3 back-and-forths for Claude to ultimately get it right.
I’m in Claude Code so often (x20 Max) and I’m so comfortable with my environment setup with hooks (for guardrails and context) that I haven’t given Codex a serious shot yet.
The same thing can be said about running an Opus plan through another Opus instance.
It's often not that a different model is better (it still has to be a good model, of course). It's that the second chat has a different objective, so it will identify different things.
My (admittedly one person's anecdotal) experience has been that when I ask Codex and Claude to make a plan/fix and then ask them both to review it, they both agree that Codex's version is better quality. This is on a 140K LOC codebase with an unreasonable amount of time spent on rules (lint, format, commit, etc), on specifying coding patterns, on documenting per workspace README.md, etc.
That's a fair point and yet I deeply believe Codex is better here. After finishing a big task, I used two fresh instances of Claude and Codex to review it. Codex finds more issues in ~9 out of 10 cases.
While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.
Every time Claude Code finishes a task, I have it run a full review of its own work against a very detailed plan, and it catches many things it didn’t see before. It works well and it’s part of the refinement process. We all know it almost never gets a big chunk of generated code 100% right on the first try.
How exactly do you plan/initiate a review from the terminal? Open up a new shell/instance of Claude and initiate the review with fresh context?
Yeah. It dumps context into various .md files, like TODO.md.
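Here's roughly how that fresh-context review can be scripted; this is only a sketch, assuming the Claude Code CLI's non-interactive `claude -p` print mode and a TODO.md handoff file written by the previous session (file names and prompt wording are just conventions, not anything official):

```python
#!/usr/bin/env python3
"""Kick off a fresh-context review of a finished task.

Sketch only: assumes `claude` is on PATH with its `-p` (print/non-interactive)
mode, and that the working session dumped its context into TODO.md.
"""
import pathlib
import subprocess

# Context the previous session wrote out (the file name is just a convention).
todo = pathlib.Path("TODO.md").read_text()

prompt = (
    "You are reviewing work you did not write. "
    "Here is the task context and what was supposedly completed:\n\n"
    f"{todo}\n\n"
    "Verify the implementation against this plan and list any "
    "discrepancies, missed edge cases, or regressions."
)

# A new `claude -p` invocation starts with fresh context, so it cannot
# lean on whatever the original session assumed.
result = subprocess.run(
    ["claude", "-p", prompt],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```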
Thanks for the tip. I was dubious, but I tried GPT 5.2 on a large plan first and it was way better than reviewing it with Claude itself or with Gemini. I then used it to help me with a feature I was reviewing, and it caught real discrepancies between the plan and the actual integration!
This makes me think: are there any "pair-programming" vibecoding tools that would use two different models and have them check each other?
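The loop itself is easy to wire up by hand even without a dedicated tool. Below is a minimal sketch assuming the official `anthropic` and `openai` Python SDKs with API keys in the environment; the model names are placeholders and the single draft/review/revise pass stands in for the multi-round back-and-forth described above:

```python
"""Cross-model review: one model drafts a plan, a different model critiques it.

Sketch under assumptions: `anthropic` and `openai` SDKs installed, keys set in
ANTHROPIC_API_KEY / OPENAI_API_KEY, placeholder model names.
"""
import anthropic
import openai

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
gpt = openai.OpenAI()           # reads OPENAI_API_KEY

TASK = "Plan a migration of our session store from Redis to Postgres."

# Step 1: Claude drafts the plan.
draft = claude.messages.create(
    model="claude-opus-4-1",  # placeholder model name
    max_tokens=2000,
    messages=[{"role": "user", "content": TASK}],
).content[0].text

# Step 2: a different model reviews the draft with a different objective:
# find problems, not produce the plan.
review = gpt.chat.completions.create(
    model="gpt-5",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a skeptical reviewer. "
         "List concrete flaws, risks, and omissions in the plan."},
        {"role": "user", "content": f"Task: {TASK}\n\nPlan:\n{draft}"},
    ],
).choices[0].message.content

# Step 3: feed the critique back to the original model to revise.
revised = claude.messages.create(
    model="claude-opus-4-1",
    max_tokens=2000,
    messages=[
        {"role": "user", "content": TASK},
        {"role": "assistant", "content": draft},
        {"role": "user", "content": f"A reviewer raised these issues:\n{review}\n\n"
                                     "Revise the plan to address them."},
    ],
).content[0].text

print(revised)
```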