This is correct, but I'd say there's something beyond that which is specific to Codex + GPT models. They've done some sort of training that makes it far more diligent about seeking out data races, unhandled errors / negative cases, and missing test coverage than the other models I've played with. It also seems more prone to testing its hypotheses.

This makes it slower to work with for prototyping, and if not properly disciplined it will litter your code with "legacy adapters", "bridge code", and temporary incremental refactoring steps (arguably not terrible for work in real commercial software projects). It will also create too many unit & integration tests if you're not careful.

But in my opinion it does tend to produce more reliable software, and I trust it far more than I did when I was working in Claude.

When I could afford it, I had both plans running: Claude to produce new features, and then Codex to brutally critique them, battle-test them, sharpen the edges, and produce better tests. This flow went extremely well.

Now I just work with Codex and various open models.