And Fable is still worse than Codex.

I use both and the only thing (as always) that I will use Claude for is UI design.

Opus 4.8 and now Fable are still both worse at actually getting the job done than the Codex model. Claude models write FAR too much code when it's not needed, burn far too many tokens, write unnecessary tests, write plans that are 5 pages longer than needed, etc.

Have you actually compared code quality and plan quality versus Codex? It's demonstrably worse.

I don't know what problems you're working on but Fable is not just better, it is a step change from GPT 5.5 in my experience. It feels at least one major model generation ahead.

One Hacker News commenter says it's worse, another retorts it's a step change and even includes emphasis! Will the first commenter retort back that it's been a double dog step change in the opposite direction? Can't wait to see how this comment thread unfolds!

It doesn't for me. I use Fable to make plans, then give them to GPT 5.5 to review, and it always finds flaws and edge cases that Fable misses (some are really critical). It was the same with Opus 4.8. I'll admit it finds a bit fewer issues now, but Fable feels more like an incremental improvement than a major generation ahead.

For that test to be fair, you also have to let a fresh agent (a subagent) of the same model do the same review and compare the results.

The fact that a review helps does not prove that the choice of review model is what made the difference.

You reviewing your own writing helps too!

This is exactly what I find too. I make plans in both models and have each model review the other's plan. And Claude usually agrees (65-80% of the time) that the Codex plan included things it didn't think of, or was better in some other way.

Note, this is better than it was with Opus, where it was more like 90% of the time the Codex plans were obviously better.

Curious, which model do you use for Codex? I'm very happy with the solutions '5.5 high' finds. It's like it understands exactly what I mean and it also anticipates all sorts of situations. Before I used '5.5 medium' for some time and it was a bit underwhelming. It may sound funny but it's like it didn't care that much to do a good job.

I use GPT 5.5 High Fast. I often benchmark it versus Fable (and previously did versus Opus) and it's night and day.

Claude still writes (and always has written) far too much code to fulfill a given spec or plan. It misses edge cases and is generally far too verbose.

Claude also is (and even more so with Fable) super tokenmaxxing, i.e. it seems tuned to use the max amount of tokens per task, whereas Codex will simply get your job done as you specified with the minimum fuss and tokens.

Codex feels way more steerable and just more "professional", as though I'm working with a seasoned engineer, versus someone smart but overexcitable, like a super smart associate engineer.

What are your harnesses? Do you have the same skillsets/tools/etc for both?

I use Codex and Claude Code. I've used both Codex and CC since release with basically every model they've ever released. I try both for almost every plan that I write and benchmark the plans against each other, and Claude almost always acknowledges that the Codex plan is better! Even now with Fable, this still happens.

As in, I give the exact same prompt to Fable and GPT 5.5 Pro, have each produce a plan, then give each model the other's plan. Claude always realizes it missed stuff and Codex usually ends up finding missing things in Claude's plan.

This situation did improve with Fable versus Opus 4.8, but in general, Codex for me is still the better model.

In my experience writing about 50 programs with Fable, Opus, and GPT, Fable is a significant step change better than Opus, which is significantly better than GPT. We must be doing different things.

From what I’ve seen all three are close enough that I would be hard pressed to pick one. It seems to matter much more how I prompt than which of the three I am using.

I'm writing low-level Rust, distributed systems, also sandboxing tech which has to be secure and performant.

The only thing I have Fable do now is create UIs or otherwise front-ends for systems where correctness doesn't matter as much.

Anthropic models lead at making nice-looking UIs for sure, but when it comes to making sure my Rust code is actually 100% correct and uses 1% of CPU most of the time, Codex is king.

Definitely not in my experience. I usually write distributed systems and back end code, and Fable is so much better at those than Codex that it's not even a comparison. Fable feels like it's a year ahead.

Interesting, I’d love to see the comparisons of your system using Claude vs Codex. I have about 20 years of experience in distributed systems and super high scale at several FAANGs, and also building AI model serving infra for roughly 20k transactions per second.

For me, Claude makes bone headed decisions all the time, like glaring errors, not even particularly subtle.

But the more obvious flag is the amount of irrelevant code and tests which Fable writes. Like it regularly writes 2X or 3X the amount of code and tests that are needed. It’s an expert at writing plausible but entirely useless tests.

But I think that if you’re a more junior engineer or haven’t been around the block you can easily think that “more code equals smarter”. Claude ends up creating a massive, hard to manage codebase, and if you look at the Claude Code codebase (which was leaked), you can see I’m right!

The Claude Code codebase is terrible. And presumably Anthropic has been using their smartest models for working on Claude Code. I wrote my own coding harness with Codex (as a fun experiment) which used a fraction of the code and is about 100X more performant and memory efficient (than Claude Code)!