Interesting, I’d love to see a comparison of your system using Claude vs Codex. I have about 20 years of experience in distributed systems at super high scale at several FAANGs, and also building AI model serving infra handling roughly 20k transactions per second.

For me, Claude makes boneheaded decisions all the time: glaring errors, not even particularly subtle ones.

But the more obvious flag is the amount of irrelevant code and tests which Fable writes. Like it regularly writes 2X or 3X the amount of code and tests that are needed. It’s an expert at writing plausible but entirely useless tests.

But I think that if you’re a more junior engineer or haven’t been around the block, you can easily think that “more code equals smarter”. Claude ends up creating a massive, hard-to-manage codebase, and if you look at the Claude Code codebase (which was leaked), you can see I’m right!

The Claude Code codebase is terrible. And presumably Anthropic has been using their smartest models for working on Claude Code. I wrote my own coding harness with Codex (as a fun experiment) which used a fraction of the code and is about 100X more performant and memory efficient (than Claude Code)!