My experience with both Opus and Codex is that they both forget to implement big chunks of a spec unless you give them the means to self-validate their conformance to it. I sometimes find myself spending more time building the tooling to enable this than doing the actual work.
The key is generating a task list from the spec. Kiro IDE (not the CLI) generates tasks.md automatically: a checklist that Opus has to work through and check off.
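For readers who haven't seen one, a generated tasks.md is just a markdown checklist; this is a hypothetical example of the shape (structure assumed, not Kiro's exact output):

```markdown
# Tasks: user authentication spec

- [ ] 1. Add the users table migration
- [ ] 2. Implement session creation
  - [ ] 2.1 Issue signed session cookies
  - [ ] 2.2 Expire idle sessions
- [ ] 3. Write integration tests for login/logout
```

The model edits the file as it goes, flipping `[ ]` to `[x]`, so unchecked items are visible evidence of unfinished work.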
Try Kiro. It's just an all-round excellent spec-driven IDE.
You can still use Claude Code to implement code from the spec, but Kiro is far better at generating the specs.
P.S. If you don't use Kiro (though I recommend it), there's a newer option: Yegge's beads. After you install it, prompt Claude Code to `write the plan in epics, stories and tasks in beads`. Opus will, through tool use, ensure every bead is implemented. It's a higher-variance approach, though; Kiro is much more systematic.
I’ve even built my own todo tool in Zig, backed by SQLite, which allows arbitrarily deep todo hierarchies. Those clankers just start ignoring tasks, or checking them off with a wontfix comment, the first time they hit adversity. Codex is better at this because it keeps going at hard problems, but then it compacts so many times over that it forgets the todo instructions.
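The SQLite-backed hierarchy part is simple to sketch. This is my own minimal Python approximation (an adjacency list plus a recursive CTE), not the author's actual Zig implementation; all names and the schema are assumptions:

```python
# Sketch of an SQLite-backed todo store with arbitrary nesting depth.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE todos (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES todos(id),  -- NULL for top-level items
        title     TEXT NOT NULL,
        status    TEXT NOT NULL DEFAULT 'open'   -- 'open' | 'done' | 'wontfix'
    );
""")

def add(title, parent_id=None):
    """Insert a todo, optionally nested under a parent; return its id."""
    cur = conn.execute(
        "INSERT INTO todos (parent_id, title) VALUES (?, ?)",
        (parent_id, title))
    return cur.lastrowid

def subtree(root_id):
    """A todo and all its descendants, walked with a recursive CTE."""
    return conn.execute("""
        WITH RECURSIVE tree(id, title, status, depth) AS (
            SELECT id, title, status, 0 FROM todos WHERE id = ?
            UNION ALL
            SELECT t.id, t.title, t.status, tree.depth + 1
            FROM todos t JOIN tree ON t.parent_id = tree.id
        )
        SELECT title, status, depth FROM tree ORDER BY depth
    """, (root_id,)).fetchall()

epic = add("Implement auth")
story = add("Session handling", parent_id=epic)
add("Expire idle sessions", parent_id=story)

for title, status, depth in subtree(epic):
    print("  " * depth + f"[{status}] {title}")
```

The recursive CTE is what makes "arbitrary levels" cheap: the depth is bounded only by the data, not by the schema.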
I just use beads or GitHub issues. Plan/spec first, then split it into issues.
Then reset context and implement each task one by one. Nothing gets forgotten.
A key part of this is making sure each bite-sized issue references the related holistic concerns: code quality, testing, documentation style, commit strategy, git workflow, etc.
In my experience, you have to refer explicitly to the relevant docs in every single issue for it to work well.
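Concretely, a hypothetical issue body with those explicit references baked in (all file names invented for illustration):

```markdown
## Add rate limiting to the login endpoint

- Implement per-IP rate limiting in the auth middleware
- Tests: follow the layout and naming in docs/testing.md
- Style: error handling per docs/code-style.md
- Commits: one logical change per commit, per docs/git-workflow.md
```

Repeating the doc links in every issue feels redundant, but it means each fresh-context session sees the conventions without relying on memory from a previous one.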