The "AI implementation" step in my workflow includes separate agents dedicated to testing and reviewing changes. The self feedback loop catches a lot of errors and mistakes, but it rarely produces working code in one go.
In my experience, the generated code handles the happy path, but isn't great at edge cases or at writing clean code, even with explicit instructions in the initial prompt.
We usually end up doing multiple iterations on what claude/codex output: pointing out issues, asking for changes, etc. A rough sketch of the loop is below.
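For concreteness, here's a minimal sketch of what that generate → test → review loop looks like in my setup. The agent calls are hypothetical stand-ins (you'd wire them to whatever model/provider you use), not a real API; the test runner assumes a pytest suite:

```python
import subprocess

MAX_ITERATIONS = 5  # hand back to a human after this many passes


def call_agent(role: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (claude, codex, etc.)."""
    raise NotImplementedError("wire this to your model/provider of choice")


def implement_change(task: str, feedback: str = "") -> str:
    # implementer agent: produces a patch, given the task and prior feedback
    return call_agent("implementer", f"Task: {task}\nPrior feedback: {feedback}")


def run_tests() -> tuple[bool, str]:
    """Run the test suite; returns (passed, combined output)."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def review(patch: str) -> str:
    """Separate reviewer agent; returns '' when it has no complaints."""
    return call_agent("reviewer", f"Review this patch for edge cases and cleanliness:\n{patch}")


def workflow(task: str) -> str | None:
    feedback = ""
    for _ in range(MAX_ITERATIONS):
        patch = implement_change(task, feedback)
        passed, test_output = run_tests()
        comments = review(patch)
        if passed and not comments:
            return patch  # tests green and reviewer satisfied
        # feed failures back in; in practice this rarely converges on pass one
        feedback = f"Test output:\n{test_output}\nReview comments:\n{comments}"
    return None  # still failing after N rounds: a human steps in, as usual
```

The loop bounds the number of self-feedback rounds on purpose; past a few iterations the agents tend to thrash, and it's faster to point out the issues yourself.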