I tend to agree.

If you have invested significantly in the planning phase and there is momentum in the architecture and conventions that already exist in the project, the implementation phase might not need as much oversight as is suggested here.

> You can discover that your initial idea was dumb and a better one exists

The planning and architecture phase is usually where I make these kinds of discoveries, at a high level.

> Your agent might go “off the rails” and start doing something you don’t want it to do

Candidly, these orthogonal, inadvertent edits aren't as bad as they once were, and for impactful changes there should be at least some test coverage, even if that coverage just "freezes" what was implemented.
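To make "freezing" concrete, here is a minimal sketch of a characterization test in Python. `format_invoice` and the recorded `FROZEN` string are hypothetical stand-ins, not anything from a real project; the idea is only that the current output is captured once and any later drift fails the test.

```python
import json


def format_invoice(items):
    """Hypothetical existing function whose current behaviour we want to pin."""
    total = sum(qty * price for _, qty, price in items)
    return {"lines": len(items), "total": round(total, 2)}


# Output recorded once from the current implementation, then committed.
# It documents what the code does today, not what it ideally should do.
FROZEN = '{"lines": 2, "total": 25.5}'


def test_format_invoice_is_frozen():
    result = format_invoice([("widget", 2, 10.0), ("gadget", 1, 5.5)])
    # sort_keys keeps the serialized form stable regardless of dict ordering.
    assert json.dumps(result, sort_keys=True) == FROZEN
```

If an agent later changes this output unintentionally, the test fails; if the change is intentional, updating `FROZEN` becomes a visible, reviewable edit.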

As you mentioned, the final review discussion is a good chance to verify things beyond what the review or adversarial review agents find.

I think the obvious solution here is to beef up the test side of the app, much more than you would when writing code by hand. Tests represent project knowledge in executable form. The LLM doesn't have to carefully remember every detail the tests encode, and you don't need to vet every small interaction, because the suite automates part of the review work as well.

Even better if the project was built from the start to be easy to test and observe. But my golden rule remains: no code without tests, and keep expanding the test suite.

I agree. Human-steered, AI-implemented test cases can at least capture the acceptance criteria.

It's then more efficient to check whether existing test cases are being modified as part of delivering something new, and to ask why.
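A rough sketch of how that check could be automated, assuming a git repository, tests living under a `tests/` directory, and a base branch called `main` (all three are assumptions, as are the function names). It reads `git diff --name-status` and flags test files that already existed and were modified, deleted, renamed, or type-changed, while ignoring newly added ones.

```python
import subprocess


def modified_existing_tests(name_status: str, test_prefix: str = "tests/") -> list[str]:
    """Return pre-existing test files touched by a change.

    `name_status` is the raw output of `git diff --name-status`,
    one tab-separated entry per line, e.g. "M\ttests/test_a.py".
    """
    flagged = []
    for line in name_status.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        status, path = parts[0], parts[1]
        # M = modified, D = deleted, R = renamed, T = type changed.
        # For renames the second field is the old path, which is the one to flag.
        # A (added) is skipped: a brand-new test is not an edited one.
        if path.startswith(test_prefix) and status[:1] in {"M", "D", "R", "T"}:
            flagged.append(path)
    return flagged


def changed_tests_against(base: str = "main") -> list[str]:
    """Run git and list existing tests changed since the merge base with `base`."""
    out = subprocess.run(
        ["git", "diff", "--name-status", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return modified_existing_tests(out)
```

In CI this could post the flagged paths as a review comment, so a human only has to answer "why did this existing test change?" rather than read every diff.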