Everyone is circling around this. We are shifting to "code factories" that take user intent in at one end and crank out code at the other end. The big question: can you trust it?
We're building our tooling around it (thanks, Claude!) and seeing what works. Personally, I have my own harness and I've been focused on 1) discovering issues (in the broadest sense) and 2) categorizing the issues into "hard" and "easy" to solve inside the pipeline itself.
I found patterns in the errors the coding agents made in my harness, which I then exploited. I have an automated workflow that produces code in stages. I added structured checks to catch the "easy" problems at stage boundaries. It fixes those automatically. It escalates the "hard" problems to me.
In the end, this structure took me from ~73% first-pass to over 90%.