> However I guess that at least some of that can be mitigated by distilling out a system description and then running agents again to refactor the entire thing.

The problem with this is that the code is the spec. There are 1000 times more decisions made in the implementation details than are ever going to be recorded in a test suite or a spec.

The only way for that to work differently is if the spec is as complex as the code, and at that level, what’s the point?

With what you’re describing, every time you regenerate the whole thing you’re going to get different behavior, which is just madness.
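
A toy sketch of what I mean, with made-up function names and a made-up spec ("return the users sorted by age"). Assume these are two regenerations from that same spec. Both pass the only check the spec implies, yet they disagree on something the spec never mentions: what happens on ties.

```python
# Hypothetical example: two "regenerations" of the same one-line spec,
# "return the users sorted by age". Names and data are invented.

USERS = [
    {"name": "zed", "age": 30},
    {"name": "amy", "age": 30},
    {"name": "bob", "age": 25},
]


def sort_users_v1(users):
    # sorted() is stable, so users with equal age keep their input order.
    return sorted(users, key=lambda u: u["age"])


def sort_users_v2(users):
    # Also sorted by age, but ties are broken by name.
    # Nobody asked for that; it is an implementation decision.
    return sorted(users, key=lambda u: (u["age"], u["name"]))


def spec_test(sort_fn):
    # The only thing the spec lets us check: ages come out non-decreasing.
    ages = [u["age"] for u in sort_fn(USERS)]
    return ages == sorted(ages)


if __name__ == "__main__":
    for fn in (sort_users_v1, sort_users_v2):
        names = [u["name"] for u in fn(USERS)]
        print(fn.__name__, "passes spec:", spec_test(fn), "order:", names)
```

Anything downstream that relied on the tie order of the first version breaks on the second, and no test flags it. Pinning it down means writing the tie-breaking rule into the spec, and then doing the same for every other decision like it.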