Even with mutation testing doesn’t this still require review of the testing code?

Mutation is a test for the test suite. The question is whether a change to the program is detected by the tests. If it's not, the test suite lacks coverage. That's a high standard for test suites, and requires heavy testing of the obvious.

But if you actually can specify what the program is supposed to do, this can work. It's appropriate where the task is hard to do but easy to specify. A file system or a database can be specified in terms of large arrays. Most of the complexity of a file system is in performance and reliability. What it's supposed to do from the API perspective isn't that complicated. The same can be said for garbage collectors, databases, and other complex systems that do something that's conceptually simple but hard to do right.

Probably not going to help with a web page user interface. If you had a spec for what it was supposed to do, you'd have the design.

Correct. Where did the engineering go? First it was in code files. Then it went to prompts, followed by context, and then agent harnesses. I think the engineering has gone into architecture and testing now.

We are simply shuffling cognitive and entropic complexity around and calling it intelligence. As you said, at the end of the day the engineer - like the pilot - is ultimately the responsible party at all stages of the journey.