What I meant by "ground truth" is that it is not fuzzy, not AI-evaluated, and not a consensus. The test suite passes or it doesn't. The codebase lints or it doesn't. The performance improved or it didn't.
An agent can help you create the specification, but it's up to you to know whether it's correctly testing that you got the result you wanted.