That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.