Aren't there benchmarks that measure at the harness level as well?

How would you benchmark "agent harness communicates with user clearly"? It's 100% a feels measurement.