The "no meaningful benchmark for good agentic session performance" point resonates. Success varies so much by task type that a single metric is almost meaningless. A 60-second documentation lookup and a 30-minute refactoring session could both be successes.

Curious what shape the benchmark would take. Are you thinking per-task-type baselines, or something more like an aggregate efficiency score?
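For what it's worth, here's one way the per-task-type idea could be sketched: score each session against a baseline derived only from sessions of the same type, so a quick lookup and a long refactor are each judged on their own curve. All names, record shapes, and numbers below are hypothetical, just to make the idea concrete:

```python
from statistics import median

# Hypothetical session records: (task_type, duration_seconds, succeeded)
sessions = [
    ("doc_lookup", 45, True),
    ("doc_lookup", 80, True),
    ("doc_lookup", 300, False),
    ("refactor", 1500, True),
    ("refactor", 2100, True),
    ("refactor", 900, False),
]

def per_type_baselines(records):
    """Median duration of successful sessions, grouped by task type."""
    by_type = {}
    for task_type, duration, ok in records:
        if ok:
            by_type.setdefault(task_type, []).append(duration)
    return {t: median(ds) for t, ds in by_type.items()}

baselines = per_type_baselines(sessions)

def relative_score(task_type, duration, baselines):
    # Score relative to the type's own baseline, so a 60-second
    # doc lookup and a 30-minute refactor can both come out "normal".
    return duration / baselines[task_type]
```

The aggregate-efficiency alternative would collapse `relative_score` across types into a single number, which is exactly where the cross-type comparability problem creeps back in.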