I think the way to see this is as an organic process of discovering hard-to-game benchmarks. The loop is:
1. People discover things LLMs can kind of do, but very poorly.
2. Frontier labs pick up on these discoveries and turn them into benchmarks to monitor internally.
3. The next generation of models improves on those benchmarks, and the gains generalize to loosely correlated real-world tasks.