Don't forget, people: Goodhart's law will make this "benchmark" moot in weeks, if not days. It will get integrated back into the fold and it will look "solved", but there will still be no reasoning, just more statistical technical correctness because light has been shone on a new "problem" to solve. It will then be hailed as great "progress" that will "change everything".
PS: yes, I might or might not have a degree in corporate strategy & PR.
That is a real effect, but it’s not a nail in the coffin. There are lots of proprietary benchmarks on real product traffic that aren’t contaminated, and open questions as well. People at these labs largely know what they are doing; it’s not like they don’t know this.
Is this not true of human intelligence as well? Many smart people I know hold beliefs that have no obvious truth value.