Most interesting things to me from their benchmarks:

GPT does way worse than Opus without their harness, but better with it.

Opus 4.7 and 4.8 do way worse than 4.6. (Intentional nerfing?)

Would have been interesting to see GLM in the custom harness.

Would also be interesting to run GLM in Claude Code, which it has presumably been fine-tuned on.