My request:
“Could you explain this sentence concisely?
For the skeptics: Scoring just 10% or so in Math-Perturb-Hard below the original MATH Level 5 (hardest) dataset seems in line with or actually better than most people would do.”
Gemini 2.5 Pro:
“The sentence argues that even if a model's score drops by about 10% on the "Math-Perturb-Hard" dataset compared to the original "MATH Level 5" (hardest) dataset, this is actually a reasonable, perhaps even good, outcome. It suggests this performance decrease is likely similar to or better than how most humans would perform when facing such modified, difficult math problems.”