> Solving hard math problems requires understanding the structure of mathematical reasoning
Not when you already know all of the answers and just have to draw a line between the questions and the answers!
Please check out the post on Math-Perturb-Hard, conveniently linked above, before commenting without responding to it.
A relevant bit:
“for MATH-P-Hard, we make hard perturbations, i.e., small but fundamental modifications to the problem so that the modified problem cannot be solved using the same method as the original problem. Instead, it requires deeper math understanding and harder problem-solving skills.”
Seems to me like that would explain why it scored 10%, not 100%.
A child who knew the outcomes and guessed randomly which ones go with which questions could score the same.
My request:
“Could you explain this sentence concisely?
For the skeptics: Scoring just 10% or so in Math-Perturb-Hard below the original MATH Level 5 (hardest) dataset seems in line with or actually better than most people would do.”
Gemini 2.5 Pro:
“The sentence argues that even if a model's score drops by about 10% on the "Math-Perturb-Hard" dataset compared to the original "MATH Level 5" (hardest) dataset, this is actually a reasonable, perhaps even good, outcome. It suggests this performance decrease is likely similar to or better than how most humans would perform when facing such modified, difficult math problems.”
I think 'nopinsight' and the paper are arguing that the drop is about 10 percentage points, not that the final score is 10%. For example, DeepSeek-R1 dropped from 96.30 to 85.19, a drop of roughly 11 points. Are you actually arguing that a child guessing randomly would be able to score the same, or was this a misunderstanding?