Solving hard math problems requires understanding the structure of complex mathematical reasoning. No animal is known to be capable of that.

Most definitions and measurements of intelligence, whether from laypeople or psychologists, include the ability to reason, and mathematical reasoning is widely accepted as part of it or as a proxy for it. These are imperfect, but “intelligence” does not have a universally accepted definition.

Do you have a better measurement or definition?

Math is a contrived system, though; there are no fundamental laws of nature that require math to be done the way we do it.

A human society might develop its own math in a base-13 system, or an entirely different way of representing the same concepts. When they can't solve our base-10 math problems in a way that matches our expectations, does that mean they are parrots?

Part of the problem here is that we have yet to land on a clear, standard definition of intelligence that most people agree with. We could look to IQ, with all of its problems, but then we should be giving LLMs an IQ test rather than a math test.

The fact that much of physics can be so elegantly described by math suggests the structures of our math could be quite universal, at least in our universe.

Check out the problems in the MATH dataset, especially the Level 5 problems. They are fairly advanced (by most people’s standards), and most do not depend on which base-N number system is used to solve them. The answers would be written differently, of course, but the structures of the problems and solutions remain largely intact.
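
A toy sketch in Python (my own illustration, not from the dataset) of that point: changing the base changes only the numeral used to write the answer, not the value or the reasoning that produces it.

```python
# Toy illustration (not from the MATH dataset): the same answer value
# rendered in base 10 and base 13. Only the numeral changes; the quantity,
# and hence the structure of the problem, does not.

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer as a digit string in the given base."""
    digits = "0123456789ABC"  # enough symbols for bases up to 13
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, base)
        out.append(digits[r])
    return "".join(reversed(out))

answer = 252  # e.g. the number of ways to choose 5 items from 10, C(10, 5)
print(to_base(answer, 10))           # "252"
print(to_base(answer, 13))           # "165"  (1*169 + 6*13 + 5 = 252)
print(int(to_base(answer, 13), 13))  # 252 -- the same value either way
```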

Website for tracking IQ measurements of LLMs:

https://www.trackingai.org/

The best model already scores higher than all but the top 10-20% of most populations.

> Solving hard math problems requires understanding the structure of complex mathematical reasoning. No animal is known to be capable of that.

Except, it doesn't. Maybe some math problems do -- or maybe all of them do, when the text isn't in the training set -- but it turns out that most problems can be solved by a machine that regurgitates text, randomly, from all the math problems ever written down.

One of the ways that this debate ends in a boring cul-de-sac is that people leap to conclusions about the meaning of the challenges that they're using to define intelligence. "The problem has only been solved by humans before", they exclaim, "therefore, the solution of the problem by machine is a demonstration of human intelligence!"

We know from first principles what transformer architectures are doing. If the problem can be solved within the constraints of that simple architecture, then by definition, the problem is insufficient to define the limits of capability of a more complex system. It's very tempting to instead conclude that the system is demonstrating mysterious voodoo emergent behavior, but that's a bit like concluding that the magician really did saw the girl in half.
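
To make "what transformer architectures are doing" concrete, here is a minimal sketch of scaled dot-product attention, the core operation of the architecture. This is the generic textbook formulation (my own illustration, with the learned projections, multiple heads, and feed-forward layers omitted), not any particular model's implementation.

```python
# Minimal sketch of scaled dot-product attention (generic textbook form,
# not any specific model): each output row is a weighted average of the
# value vectors, with weights from softmax-normalized query-key dot products.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # mix the value vectors accordingly

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, 8 dimensions
print(attention(Q, K, V).shape)  # (4, 8)
```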

> Solving hard math problems requires understanding the structure of mathematical reasoning

Not when you already know all of the answers and just have to draw a line between the questions and the answers!

Please check out the post on MATH-Perturb-Hard, conveniently linked above, before making a comment that doesn't respond to it.

A relevant bit:

“for MATH-P-Hard, we make hard perturbations, i.e., small but fundamental modifications to the problem so that the modified problem cannot be solved using the same method as the original problem. Instead, it requires deeper math understanding and harder problem-solving skills.”

Seems to me like that would explain why it scored 10%, not 100%.

A child could score the same by knowing the outcomes and guessing randomly which ones go with which questions.

My request:

“Could you explain this sentence concisely?

For the skeptics: Scoring just 10% or so in Math-Perturb-Hard below the original MATH Level 5 (hardest) dataset seems in line with or actually better than most people would do.”

Gemini 2.5 Pro:

“The sentence argues that even if a model's score drops by about 10% on the "Math-Perturb-Hard" dataset compared to the original "MATH Level 5" (hardest) dataset, this is actually a reasonable, perhaps even good, outcome. It suggests this performance decrease is likely similar to or better than how most humans would perform when facing such modified, difficult math problems.”

I think 'nopinsight' and the paper are arguing that the drop is about 10 percentage points, not that the final score is 10%. For example, DeepSeek-R1 dropped from 96.30 to 85.19. Are you actually arguing that a child guessing randomly would be able to score the same, or was this a misunderstanding?
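
For concreteness, a back-of-the-envelope sketch (my own, using the scores quoted above and a simple random-matching model for the "child guessing" scenario): the reported change is a drop of roughly 11 percentage points, while randomly matching a bank of N known answers to N questions scores about 1/N on average, nowhere near 85%.

```python
# Back-of-the-envelope check (my own sketch, not from the paper): the quoted
# change is a drop of ~11 percentage points, and randomly matching N known
# answers to N questions scores about 1/N on average -- nowhere near 85%.
import random

drop = 96.30 - 85.19
print(f"DeepSeek-R1 drop: {drop:.2f} percentage points")  # ~11.11

def random_matching_score(n_questions: int, trials: int = 10_000) -> float:
    """Expected fraction correct when known answers are assigned to questions
    by a uniformly random permutation (the 'child guessing randomly' scenario)."""
    total = 0
    for _ in range(trials):
        perm = list(range(n_questions))
        random.shuffle(perm)
        total += sum(1 for question, answer in enumerate(perm) if question == answer)
    return total / (trials * n_questions)

print(f"Random matching over 100 questions: ~{random_matching_score(100):.3f}")  # ~0.010
```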