Hacker News

If we're evaluating a person, rote recall is not necessarily cheating. It's expected, but then you'd expect them to apply that rote-memorized information in a novel way later on and prove they understand how they applied their priors to the new situation.

Models don't actually reason in the same sense, so recalling rote from their training data is "cheating" in the sense that the training data cheated, not the model. So many of those benches have snaked their way into training data to make them less useful benchmarks. That, I think, is going to be a long-term difficulty in quantitatively assessing model quality and "intelligence." So it is cheating, in a sense of what we expect from the models and training data, but not in a human sense.