>"Making up for" a poor score on one test with an excellent score on another would be the opposite of generality.
Really? This happens plenty with human testing. Humans aren't general?
The score is convoluted and messy. If the same score can say materially different things about capability, then that's a bad scoring methodology.
I can't believe I have to spell this out, but it seems critical thinking goes out the window when we start talking about machine capabilities.
Just because humans are usually tested in a way that lets them make up for a lack of generality with outstanding performance in a specialization doesn't mean that's a good way to test generalization itself.
Apparently someone here doesn't know how outliers affect a mean, or, for that matter, have any clue about the purpose of the ARC-AGI benchmark.
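To make the point concrete, here's a toy illustration (the numbers are made up, not from any actual benchmark): two score profiles with the identical mean can reflect very different levels of generality.

```python
# Two hypothetical agents, five tasks each, scored 0-100.
uniform = [60, 60, 60, 60, 60]   # broad, even competence
spiky   = [100, 95, 95, 5, 5]    # specialist: outliers pull the mean up

def mean(xs):
    return sum(xs) / len(xs)

print(mean(uniform))  # 60.0
print(mean(spiky))    # 60.0 -- same mean, very different capability profile
```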
For anyone interested in critical thinking, this paper describes the original motivation behind the ARC benchmarks:
https://arxiv.org/abs/1911.01547
>Apparently someone here doesn't know how outliers affect a mean.
If the concern is that easy questions distort the mean, then the obvious fix is to reduce the proportion of easy questions, not to invent a convoluted scoring method to compensate for them after the fact. Standardized testing has dealt with this issue for a long time, and there’s a reason most systems do not handle it the way ARC-AGI 3 does. Francois is not smarter than all those people, and certainly neither are you.
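To sketch that fix with made-up question counts: suppose a narrow solver answers every easy question correctly and every hard question incorrectly. Rebalancing the question mix changes its mean score directly, with no post-hoc weighting needed.

```python
# Hypothetical question counts; a narrow solver gets all easy
# questions right and all hard questions wrong.
def narrow_solver_score(n_easy, n_hard):
    return 100 * n_easy / (n_easy + n_hard)

print(narrow_solver_score(90, 10))  # 90.0 -- easy-heavy mix flatters the narrow solver
print(narrow_solver_score(50, 50))  # 50.0 -- rebalanced mix exposes the gap on its own
```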
This shouldn't be hard to understand.