So, if you look at the way the scoring works, 100% is the max. For each task, you get full credit if you solve it in a number of steps less than or equal to the baseline. If you take more steps, you get points off. But each task is scored independently, and you can't "make up" for solving one slowly by solving another quickly.
Like suppose there were only two tasks, each with a baseline of 100 steps. You come along and solve one in only 50 steps, and the other in 200 steps. You might hope that since you solved one twice as quickly as the baseline but the other twice as slowly, those would balance out and you'd get full credit. Instead, your score is 1.0 for the first task and 0.25 for the second (the penalty is quadratic: (100/200)² = 0.25), so your total benchmark score is a mere 0.625.
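The rule described above can be sketched in a few lines of Python. To be clear, the function names and the exact formula here are my reconstruction from the example (capped credit at the baseline, quadratic falloff beyond it), not the official ARC-AGI-3 spec:

```python
def task_score(steps: int, baseline: int) -> float:
    """Score one task: full credit at or under the baseline step count,
    otherwise a quadratic penalty of (baseline / steps) squared."""
    if steps <= baseline:
        return 1.0
    return (baseline / steps) ** 2

def benchmark_score(results: list[tuple[int, int]]) -> float:
    """Average of independent per-task scores. Each task is capped at 1.0,
    so a fast solve can't compensate for a slow one."""
    return sum(task_score(steps, baseline) for steps, baseline in results) / len(results)

# The two-task example from above: 50 steps and 200 steps against a 100-step baseline.
print(task_score(50, 100))                       # 1.0 -- capped, no bonus for speed
print(task_score(200, 100))                      # 0.25 -- (100/200)^2
print(benchmark_score([(50, 100), (200, 100)]))  # 0.625
```

Note the cap at 1.0 is what prevents the "balancing out" you might hope for: the 50-step solve earns no surplus to offset the penalty on the 200-step one.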
The purpose is to benchmark both generality and intelligence. "Making up for" a poor score on one test with an excellent score on another would be the opposite of generality. There's a ceiling based on how consistent the performance is across all tasks.
>"Making up for" a poor score on one test with an excellent score on another would be the opposite of generality.
Really? This happens plenty with human testing. Humans aren't general?
The score is convoluted and messy. If the same score can say materially different things about capability, then that's a bad scoring methodology.
I can't believe I have to spell this out but it seems critical thinking goes out the window when we start talking about machine capabilities.
Just because humans are usually tested in a particular way that allows them to make up for a lack of generality with an outstanding performance in their specialization doesn't mean that is a good way to test generalization itself.
Apparently someone here doesn't know how outliers affect a mean. Or, for that matter, have any clue about the purpose of the ARC-AGI benchmark.
For anyone who is interested in critical thinking, this paper describes the original motivation behind the ARC benchmarks:
https://arxiv.org/abs/1911.01547
>Apparently someone here doesn't know how outliers affect a mean.
If the concern is that easy questions distort the mean, then the obvious fix is to reduce the proportion of easy questions, not to invent a convoluted scoring method to compensate for them after the fact. Standardized testing has dealt with this issue for a long time, and there’s a reason most systems do not handle it the way ARC-AGI 3 does. Francois is not smarter than all those people, and certainly neither are you.
This shouldn't be hard to understand.