If anything, this makes it much harder for the LLM to get high scores, and that makes the scores they’re getting all the more impressive.
The scores they're getting are on the order of 0-1% for this ARC-AGI-3 benchmark.
Didn’t I just see a post about 36% from someone?