honestly the most interesting thing about ARC-AGI-3 isn't the 0.25% scores everyone is doomposting about. it's the Duke harness result.
if you give Opus just three generic tools (READ, GREP, BASH with Python) and literally zero game-specific help, it completes all three preview games in 1,069 actions. for comparison, humans take ~900. that's actually insane. it writes its own BFS, builds a grid parser from scratch, and even solves a Lights Out puzzle with Gaussian elimination. all on its own.
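to appreciate how non-trivial that last one is: Lights Out is linear over GF(2) (pressing a cell flips it and its orthogonal neighbors), so "play optimally" reduces to solving Ax = b mod 2. here's a rough sketch of the trick — my toy reimplementation, not the model's actual code:

```python
def solve_lights_out(board):
    """board: 2D list of 0/1 (1 = lit). returns list of (row, col) presses, or None."""
    n, m = len(board), len(board[0])
    size = n * m
    rows = []
    for r in range(n):
        for c in range(m):
            mask = 0
            # pressing (r, c) flips itself plus orthogonal neighbors
            for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < m:
                    mask |= 1 << (rr * m + cc)
            # augmented bit: this cell's current state (goal: all lights off)
            rows.append(mask | (board[r][c] << size))
    pivots, top = [], 0
    for col in range(size):
        piv = next((i for i in range(top, size) if rows[i] >> col & 1), None)
        if piv is None:
            continue  # free variable, leave it unpressed
        rows[top], rows[piv] = rows[piv], rows[top]
        for i in range(size):
            if i != top and rows[i] >> col & 1:
                rows[i] ^= rows[top]  # XOR = row subtraction over GF(2)
        pivots.append((col, top))
        top += 1
    # a leftover nonzero augmented bit means the board is unsolvable
    if any(rows[i] >> size & 1 for i in range(top, size)):
        return None
    return [divmod(col, m) for col, i in pivots if rows[i] >> size & 1]
```

whether a run even lands on "this is a linear system" as its explanation of the game is exactly the hypothesis-quality question, more on that below.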
i really think the benchmark is testing two different things and just smashing them together. can the model reason about novel interactive environments? yeah, clearly it can. can it do spatial reasoning over a 64x64 grid from raw JSON with zero tools? no. but then again, neither can a human if you ripped out their visual cortex lol.
humans come "pre-installed" with specialized subsystems for exactly this stuff: a visual cortex for spatial perception, a hippocampus for persistent memory, etc. these aren't "tools" in Chollet's framing, but they're functionally equivalent to what the Duke harness provides. the model is just building its own version of those (Python for the cortex, grep for memory). it just needs permission to build them.
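and the "cortex" part really is tiny once you're allowed to write it. hypothetical frame format below (not the real ARC-AGI-3 schema, just illustrating the point): a flat cell list is hopeless to eyeball, but trivially reshaped with a few lines of Python:

```python
import json

# hypothetical frame format — NOT the real ARC-AGI-3 schema, just an
# illustration of "raw JSON in, spatial structure out"
frame = json.dumps({"width": 4, "height": 4,
                    "cells": [0, 0, 1, 0,
                              0, 1, 1, 0,
                              0, 1, 0, 0,
                              0, 0, 0, 0]})

def to_grid(frame_json):
    """reshape the flat cell list into rows — the whole 'visual cortex'."""
    f = json.loads(frame_json)
    w, h = f["width"], f["height"]
    return [f["cells"][i * w:(i + 1) * w] for i in range(h)]

# render the grid so shapes pop out instantly
for row in to_grid(frame):
    print("".join("#" if v else "." for v in row))
```

scale that from 4x4 to 64x64 and the gap between "read 4,096 numbers in a JSON blob" and "look at an ASCII render" is basically the whole perception complaint.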
the real gap the Duke team found isn't perception or memory anyway, it's hypothesis quality. some runs solve vc33 in 441 actions, others plateau past 1,500. the variance comes down to whether the model commits early to the right explanation of how the game works. that's a way more interesting and targetable finding than just saying "frontier models score below 1%."
Chollet is probably right philosophically that AGI should handle any input format without help. but reporting 0.25% when the actual reasoning gap is in hypothesis formation (not spatial perception) makes the benchmark a way worse progress indicator than it could be imo.