I think the game-playing tasks are worth exploring more. If you look at that recent Pokemon post - it's not as capable as a high school student - it took a long, long time. I have a private set of tests that any 8-year-old could easily solve but that any LLM absolutely fails on. I suspect plenty of the people claiming AGI isn't here yet have similar personal tests.
ARC-AGI-3 is coming soon, and I'm very excited for it because it's a true test of multimodality, spatial reasoning, and goal planning. I think there was a preliminary post somewhere showing that current models basically try to brute-force their way through and don't actually "learn the rules of the game" as efficiently as humans do.
How do you think they are training for the spatial part of the tests? It doesn't seem to lend itself well to token-based "reasoning". I wonder if they are just synthetically creating training data and hoping a new emergent spatial reasoning ability appears.
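To make that speculation concrete, here's a toy sketch of what synthetic spatial training data could look like: sample a small grid, apply a known transformation, and emit a (prompt, target) pair. Everything here is hypothetical illustration, not any lab's actual pipeline.

```python
import random

def rotate90(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def make_example(size=3, seed=None):
    """Generate one (prompt, target) pair for a spatial-transform task.

    Purely illustrative: a real pipeline would presumably vary the
    transformation, render the grid as an image, etc.
    """
    rng = random.Random(seed)
    grid = [[rng.randint(0, 9) for _ in range(size)] for _ in range(size)]
    prompt = f"Rotate this grid 90 degrees clockwise: {grid}"
    return prompt, rotate90(grid)

prompt, target = make_example(seed=0)
print(prompt)
print("Target:", target)
```

The open question in the comment above is exactly whether training on piles of pairs like this produces general spatial reasoning, or just memorized transforms.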
>think they are training for the spatial part of the tests
I'm not sure who the "they" here refers to, since the arc-agi-3 dataset isn't released yet and labs probably haven't begun targeting it. For arc-agi-2, synthetic data alone might have been enough to saturate the benchmark, since most frontier models now do well on it yet we haven't seen any corresponding jump in multimodal skill, with the possible exception of "nano banana".
>lend itself well to token based “reasoning”
One could perhaps do reasoning/CoT with vision tokens instead of just text tokens, or reasoning in latent space, which I'd guess might be even better. There have been papers on both, but I don't know if either approach scales. Regardless, gemini 3 / nano banana have made big gains on visual and spatial reasoning, so they must have done something to get multimodality with cross-domain transfer in a way that 4o/gpt-image couldn't.
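For intuition on the latent-space idea: instead of decoding a token after every step, the hidden state is fed straight back in as the next input, so intermediate "thoughts" never pass through the tokenizer. The tiny fixed map below is a deliberately crude stand-in for a transformer step, not any real architecture.

```python
import math
import random

random.seed(0)
D = 4  # toy hidden dimension
# Random fixed weights standing in for one "model step"
W = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]

def step(h):
    """One reasoning step: linear map + tanh, entirely in hidden space."""
    return [math.tanh(sum(W[i][j] * h[j] for j in range(D))) for i in range(D)]

def latent_reason(h, n_steps):
    """Iterate n_steps times without ever tokenizing intermediate states."""
    for _ in range(n_steps):
        h = step(h)
    return h

h_final = latent_reason([1.0, 0.0, 0.0, 0.0], n_steps=4)
print(len(h_final))  # 4
```

The contrast with text CoT is that here the per-step bottleneck is a continuous vector rather than a discrete token, which is the property the latent-reasoning papers argue could be more expressive.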
For arc-agi-3, the missing pieces seem to be both "temporal reasoning" and efficient in-context learning. If labs can train for those, it'd have benefits for things like tool-calling as well, which is why it's an exciting benchmark.