We seem to be moving the goalposts on AGI, are we not? 5 years ago, the argument that AGI wasn't here yet was that you couldn't take something like AlphaGo and use it to play chess. If you wanted that, you had to do a new training run with new training data.
But now, we have LLMs that can reliably beat video games like Pokemon, without any specialized training for playing video games. And those same LLMs can write code, do math, write poetry, be language tutors, find optimal flight routes from one city to another during the busy Christmas season, etc.
How does that not fit the definition of "General Intelligence"? It's literally as capable as a high school student for almost any general task you throw it at.
I think the games tasks are worth exploring more. If you look at that recent Pokemon post, it's not as capable as a high school student: it took a long, long time. I have a private set of tests that any 8-year-old could easily solve but that every LLM absolutely fails. I suspect plenty of the people claiming AGI isn't here yet have similar personal tests.
ARC-AGI-3 is coming soon, and I'm very excited for it, because it's a true test of multimodality, spatial reasoning, and goal planning. I think there was a preliminary post somewhere showing that current models basically try to brute-force their way through and don't actually "learn the rules of the game" as efficiently as humans do.
How do you think they are training for the spatial part of the tests? It doesn’t seem to lend itself well to token based “reasoning”. I wonder if they are just synthetically creating training data and hoping a new emergent spatial reasoning ability appears. If so, I'd picture something like the toy sketch below.
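(Purely my illustration of what such synthetic data generation could look like; every function here is made up, and a real pipeline would be far richer.)

```python
import random

def make_grid(rows=4, cols=4, symbols="ABC."):
    # Random grid of symbols; '.' marks an empty cell.
    return [[random.choice(symbols) for _ in range(cols)] for _ in range(rows)]

def rotate_90(grid):
    # Rotate the grid 90 degrees clockwise.
    return [list(row) for row in zip(*grid[::-1])]

def serialize(grid):
    # Flatten the grid into the text the model would actually see.
    return "\n".join("".join(row) for row in grid)

def make_example():
    # One (prompt, target) training pair: show a grid, ask for its rotation.
    grid = make_grid()
    return f"Rotate 90 degrees clockwise:\n{serialize(grid)}\n->\n{serialize(rotate_90(grid))}"

if __name__ == "__main__":
    for _ in range(3):
        print(make_example(), end="\n\n")
```

The open question is whether a model trained on millions of these actually acquires spatial reasoning or just memorizes the serialized transformations.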
>think they are training for the spatial part of the tests
I'm not sure which party "they" refers to here, since the ARC-AGI-3 dataset isn't released yet and labs probably haven't begun targeting it. For ARC-AGI-2, synthetic data alone might have been enough to saturate the benchmark, since most frontier models do well on it, yet we haven't seen any corresponding jump in multimodal skill use, with maybe the exception of "nano banana".
>lend itself well to token based “reasoning”
One could perhaps do reasoning/CoT with vision tokens instead of just text tokens, or reasoning in latent space, which I'd guess might be even better. There have been papers on both, but I don't know whether either approach scales. Regardless, Gemini 3 / nano banana have had big gains on visual and spatial reasoning, so they must have done something to get multimodality with cross-domain transfer in a way that 4o/gpt-image wasn't able to.
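To make the latent-space idea concrete, here's a minimal sketch, assuming a toy PyTorch setup; this is my illustration of the general concept (in the spirit of the "continuous chain of thought" papers), not any lab's actual architecture, and all names and dimensions are arbitrary:

```python
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    # Toy illustration of "reasoning in latent space": recycle hidden
    # states through the model instead of decoding intermediate
    # thoughts into text tokens.
    def __init__(self, d_model=256, n_heads=4, n_steps=4, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_steps = n_steps
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        h = self.embed(tokens)
        # Latent "thought" steps: nothing is emitted between iterations,
        # so the intermediate reasoning never round-trips through text.
        for _ in range(self.n_steps):
            h = self.block(h)
        return self.head(h)  # only now project back to the vocabulary

model = LatentReasoner()
logits = model(torch.randint(0, 1000, (1, 16)))  # shape: (1, 16, 1000)
```

The appeal is that the intermediate steps aren't forced to be expressible as text at all, which seems like a better fit for spatial problems.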
For ARC-AGI-3, the missing pieces seem to be both "temporal reasoning" and efficient in-context learning. If labs can train for those, it'd have benefits for things like tool calling as well, which is why it's an exciting benchmark.
I think we're noticing that our goalposts for AGI were largely "we'll recognize it when we see it", and now that we're getting to some interesting places, it turns out that different people actually understood very different things by that.
> 5 years ago, the argument that AGI wasn't here yet was that you couldn't take something like AlphaGo and use it to play chess.
No; that was one extremely limited example of a broader idea. If I point out that your machine is not a general calculator because it gives the wrong answer for six times nine, and you then fix the result it gives in that case, you have not refuted me. If I then find that the answer is incorrect in some other case, I am not "moving goalposts" by pointing it out.
(But also, what lxgr said.)
> But now, we have LLMs that can reliably beat video games like Pokemon, without any specialized training for playing video games. And those same LLMs can write code, do math, write poetry, be language tutors, find optimal flight routes from one city to another during the busy Christmas season, etc.
The AI systems that do most of these things are not "LLMs".
> It's literally as capable as a high school student for almost any general task you throw it at.
And yet embarrassing deficiencies are found all the time ("how many r's in strawberry", getting duped by straightforward problems dressed up to resemble classic riddles but without the actual gotcha, etc.).
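For what it's worth, the strawberry question is trivial for any program that sees characters rather than tokens, which is exactly the point:

```python
# Trivial when you operate on characters instead of BPE tokens:
print("strawberry".count("r"))  # prints 3
```

An LLM sees token IDs rather than letters, so failures like this are a window into how shallow the underlying representation can be.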
> The AI systems that do most of these things are not "LLMs".
Uh, every single example I listed except 'playing video games' is something I regularly use frontier models for. I have ChatGPT and Gemini help me find flight routes, tutor me in Spanish (Gemini 3 is really good at this), write poetry and code, solve professional math problems (usually related to finance and trading), help me fix technical issues with my phone and laptop, etc.
If you say to yourself, "hey, this thing is a general intelligence, I should try throwing it at problems I have generally", you'll find yourself astonished at the range of tasks on which it can outperform you.