I unironically believe that ARC-AGI-3 will have an introduction-to-solved time of one month.
Not very likely?
ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs.
We will see at the end of April, right? It's more of a guess than a strongly held conviction--but I see models improving rapidly at long-horizon tasks, so I think it's possible. I think a benchmark that could survive a few months (maybe) would be one that genuinely tested long-time-frame continual learning/test-time learning/test-time post-training (honestly I don't know the differences between these).
But I'm not sure how to build such benchmarks. I'm thinking of tasks like learning a language, becoming a master at chess from scratch, or becoming a skilled artist, but where the task is novel enough that the actor is nowhere close to proficient at the beginning. An example that could be of interest: here is a robot you control, you can take actions and see the results... become proficient at table tennis. Maybe another would be: here is a new video game, obtain the best possible 0% speedrun.
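Very roughly, the shape I'm imagining is a plain act-observe loop over a long run, scored only on late-run performance, so the only way to do well is to actually get better during the run. Everything in this sketch (the environment, the agent interface) is made up for illustration:

    # Toy skeleton of a long-horizon, test-time-learning benchmark.
    # NovelEnvironment stands in for "a robot / a new video game you've
    # never seen"; the agent is scored only on the tail of the run, so
    # improving *during* the run is the whole point.
    import random

    class NovelEnvironment:
        def __init__(self):
            self.target = random.randint(0, 100)    # hidden rule to be learned
        def step(self, action: int) -> float:
            return -abs(action - self.target)       # reward: closer is better

    def evaluate(agent, episodes: int = 10_000) -> float:
        env = NovelEnvironment()
        rewards = []
        for _ in range(episodes):
            action = agent.act()
            reward = env.step(action)
            agent.observe(action, reward)           # test-time learning happens here
            rewards.append(reward)
        tail = rewards[-episodes // 10:]            # score only the last 10% of the run
        return sum(tail) / len(tail)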
The AGI bar has to be set even higher, yet again.
And that's the way it should be. We're past the "Look! It can talk! How cute!" stage. AGI should be able to deal with any problem a human can.
Wow, solving useless puzzles, such a useful metric!
How is spatial reasoning useless??
It's still useful as a benchmark of cost/efficiency.
But why only a +0.5% increase for MMMU-Pro?
It's possibly label noise. But you can't tell from a single number.
You would need to check whether everyone is making mistakes on the same 20% or a different 20%. If it's the same 20%, either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem (rough sketch of the check below).
It happens. The old non-Pro MMLU had a lot of wrong answers. Even simple things like MNIST have digits labeled incorrectly or drawn so badly it's not even a digit anymore.
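Something like this is what I mean by checking the overlap, a toy sketch assuming you can pull per-question correctness out of the eval harness (the model names and answer arrays are made up):

    # For each question, count how many models got it wrong. Questions missed
    # by *every* model are suspects for bad labels or under-specification;
    # questions missed by only some models look like genuine difficulty.
    results = {
        "model_a": [True, True, False, True, False],
        "model_b": [True, True, False, False, True],
        "model_c": [True, True, False, True, True],
    }

    n_questions = len(next(iter(results.values())))
    miss_counts = [sum(not ans[i] for ans in results.values())
                   for i in range(n_questions)]

    missed_by_all = [i for i, c in enumerate(miss_counts) if c == len(results)]
    missed_by_some = [i for i, c in enumerate(miss_counts) if 0 < c < len(results)]

    print("missed by every model (check the labels):", missed_by_all)
    print("missed by only some models (probably just hard):", missed_by_some)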
Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.
But 80% sounds far from good enough; that's a 20% error rate, which is unusable for autonomous tasks. Why stop at 80%? If we're aiming for AGI, it should score 100% on any benchmark we give it.
I'm not sure the benchmark is high enough quality that >80% of the problems are well-specified and have correct labels, tbh. (But I guess this question has been studied for these benchmarks.)
Are humans 100%?
If they are knowledgeable enough and pay attention, yes. Also, if they are given enough time for the task.
But the idea of automation is to make a lot fewer mistakes than a human, not just to do things faster and worse.
Actually faster and worse is a very common characterization of a LOT of automation.
That's true.
The problem is that if the automation breaks at any point, the entire system fails. And programming automations are extremely sensitive to minor errors (e.g., a missing semicolon).
AI does have an interesting feature, though: it tends to self-heal, in a way, when given tool access and a feedback loop. The only problem is that self-healing can incorrectly heal errors, and then the final result will be wrong in hard-to-detect ways (toy example below).
So the more such hidden bugs there are, the more unexpectedly the automations will perform.
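Toy illustration of what I mean, with everything made up: the loop keeps "healing" until the checks pass, but if the checks are incomplete, the version that finally passes can still be wrong.

    # A "self-healing" loop that accepts the first candidate fix that passes
    # the test suite. The suite only checks one case, so it happily accepts
    # a fix that is wrong in general.
    def incomplete_tests(add) -> bool:
        return add(2, 2) == 4                  # weak feedback signal

    candidate_fixes = [
        lambda a, b: a - b,                    # still broken, the test catches it
        lambda a, b: a * b,                    # wrong in general, but 2*2 == 4 passes
        lambda a, b: a + b,                    # the correct fix, never reached
    ]

    def self_heal(candidates, tests):
        for fix in candidates:
            if tests(fix):
                return fix                     # "healed", as far as the loop can tell
        raise RuntimeError("no candidate passed")

    healed = self_heal(candidate_fixes, incomplete_tests)
    print(healed(3, 5))                        # prints 15, not 8: wrong, and hard to notice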
I still don't trust current AI for anything more than data parsing/classification/translation and very strict tool usage.
I don't believe in the safety and reliability of full-assistant/clawdbot usage at this time (it might be good enough by the end of the year, but then SWE-bench should be at 100%).
It's a useless, meaningless benchmark though; it just has a catchy name, as in, if the models solve it then they have "AGI", which is clearly rubbish.
ARC-AGI score isn't correlated with anything useful.
It's correlated with the ability to solve logic puzzles.
It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.
How would we actually, objectively measure a model to see if it is AGI, if not with benchmarks like ARC-AGI?
Give it a prompt like
>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home
And get back an automatic coupon code app like the user actually wanted.
ARC-AGI-2 is an IQ test. IQ tests have been shown over and over to have predictive power in humans. People who score well on them tend to be more successful.
IQ tests only work if the participants haven't trained for them. If they do similar tests a few times in a row, scores increase a lot. Current LLMs are hyper-optimized for the particular types of puzzles contained in popular "benchmarks".