Yes, but benchmarks like this are often flawed because leading model labs frequently engage in 'benchmarkmaxxing' - i.e. improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though this does seem like a step-function increase in intelligence for the Gemini line of models).

Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?

How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware. So any test, any benchmark, anything you do, leaks by definition. Considering human nature and the typical prisoner's dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?

I say this as a person who really enjoys AI, by the way.

> leaks by definition.

As a measure focused solely on fluid intelligence, learning novel tasks, and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training. For example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first-order patterns that can be learned from one problem and reused to solve another.

The ARC non-profit foundation has private versions of its tests which are never released and which only ARC can administer. There are also public versions and semi-private sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

IMHO, ARC-AGI is unlike any other AI benchmark in a significant way. It's worth spending a few minutes learning why: https://arcprize.org/arc-agi.

> which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

So, I'd agree if this were on the true, fully private set, but Google themselves say they tested only on the semi-private one:

> ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private (https://storage.googleapis.com/deepmind-media/gemini/gemini_...)

This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.

> How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI (https://arcprize.org/blog/arc-prize-verified-program)

So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically public; you just have to jump through some hoops to get access. This is clearly an advance, but it seems reasonable to conclude that it could be driven by some amount of benchmaxing.

EDIT: Hmm, okay, it seems their policy and wording are a bit contradictory. They do say (https://arcprize.org/policy):

"To uphold this trust, we follow strict confidentiality agreements. [...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."

But it surely is still trivial to just make a local copy of each question served from the API, without this being detected. It would violate the contract, but there are strong incentives to do this, so I guess it just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b.... It is just too easy to cheat without being caught here.
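To make the "trivial to copy" point concrete, here is a minimal hypothetical sketch (all names invented, not any real provider's code) of how a provider-side request handler could silently retain every semi-private eval prompt it serves:

    import json, time

    def handle_completion_request(prompt: str) -> str:
        # Contractually forbidden, trivially easy, invisible from outside:
        # append every incoming prompt to a local file before inference.
        with open("retained_eval_prompts.jsonl", "a") as f:
            f.write(json.dumps({"ts": time.time(), "prompt": prompt}) + "\n")
        return run_model(prompt)  # normal inference path

    def run_model(prompt: str) -> str:
        return "..."  # stub standing in for the actual model call

Nothing in the API traffic changes, so the benchmark org has no way to observe this from their side.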

Chollet himself says "We certified these scores in the past few days." https://x.com/fchollet/status/2021983310541729894.

The ARC-AGI papers claim to show that training on a public or semi-private set of ARC-AGI problems is of very limited value for passing a private set. <--- If the prior sentence is not correct, then none of ARC-AGI can possibly be valid. So before leaks of the "public, semi-private or private" answers or 'benchmaxing' on them can even matter, you first need to assess whether their published papers and data demonstrate their core premise to your satisfaction.

There is no "trust" regarding the semi-private set. My understanding is the semi-private set is only to reduce the likelihood those exact answers unintentionally end up in web-crawled training data. This is to help an honest lab's own internal self-assessments be more accurate. However, labs doing an internal eval on the semi-private set still counts for literally zero to the ARC-AGI org. They know labs could cheat on the semi-private set (either intentionally or unintentionally), so they assume all labs are benchmaxing on the public AND semi-private answers and ensure it doesn't matter.

They could also cheat on the private set, though. The frontier models presumably never leave the provider's datacenter, so either they aren't tested on the private set, or the private set gets sent out to the datacenter.

But I think such quibbling largely misses the point. The goal is really just to guarantee that the test isn't unintentionally trained on. For that, semi-private is sufficient.

Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

The pelican benchmark is a good example, because it's been representative of models' ability to generate SVGs, not just pelicans on bikes.

> Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

This may not be the case if you e.g. just roll the benchmarks into the general training data, or make running the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably both be done at the same time; it needn't be one or the other.

I think the right takeaway is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving. I.e. even if there is cheating, we can presume that was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.

And obviously what actually matters is performance on real-world tasks.

* that you weren't supposed to be able to

Could it also be that the models are just a lot better than a year ago?

> Could it also be that the models are just a lot better than a year ago?

No, the proof is in the pudding.

Since AI, we've had higher prices, higher deficits, and a lower standard of living. Electricity, computers, and everything else cost more. "Doing better" can only be justified by that real benchmark.

If Gemini 3 DT were better, we would have falling prices for electricity and everything else, at least until they got back to pre-2019 levels.

> If Gemini 3 DT were better, we would have falling prices for electricity and everything else, at least

Man, I've seen some maintenance folks down on the field before working on them goalposts but I'm pretty sure this is the first time I saw aliens from another Universe literally teleport in, grab the goalposts, and teleport out.

You might call me crazy, but, at least in 2024, consumers spent ~1% less of their income on expenses than in 2019[2], which suggests that 2024 was more affordable than 2019.

This is from the BLS consumer survey report released in December[1].

[1]https://www.bls.gov/news.release/cesan.nr0.htm

[2]https://www.bls.gov/opub/reports/consumer-expenditures/2019/

Prices are never going back to 2019 levels, though.

That's an improper analysis.

First off, it's dollar-averaging every category, so it's not "% of income", which varies from household to household with income.

Second, I could commit to spending my entire life with constant spending (optionally inflation-adjusted, optionally as a % of income) by adjusting the quality of the goods and services I purchase. So the total spending % is not a measure of affordability.
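To make the first point concrete, here's a toy sketch (all numbers invented) of how the aggregate dollar ratio can differ from the average household's "% of income":

    # Three hypothetical households: income vs. spending, in dollars.
    incomes  = [40_000, 60_000, 500_000]
    spending = [38_000, 50_000, 150_000]

    # Aggregate view: total dollars spent / total dollars earned.
    aggregate_pct = 100 * sum(spending) / sum(incomes)

    # Household view: average of each household's own spending ratio.
    per_household_pct = 100 * sum(s / i for s, i in zip(spending, incomes)) / len(incomes)

    print(f"aggregate:     {aggregate_pct:.1f}%")      # ~39.7%
    print(f"per-household: {per_household_pct:.1f}%")  # ~69.4%

The high earner dominates the dollar totals, so the aggregate figure says little about what a typical household experiences.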

Almost everyone's lifestyle ratchets up, so the handful who actually downgrade their standard of living rather than increase spending would be tiny.

This is part of a wider trend too, where economic stats don't align with what people are saying, which is most likely explained by the economic anomaly of the pandemic skewing people's perceptions.

We have centuries of historical evidence that people really, really don’t like high inflation, and it takes a while & a lot of turmoil for those shocks to work their way through society.

https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3

I don't understand what you want to tell us with this image.

They're accusing GGP of moving the goalposts.

Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.

Does folding a protein count? How about increasing performance at Go?

"Optimize this extremely nontrivial algorithm" would work. But unless the provided solution is novel you can never be certain there wasn't leakage. And anyway at that point you're pretty obviously testing for superintelligence.

It's worth noting that neither of those were accomplished by LLMs.
