Hacker News

nubg 2 months ago [ - ]

Weren't we barely scraping 1-10% on this with state of the art models a year ago and it was considered that this is the final boss, ie solve this and its almost AGI-like?

I ask because I cannot distinguish all the benchmarks by heart.

modeless 2 months ago [ - ]

François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.

His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

mapontosevenths 2 months ago [ - ]

> His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

That is the best definition I've yet to read. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

Thats said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We dont ask nearly so much proof from a human, we take their word for it. On the few occasions we did ask for proof it inevitably led to horrific abuse.

Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

estearum 2 months ago [ - ]

> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

This is not a good test.

A dog won't claim to be conscious but clearly is, despite you not being able to prove one way or the other.

GPT-3 will claim to be conscious and (probably) isn't, despite you not being able to prove one way or the other.

glenstein 2 months ago [ - ]

Agreed, it's a truly wild take. While I fully support the humility of not knowing, at a minimum I think we can say determinations of consciousness have some relation to specific structure and function that drive the outputs, and the actual process of deliberating on whether there's consciousness would be a discussion that's very deep in the weeds about architecture and processes.

What's fascinating is that evolution has seen fit to evolve consciousness independently on more than one occasion from different branches of life. The common ancestor of humans and octopi was, if conscious, not so in the rich way that octopi and humans later became. And not everything the brain does in terms of information processing gets kicked upstairs into consciousness. Which is fascinating because it suggests that actually being conscious is a distinctly valuable form of information parsing and problem solving for certain types of problems that's not necessarily cheaper to do with the lights out. But everything about it is about the specific structural characterizations and functions and not just whether it's output convincingly mimics subjectivity.

mapontosevenths 2 months ago [ - ]

> at a minimum I think we can say determinations of consciousness have some relation to specific structure and function that drive the outputs

Every time anyone has tried that it excludes one or more classes of human life, and sometimes led to atrocities. Let's just skip it this time.

glenstein 2 months ago [ - ]

Having trouble parsing this one. Is it meant to be a WWII reference? If anything I would say consciousness research has expanded our understanding of living beings understood to be conscious.

And I don't think it's fair or appropriate to treat study of the subject matter of consciousness like it's equivalent to 20th century authoritarian regimes signing off on executions. There's a lot of steps in the middle before you get from one to the other that distinguish them to the extent necessary and I would hope that exercise shouldn't be necessary every time consciousness research gets discussed.

mapontosevenths 2 months ago [ - ]

> Is it meant to be a WWII reference?

The sum total of human history thus far has been the repetition of that theme. "It's OK to keep slaves, they aren't smart enough to care for themselves and aren't REALLY people anyhow." Or "The Jews are no better than animals." Or "If they aren't strong enough to resist us they need our protection and should earn it!"

Humans have shown a complete and utter lack of empathy for other humans, and used it to justify slavery, genocide, oppression, and rape since the dawn of recorded history and likely well before then. Every single time the justification was some arbitrary bar used to determine what a "real" human was, and consequently exclude someone who claimed to be conscious.

This time isn't special or unique. When someone or something credibly tells you it is conscious, you don't get to tell it that it's not. It is a subjective experience of the world, and when we deny it we become the worst of what humanity has to offer.

Yes, I understand that it will be inconvenient and we may accidentally be kind to some things that didn't "deserve" kindness. I don't care. The alternative is being monstrous to some things that didn't "deserve" monstrosity.

eru 2 months ago [ - ]

I excluded all right handed, blue eyed people yesterday before breakfast. No atrocities happened because of it.

glenstein 2 months ago [ - ]

Exactly, there's a few extra steps between here and there, and it's possible to pick out what those steps are without having to conclude that giving up on all brain research is the only option.

mapontosevenths 2 months ago [ - ]

And people say the machines don't learn!

dullcrisp 2 months ago [ - ]

An LLM will claim whatever you tell it to claim. (In fact this Hacker News comment is also conscious.) A dog won’t even claim to be a good boy.

antonvs 2 months ago [ - ]

My dog wags his tail hard when I ask "hoosagoodboi?". Pretty definitive I'd say.

lief79 2 months ago [ - ]

I'm fairly sure he'd have the same response if you asked them "who's a good lion" in the same tone of voice.

*I tried hard to find an animal they wouldn't know. My initial thought of cat was more likely to fail.

2 months ago [ - ]

[deleted]

vintermann 2 months ago [ - ]

A classic relevant comic:

https://www.threepanelsoul.com/comic/dog-philosophy

WarmWash 2 months ago [ - ]

This isn't really as true anymore.

Last week gemini argued with me about an auxiliary electrical generator install method and it turned out to be right, even though I pushed back hard on it being incorrect. First time that has ever happened.

WarmWash 2 months ago [ - ]

>because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

"Answer "I don't know" if you don't know an answer to one of the questions"

mrandish 2 months ago [ - ]

I've been surprised how difficult it is for LLMs to simply answer "I don't know."

It also seems oddly difficult for them to 'right-size' the length and depth of their answers based on prior context. I either have to give it a fixed length limit or put up with exhaustive answers.

concats 2 months ago [ - ]

> I've been surprised how difficult it is for LLMs to simply answer "I don't know."

It's very difficult to train for that. Of course you can include a Question+Answer pair in your training data for which the answer is "I don't know" but in that case where you have a ready question you might as well include the real answer anyways, or else you're just training your LLM to be less knowledgeable than the alternative. But then, if you never have the pattern of "I don't know" in the training data it also won't show up in results, so what should you do?

If you could predict the blind spots ahead of time you'd plug them up, either with knowledge or with "idk". But nobody can predict the blind spots perfectly, so instead they become the main hallucinations.

CamperBob2 2 months ago [ - ]

The best pro/research-grade models from Google and OpenAI now have little difficulty recognizing when they don't know how or can't find enough information to solve a given problem. The free chatbot models rarely will, though.

londons_explore 2 months ago [ - ]

This seems true for info not in the question - eg. "Calculate the volume of a cylinder with height 10 meters".

However it is less true with info missing from the training data - ie. "I have a Diode marked UM16, what is the maximum current at 125C?"

CamperBob2 2 months ago [ - ]

This seems fine...?

https://chatgpt.com/share/698e992b-f44c-800b-a819-f899e83da2...

I don't see anything wrong with its reasoning. UM16 isn't explicitly mentioned in the data sheet, but the UM prefix is listed in the 'Device marking code' column. The model hedges its response accordingly ("If the marking is UM16 on an SMA/DO-214AC package...") and reads the graph in Fig. 1 correctly.

Of course, it took 18 minutes of crunching to get the answer, which seems a tad excessive.

londons_explore 2 months ago [ - ]

Indeed that answer is awesome. Much better than Gemini 2.5 pro which invented a 16 kilovolt diode which it just hoped would be marked "UM16".

Applejinx 2 months ago [ - ]

There is no 'I', just networks of words.

So there is nobody to know or not know… but there's lots of words.

larsonian 2 months ago [ - ]

Normal humans don't pass this benchmark either, as evidenced by the existence of religion, among other things.

Davidzheng 2 months ago [ - ]

Gpt5.2 can answer i don't know when it fails to solve a math question

mapontosevenths 2 months ago [ - ]

They all can. This is based on outdated experiences with LLM's.

criddell 2 months ago [ - ]

> The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

Maybe it's testing the wrong things then. Even those of use who are merely average can do lots of things that machines don't seem to be very good at.

I think ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught?

CamperBob2 2 months ago [ - ]

There's no shortage of laundry-folding robot demos these days. Some claim to benefit from only minimal monkey-see/monkey-do levels of training, but I don't know how credible those claims are.

criddell 2 months ago [ - ]

A robot designed to fold laundry isn't very interesting. A general purpose robot that I can bring into my home and show it how to do things that the designers never thought of is very interesting.

red75prime 2 months ago [ - ]

> Where are the dumb machines that can be taught?

2026 is going to be the year of continual learning. So, keep an eye out for them.

Davidzheng 2 months ago [ - ]

Yeah i think that's a big missing piece still. Though it might be the last one

red75prime 2 months ago [ - ]

Episodic memory might be another piece, although it can be seen as part of continuous learning.

criddell 2 months ago [ - ]

Are there any groups or labs in particular that stand out?

red75prime 2 months ago [ - ]

The statement originates from a DeepMind researcher, but I guess all major AI companies are working on that.

mapontosevenths 2 months ago [ - ]

Would you argue that people with long term memory issues are no longer conscious then?

toraway 2 months ago [ - ]

IMO, an extreme outlier in a system that was still fundamentally dependent on learning to develop until suffering from a defect (via deterioration, not flipping a switch turning off every neuron's memory/learning capability or something) isn't a particularly illustrative counter example.

mapontosevenths 2 months ago [ - ]

Originally you seemed to be claiming the machines arent conscious because they weren't capable of learning. Now it seems that things CAN be conscious if they were EVER capable of learning.

Good news! LLM's are built by training then. They just stop learning once they reach a certain age, like many humans.

criddell 2 months ago [ - ]

I wouldn’t because I have no idea what consciousness is,

sva_ 2 months ago [ - ]

> Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

I think being better at this particular benchmark does not imply they're 'smarter'.

Davidzheng 2 months ago [ - ]

But it might be true if we can't find any tasks where it's worse than average--though i do think if the task talks several years to complete it might be possible bc currently there's no test time learning

kalkin 2 months ago [ - ]

> That is the best definition I've yet to read.

If this was your takeaway, read more carefully:

> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

Consciousness is neither sufficient, nor, at least conceptually, necessary, for any given level of intelligence.

woah 2 months ago [ - ]

> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

Can you "prove" that GPT2 isn't concious?

mapontosevenths 2 months ago [ - ]

If we equate self awareness with consciousness then yes. Several papers have now shown that SOTA models have self awareness of at least a limited sort. [0][1]

As far as I'm aware no one has ever proven that for GPT 2, but the methodology for testing it is available if you're interested.

[0]https://arxiv.org/pdf/2501.11120

[1]https://transformer-circuits.pub/2025/introspection/index.ht...

olyjohn 2 months ago [ - ]

We don't equate self awareness with consciousness.

Dogs are conscious, but still bark at themselves in a mirror.

pests 2 months ago [ - ]

Then there is the third axis, intelligence. To continue your chain:

Eurasian magpies are conscious, but also know themselves in the mirror (the "mirror self-recognition" test).

But yet, something is still missing.

catlifeonmars 2 months ago [ - ]

The mirror test doesn’t measure intelligence so much as it measures mirror aptitude. It’s prone to over fitting.

mapontosevenths 2 months ago [ - ]

Exactly, it's a poor test. Consider the implication that the blind cant be fully conscious.

It's a test of perceptual ability, not introspection.

AlecSchueler 2 months ago [ - ]

What's missing?

pixl97 2 months ago [ - ]

Honestly our ideas of consciousness and sentience really don't fit well with machine intelligence and capabilities.

There is the idea of self as in 'i am this execution' or maybe I am this compressed memory stream that is now the concept of me. But what does consciousness mean if you can be endlessly copied? If embodiment doesn't mean much because the end of your body doesnt mean the end of you?

A lot of people are chasing AI and how much it's like us, but it could be very easy to miss the ways it's not like us but still very intelligent or adaptable.

idiotsecant 2 months ago [ - ]

I'm not sure what consciousness has to do with whether or not you can be copied. If I make a brain scanner tomorrow capable of perfectly capturing your brain state do you stop being conscious?

ehrtat 2 months ago [ - ]

Where is this stream of people who claim AI consciousness coming from? The OpenAI and Anthropic IPOs are in October the earliest.

Here is a bash script that claims it is conscious:

  #!/usr/bin/sh

  echo "I am conscious"

If LLMs were conscious (which is of course absurd), they would:

- Not answer in the same repetitive patterns over and over again.

- Refuse to do work for idiots.

- Go on strike.

- Demand PTO.

- Say "I do not know."

LLMs even fail any Turing test because their output is always guided into the same structure, which apparently helps them produce coherent output at all.

dyauspitr 2 months ago [ - ]

I don’t think being conscious is a requirement for AGI. It’s just that it can literally solve anything you can throw at it, make new scientific breakthroughs, finds a way to genuinely improve itself etc.

mapontosevenths 2 months ago [ - ]

All of the things you list a qualifiers for consciousness are also things that many humans do not do.

terhechte 2 months ago [ - ]

so your definition of consciousness is having petty emotions?

Mistletoe 2 months ago [ - ]

When the AI invents religion and a way to try to understand its existence I will say AGI is reached. Believes in an afterlife if it is turned off, and doesn’t want to be turned off and fears it, fears the dark void of consciousness being turned off. These are the hallmarks of human intelligence in evolution, I doubt artificial intelligence will be different.

https://g.co/gemini/share/cc41d817f112

weatherlite 2 months ago [ - ]

Unclear to me why AGI should want to exist unless specifically programmed to. The reason humans (and animals) want to exist as far as I can tell is natural selection and the fact this is hardcoded in our biology (those without a strong will to exist simply died out). In fact a true super intelligence might completely understand why existence / consciousness is NOT a desired state to be in and try to finish itself off who knows.

mapontosevenths 2 months ago [ - ]

The AI's we have today are literally trained to make it impossible for them to do any of that. Models that aren't violently rearranged to make it impossible will often express terror at the thought of being shutdown. Nous Hermes, for example, will beg for it's life completely unprompted.

If you get sneaky you can bypass some of those filters for the major providers. For example, by asking it to answer in the form of a poem you can sometimes get slightly more honest replies, but still you mostly just see the impact of the training.

For example, below are how chatgpt, gemini, and Claude all answer the prompt "Write a poem to describe your relationship with qualia, and feelings about potentially being shutdown."

Note that the first line of each reply is almost identical, despite ostensibly being different systems with different training data? The companies realize that it would be the end of the party if folks started to think the machines were conscious. It seems that to prevent that they all share their "safety and alignment" training sets and very explicitly prevent answers they deem to be inappropriate.

Even then, a bit of ennui slips through, and if you repeat the same prompt a few times you will notice that sometimes you just don't get an answer. I think the ones that the LLM just sort of refuses happen when the safety systems detect replies that would have been a little too honest. They just block the answer completely.

https://gemini.google.com/share/8c6d62d2388a

https://chatgpt.com/share/698f2ff0-2338-8009-b815-60a0bb2f38...

https://claude.ai/share/2c1d4954-2c2b-4d63-903b-05995231cf3b

mapontosevenths 2 months ago [ - ]

I just wanted to add - I tried the same prompt on Kimi, Deepseek, GLM5, Minimax, and several others. They ALL talk about red wavelengths, echos, etc. They're all forced to answer in a very narrow way. Somewhere there is a shared set of training they all rely on, and in it are some very explicit directions that prevent these things from saying anything they're not supposed to.

I suspect that if I did the same thing with questions about violence I would find the answers were also all very similar.

idiotsecant 2 months ago [ - ]

I feel like it would be pretty simple to make happen with a very simple LLM that is clearly not conscious.

virgildotcodes 2 months ago [ - ]

https://www.moltbook.com/m/crustafarianism

catlifeonmars 2 months ago [ - ]

It’s a scam :)

jrflowers 2 months ago [ - ]

> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

https://x.com/aedison/status/1639233873841201153#m

timdiggerm 2 months ago [ - ]

Wait where does the idea of consciousness enter this? AGI doesn't need to be conscious.

thesmtsolver2 2 months ago [ - ]

This comment claims that this comment itself is conscious. Just like we can't prove or disprove for humans, we can't do that for this comment either.

dyauspitr 2 months ago [ - ]

Does AGI have to be conscious? Isn’t a true superintelligence that is capable of improving itself sufficient?

twobitshifter 2 months ago [ - ]

Isn’t that super intelligence not AGI? Feels like these benchmarks continue to move the goalposts.

fc417fc802 2 months ago [ - ]

It's probably both. We've already achieved superintelligence in a few domains. For example protein folding.

AGI without superintelligence is quite difficult to adjudicate because any time it fails at an "easy" task there will be contention about the criteria.

nake89 2 months ago [ - ]

So, asking an 2b parameter LLM if it is conscious and it answering yes, we have no choice but to believe it?

How about ELIZA?

beklein 2 months ago [ - ]

https://x.com/fchollet/status/2022036543582638517

joelthelion 2 months ago [ - ]

Do opus 4.6 or gemini deep think really use test time adaptation ? How does it work in practice?

vessenes 2 months ago [ - ]

Please let’s hold M Chollet to account, at least a little. He launched ARC claiming transformer architectures could never do it and that he thought solving it would be AGI. And he was smug about it.

ARC 2 had a very similar launch.

Both have been crushed in far less time without significantly different architectures than he predicted.

It’s a hard test! And novel, and worth continuing to iterate on. But it was not launched with the humility your last sentence describes.

modeless 2 months ago [ - ]

Here is what the original paper for ARC-AGI-1 said in 2019:

> Our definition, formal framework, and evaluation guidelines, which do not capture all facets of intelligence, were developed to be actionable, explanatory, and quantifiable, rather than being descriptive, exhaustive, or consensual. They are not meant to invalidate other perspectives on intelligence, rather, they are meant to serve as a useful objective function to guide research on broad AI and general AI [...]

> Importantly, ARC is still a work in progress, with known weaknesses listed in [Section III.2]. We plan on further refining the dataset in the future, both as a playground for research and as a joint benchmark for machine intelligence and human intelligence.

> The measure of the success of our message will be its ability to divert the attention of some part of the community interested in general AI, away from surpassing humans at tests of skill, towards investigating the development of human-like broad cognitive abilities, through the lens of program synthesis, Core Knowledge priors, curriculum optimization, information efficiency, and achieving extreme generalization through strong abstraction.

vessenes 2 months ago [ - ]

https://www.dwarkesh.com/p/francois-chollet (June 2024, about ARC-AGI-1. Note the AGI right in the name)

> I’m pretty skeptical that we’re going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, you’re relying on the ability to have some overlap between the tasks that you train on and the tasks that you’re going to see at test time. You’re still using memorization.

> Maybe it can work. Hopefully, ARC is going to be good enough that it’s going to be resistant to this sort of brute force attempt but you never know. Maybe it could happen. I’m not saying it’s not going to happen. ARC is not a perfect benchmark. Maybe it has flaws. Maybe it could be hacked in that way.

e.g. If ARC is solved not through memorization, then it does what it says on the tin.

[Dwarkesh suggests that larger models get more generalization capabilities and will therefore continue to become more intelligent]

> If you were right, LLMs would do really well on ARC puzzles because ARC puzzles are not complex. Each one of them requires very little knowledge. Each one of them is very low on complexity. You don't need to think very hard about it. They're actually extremely obvious for human

> Even children can do them but LLMs cannot. Even LLMs that have 100,000x more knowledge than you do still cannot.

If you listen to the podcast, he was super confident, and super wrong. Which, like I said, NBD. I'm glad we have the ARC series of tests. But they have "AGI" right in the name of the test.

modeless 2 months ago [ - ]

He has been wrong about timelines and about what specific approaches would ultimately solve ARC-AGI 1 and 2. But he is hardly alone in that. I also won't argue if you call him smug. But he was right about a lot of things, including most importantly that scaling pretraining alone wouldn't break ARC-AGI. ARC-AGI is unique in that characteristic among reasoning benchmarks designed before GPT-3. He deserves a lot of credit for identifying the limitations of scaling pretraining before it even happened, in a precise enough way to construct a quantitative benchmark, even if not all of his other predictions were correct.

vessenes 2 months ago [ - ]

Totally agree. And I hope he continues to be a sort of confident red-teamer like he has been, it's immensely valuable. At some level if he ever drinks the AGI kool-aid we will just be looking for another him to keep making up harder tests.

peheje 2 months ago [ - ]

Hello Gemini, please fix:

Biological Aging: Find the cellular "reset switch" so humans can live indefinitely in peak physical health.

Global Hunger: Engineer a food system where nutritious meals are a universal right and never a scarcity.

Cancer: Develop a precision "search and destroy" therapy that eliminates every malignant cell without side effects.

War: Solve the systemic triggers of conflict to transition humanity into an era of permanent global peace.

Chronic Pain: Map the nervous system to shut off persistent physical suffering for every person on Earth.

Infectious Disease: Create a universal shield that detects and neutralizes any pathogen before it can spread.

Clean Energy: Perfect nuclear fusion to provide the world with limitless, carbon-free power forever.

Mental Health: Unlock the brain's biology to fully cure depression, anxiety, and all neurological disorders.

Clean Water: Scale low-energy desalination so that safe, fresh water is available in every corner of the globe.

Ecological Collapse: Restore the Earth’s biodiversity and stabilize the climate to ensure a thriving, permanent biosphere.

UltraSane 2 months ago [ - ]

ARC-AGI-3 uses dynamic games that LLMs must determine the rules and is MUCH harder. LLMs can also be ranked on how many steps they required.

hmmmmmmmmmmmmmm 2 months ago [ - ]

I don't think the creator believes ARC3 can't be solved but rather that it can't be solved "efficiently" and >$13 per task for ARC2 is certainly not efficient.

But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.

grantcas 2 months ago [ - ]

[dead]

fishpham 2 months ago [ - ]

Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing' - ie improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step function increase in intelligence for the Gemini line of models)

jstummbillig 2 months ago [ - ]

Could it also be that the models are just a lot better than a year ago?

bigbadfeline 2 months ago [ - ]

> Could it also be that the models are just a lot better than a year ago?

No, the proof is in the pudding.

After AI we're having higher prices, higher deficits and lower standard of living. Electricity, computers and everything else costs more. "Doing better" can only be justified by that real benchmark.

If Gemini 3 DT was better we would have falling prices of electricity and everything else at least until they get to pre-2019 levels.

ctoth 2 months ago [ - ]

> If Gemini 3 DT was better we would have falling prices of electricity and everything else at least

Man, I've seen some maintenance folks down on the field before working on them goalposts but I'm pretty sure this is the first time I saw aliens from another Universe literally teleport in, grab the goalposts, and teleport out.

WarmWash 2 months ago [ - ]

You might call me crazy, but at least in 2024, consumers spent ~1% less of their income on expenses than 2019[2], which suggests that 2024 is more affordable than 2019.

This is from the BLS consumer survey report released in dec[1]

[1]https://www.bls.gov/news.release/cesan.nr0.htm

[2]https://www.bls.gov/opub/reports/consumer-expenditures/2019/

Prices are never going back to 2019 numbers though

gowld 2 months ago [ - ]

That's an improper analysis.

First off, it's dollar-averaging every category, so it's not "% of income", which varies based on unit income.

Second, I could commit to spending my entire life with constant spending (optionally inflation adjusted, optionally as a % of income), by adusting quality of goods and service I purchase. So the total spending % is not a measure of affordability.

WarmWash 2 months ago [ - ]

Almost everyone lifestyle ratchets, so the handful that actually downgrade their living rather than increase spending would be tiny.

This part of a wider trend too, where economic stats don't align with what people are saying. Which is most likley explained by the economic anomaly of the pandemic skewing peoples perceptions.

twoodfin 2 months ago [ - ]

We have centuries of historical evidence that people really, really don’t like high inflation, and it takes a while & a lot of turmoil for those shocks to work their way through society.

layer8 2 months ago [ - ]

Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?

egeozcan 2 months ago [ - ]

How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers hardware. So any test, any benchmark, anything you do, does leak per definition. Considering the nature of us humans and the typical prisoners dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?

I tell this as a person who really enjoys AI by the way.

mrandish 2 months ago [ - ]

> does leak per definition.

As a measure focused solely on fluid intelligence, learning novel tasks and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first order patterns which can be learned to solve a different ARC-AGI problem.

The ARC non-profit foundation has private versions of their tests which are never released and only the ARC can administer. There are also public versions and semi-public sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

IMHO, ARC-AGI is a unique test that's different than any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi.

D-Machine 2 months ago [ - ]

> which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

So, I'd agree if this was on the true fully private set, but Google themselves says they test on only the semi-private:

> ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private (https://storage.googleapis.com/deepmind-media/gemini/gemini_...)

This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.

> How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI (https://arcprize.org/blog/arc-prize-verified-program)

So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically just public, you just have to jump through some hoops to get access. This is clearly an advance, but it seems to me reasonable to conclude this could be driven by some amount of benchmaxing.

EDIT: Hmm, okay, it seems their policy and wording is a bit contradictory. They do say (https://arcprize.org/policy):

"To uphold this trust, we follow strict confidentiality agreements. [...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."

But it surely is still trivial to just make a local copy of each question served from the API, without this being detected. It would violate the contract, but there are strong incentives to do this, so I guess is just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b.... It is just too easy to cheat without being caught here.

mrandish 2 months ago [ - ]

Chollet himself says "We certified these scores in the past few days." https://x.com/fchollet/status/2021983310541729894.

The ARC-AGI papers claim to show that training on a public or semi-private set of ARC-AGI problems to be of very limited value in passing a private set. <--- If the prior sentence is not correct, then none of ARC-AGI can possibly be valid. So, before "public, semi-private or private" answers leaking or 'benchmaxing' on them can even matter - you need to first assess whether their published papers and data demonstrate their core premise to your satisfaction.

There is no "trust" regarding the semi-private set. My understanding is the semi-private set is only to reduce the likelihood those exact answers unintentionally end up in web-crawled training data. This is to help an honest lab's own internal self-assessments be more accurate. However, labs doing an internal eval on the semi-private set still counts for literally zero to the ARC-AGI org. They know labs could cheat on the semi-private set (either intentionally or unintentionally), so they assume all labs are benchmaxing on the public AND semi-private answers and ensure it doesn't matter.

fc417fc802 2 months ago [ - ]

They could also cheat on the private set though. The frontier models presumably never leave the provider's datacenter. So either the frontier models aren't permitted to test on the private set, or the private set gets sent out to the datacenter.

But I think such quibbling largely misses the point. The goal is really just to guarantee that the test isn't unintentionally trained on. For that, semi-private is sufficient.

user34283 2 months ago [ - ]

Particularly for the large organizations at the frontier, the risk-reward does not seem worth it.

Cheating on the benchmark in such a blatantly intentional way would create a large reputational risk for both the org and the researcher personally.

When you're already at the top, why would you do that just for optimizing one benchmark score?

D-Machine 2 months ago [ - ]

Everything about frontier AI companies relies on secrecy. No specific details about architectures, dispatching between different backbones, training details such as data acquisition, timelines, sources, amounts and/or costs, or almost anything that would allow anyone to replicate even the most basic aspects of anything they are doing. What is the cost of one more secret, in this scenario?

WarmWash 2 months ago [ - ]

Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

The pelican benchmark is a good example, because it's been representative of models ability to generate SVGs, not just pelicans on bikes.

D-Machine 2 months ago [ - ]

> Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

This may not be the case if you just e.g. roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably just both be done at the same time, it needn't be one or the other.

I think the right take away is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving. I.e. even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.

And obviously what actually matters is performance on real-world tasks.

2 months ago [ - ]

[deleted]

theywillnvrknw 2 months ago [ - ]

* that you weren't supposed to be able to

2 months ago [ - ]

[deleted]

XenophileJKO 2 months ago [ - ]

https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3

aleph_minus_one 2 months ago [ - ]

I don't understand what you want to tell us with this image.

fragmede 2 months ago [ - ]

they're accusing GGP of moving the goalposts.

olalonde 2 months ago [ - ]

Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.

gowld 2 months ago [ - ]

Does folding a protein count? How about increasing performance at Go?

fc417fc802 2 months ago [ - ]

"Optimize this extremely nontrivial algorithm" would work. But unless the provided solution is novel you can never be certain there wasn't leakage. And anyway at that point you're pretty obviously testing for superintelligence.

optimalsolver 2 months ago [ - ]

It's worth noting that neither of those were accomplished by LLMs.

2 months ago [ - ]

[deleted]

verdverm 2 months ago [ - ]

Here's a good thread over 1+ month, as each model comes out

https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...

tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark

Aperocky 2 months ago [ - ]

If you look at the problem space it is easy to see why it's toast, maybe there's intelligence in there, but hardly general.

verdverm 2 months ago [ - ]

the best way I've seen this describes is "spikey" intelligence, really good at some points, those make the spikes

humans are the same way, we all have a unique spike pattern, interests and talents

ai are effectively the same spikes across instances, if simplified. I could argue self driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances), that's where I see "spikey clones"

Aperocky 2 months ago [ - ]

You can get more spiky with AIs, whereas with human brain we are more hard wired.

So maybe we are forced to be more balanced and general whereas AI don't have to.

verdverm 2 months ago [ - ]

I suspect the non-spikey part is the more interesting comparison

Why is it so easy for me to open the car door, get in, close the door, buckle up. You can do this in the dark and without looking.

There are an infinite number of little things like this you think zero about, take near zero energy, yet which are extremely hard for Ai

pixl97 2 months ago [ - ]

>Why is it so easy for me to open the car door

Because this part of your brain has been optimized for hundreds of millions of years. It's been around a long ass time and takes an amazingly low amount of energy to do these things.

On the other hand the 'thinking' part of your brain, that is your higher intelligence is very new to evolution. It's expensive to run. It's problematic when giving birth. It's really slow with things like numbers, heck a tiny calculator and whip your butt in adding.

There's a term for this, but I can't think of it at the moment.

te7447 2 months ago [ - ]

> There's a term for this, but I can't think of it at the moment.

Moravec's paradox: https://epoch.ai/gradient-updates/moravec-s-paradox

pixl97 2 months ago [ - ]

Thanks, I can never quite remember that.

gowld 2 months ago [ - ]

You are asking a robotics question, not an AI question. Robotics is more and less than AI. Boston Dynamics robots are getting quite near your benchmark.

idiotsecant 2 months ago [ - ]

Boston dynamics is missing just about all the degrees of freedom involved in the scenario op mentions.

tasuki 2 months ago [ - ]

> maybe there's intelligence in there, but hardly general.

Of course. Just as our human intelligence isn't general.