Source? I haven't seen anything like that for ARC-AGI performance.
Also, if it makes that big of a difference, then build a renderer for your agent that mimics the web page, have it solve the tasks in the graphical interface, and funnel the results to the API. I guarantee you won't get better performance, because the AGI is going to have to "understand" that the raw data can be represented as a 2D matrix regardless of whether it gets a 2D matrix of pixels or a 2D matrix of enumerated values in JSON. If anything, that makes it a more difficult problem for an AI system that "speaks" in tokens.
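To make the equivalence concrete: here's a minimal sketch showing that a JSON array of enumerated cell values and a "pixel" rendering are just two views of the same 2D matrix. The payload shape and field name are assumptions for illustration, not the actual ARC-AGI-3 API format.

```python
import json

# Hypothetical ARC-style payload: a grid of enumerated cell values.
# (The "grid" field name is an assumption, not the real API schema.)
payload = '{"grid": [[0, 1, 0], [1, 1, 1], [0, 1, 0]]}'

grid = json.loads(payload)["grid"]

# Rendering the same matrix as "pixels" adds no information;
# it's the identical 2D structure in a different presentation.
for row in grid:
    print("".join("#" if cell else "." for cell in row))
```

Either way, the model has to recover the same underlying matrix before it can reason about it.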
That score is in the ARC-AGI-3 technical report [1]. It's the full benchmark score using this harness [2] (which is just open code with read, grep, and bash tools).
This is already a solved benchmark. That's why the scoring is so convoluted and why a self-proclaimed agent benchmark won't allow basic agent tools. ARC has always been a bit of a nothingburger of a benchmark, but this takes the cake.
[1] https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
[2] https://blog.alexisfox.dev/arcagi3
> For example, in a variant of environment TR87, Opus 4.6 scores 0.0% with no harness and 97.1% with the Duke harness (12), yet in environment BP35, Opus 4.6 scores 0.0% under both configurations
This is with a harness that was designed to tackle "a small set of public environments: ls20, ft09, and vc33" (from the ARC-AGI-3 challenge), yet it looks like it does not solve the full ARC-AGI-3 benchmark, just some of its environments.
The harness was designed against the preview, but no, it was still tested on the full public set in that environment. You can run the benchmark in different 'environments', though it's unclear what the difference between them is.
>We then tested the harnesses on the full public set (which researchers did not have access to at the time)
It may have been tested on the full set, but the score you quote is for a single game environment, not the full public set. That fact is stated verbatim in what you responded to and in what vbarrielle quoted. It scored 97% in one game and 0% in another. The full prelude to what vbarrielle quoted, the last sentence of which you left out, was:
> We then tested the harnesses on the full public set (which researchers did not have access to at the time). We found extreme bimodal performance across the two sets, controlling for the same frontier model...
The harness only transfers to like-environments and the intelligence for those specific games is baked into the harness by the humans who coded it for this specific challenge.
The point of ARC-AGI is to test the intelligence of AI systems in novel, but simple, environments. Having a human give it more powerful tools in a harness defeats the purpose. You should go back and read the original ARC-AGI paper to see what this is about+. Are you upset about the benchmark because frontier LLMs do so poorly at generalizing when the benchmarks are released?
+ https://arxiv.org/abs/1911.01547
> intelligence for those specific games is baked into the harness
This is your claim but the other commenter claims the harness consists only of generic tools. What's the reality?
I also ran into confusion about this exact issue in another subthread. I had thought that generic tooling was allowed, but others believed the benchmark to be limited to ingesting the raw text directly from the API, without access to any agent environment, however generic.
1) Pointing out what tools to use is part of the intelligence that LLMs aren't great at.
2) One of the tools is a pathfinding algorithm: a big improvement/crutch over a regular LLM, which has no such capability.
You'd think that if LLMs were intelligent they'd be able to determine that a pathfinding algorithm is necessary and have a sub-agent code it up real quick. But apparently they just can't do that without humans stepping in to make it a standard tool for them.
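For context on what such a tool amounts to: a grid pathfinder is a few lines of plain BFS. This is a generic sketch of the kind of capability being discussed, not the actual Duke harness tool, and the grid encoding (0 = free, 1 = wall) is an assumption.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a 2D grid (0 = free, 1 = wall).

    Returns the list of (row, col) cells from start to goal,
    or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}  # each visited cell -> its parent
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links backwards to reconstruct the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from):
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None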
Here's the paper on what they did for the Duke Harness:
https://blog.alexisfox.dev/arcagi3
>You'd think if LLMs are intelligent they'd be able to determine that a path finding algorithm is necessary and have a sub agent code it up real quick.
ARC-AGI-3 doesn't allow that, so.
>Here's the paper on what they did for the Duke Harness: https://blog.alexisfox.dev/arcagi3
Yeah, and the tools are general, not 'baked into the harness by the humans who coded it for this specific challenge.'
Adding a pathfinding algorithm and environment-transform tools to a supposed "AGI" sure does seem like cheating to me. The sad part is, it's a cheat that only works in environments where pathfinding is a major component. When it doesn't have those clues, it bombs on everything.
I guess you really want to love the current SOTA LLMs. It's a shame they're dumb af.
Have a great day.
>Adding a path finding algorithm and environment transform tools to a supposed "AGI", sure does seem like cheating to me.
You would need all of that if you, a human, wanted any chance of solving this benchmark in the format LLMs are given. The funny thing about this benchmark is that we don't even know how solvable it is, because the baseline is tested with radically different inputs.
>I guess you really want to love the current SOTA LLMs. It's a shame they're dumb af.
I guess you really don't want to think critically. Yeah good day lol.
Really tired of you making things up about this. The baseline and the entire benchmark evaluation are clearly defined, with a statistically sound number of participants for the baseline, using the same consistent deterministic environments for evaluation. The fact that you don't like where the "human performance" line was drawn, or how the scale is derived, is not the same as the benchmark being tested with "radically different inputs". Clearly you would rather hype AI than critically advance it, so I won't waste time with someone who is clearly not posting in good faith.
Byebye now.
Humans and LLMs are not seeing the benchmark in the same format. What's made up about that? Can you solve this in the JSON format?
Look man, don't reply if you don't want to.
>The point of ARC-AGI is to test the intelligence of AI systems in novel, but simple, environments.
The point is whatever Francois wants it to be.
>Having a human give it more powerful tools in a harness defeats the purpose.
Why does it defeat the purpose? Restricting the tools available is an arbitrary constraint. The Duke harness is a few basic tools. What's the problem? In what universe would any AI agent worth its salt not have access to read, grep, and bash? If his benchmark were as great, and the difference as wide, as he claimed, then it simply wouldn't matter whether those tools were available. Francois removed access to tools because his benchmark falls apart with them. Simple as.
>You should go back and read the original ARC-AGI paper to see what this is about+.
>Are you upset about the benchmark because frontier LLM models do so poorly exhibiting the ability to generalize when the benchmarks are released?
I’m not upset about anything. I do not care about ARC, and I never have. I think it is a nothingburger of a benchmark: lots of grand claims about AGI, but very little predictive power or practical utility.
When models started climbing FrontierMath, that benchmark actually told us something useful: their mathematical capabilities were becoming materially stronger. And now state-of-the-art systems have helped with real research and even contributed to solving open problems. That is what a good benchmark is supposed to do.
ARC? It has zero utility on its own and manages to tell you nothing at the same time.
Unsaturated benchmarks matter because they help show where the state of the art actually is. The value is not "look, the score is low," but whether the benchmark tells you something real and useful about capability. ARC has always struggled on that front, but version 3 has taken it to a new level of uselessness.