> intelligence for those specific games is baked into the harness
This is your claim but the other commenter claims the harness consists only of generic tools. What's the reality?
I also encountered confusion about this exact issue in another subthread. I had thought that generic tooling was allowed, but others believed the benchmark was limited to ingesting the raw text directly from the API, without access to any agent environment, however generic it might be.
1) Pointing out what tools to use is part of the intelligence that LLMs aren't great at.
2) One of the tools is a pathfinding algorithm: a big improvement (or crutch) over a regular LLM, which has no such capability.
You'd think if LLMs are intelligent they'd be able to determine that a path finding algorithm is necessary and have a sub agent code it up real quick. But apparently they just can't do that without humans stepping in to make it a standard tool for them.
Here's the paper on what they did for the Duke Harness:
https://blog.alexisfox.dev/arcagi3
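For context on what a "pathfinding tool" in such a harness might look like: below is a minimal sketch, assuming a 2D grid where 0 means free and 1 means wall, with 4-way movement. This is my own illustration of the kind of generic tool being argued about, not code from the linked harness.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Return a shortest path from start to goal as a list of (row, col)
    tuples, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}  # maps each visited cell to its predecessor
    while queue:
        cur = queue.popleft()
        if cur == goal:
            # Walk back through predecessors to rebuild the path.
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from):
                came_from[(nr, nc)] = cur
                queue.append((nr, nc))
    return None  # goal unreachable

# Example: a 2x2 grid with one wall at (1, 0).
grid = [[0, 0],
        [1, 0]]
print(bfs_path(grid, (0, 0), (1, 1)))  # [(0, 0), (0, 1), (1, 1)]
```

Whether exposing something this generic counts as "baked-in intelligence" is exactly what the thread is disputing.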
>You'd think if LLMs are intelligent they'd be able to determine that a path finding algorithm is necessary and have a sub agent code it up real quick.
ARC 3 doesn't allow that, so that point is moot.
>Here's the paper on what they did for the Duke Harness: https://blog.alexisfox.dev/arcagi3
Yeah, and the tools are general, not 'baked into the harness by the humans who coded it for this specific challenge.'
Adding a pathfinding algorithm and environment-transform tools to a supposed "AGI" sure does seem like cheating to me. The sad part is, it's a cheat that only works in environments where pathfinding is a major component; when it doesn't have those clues, it bombs on everything.
I guess you really want to love the current SOTA LLMs. It's a shame they're dumb af.
Have a great day.
>Adding a path finding algorithm and environment transform tools to a supposed "AGI", sure does seem like cheating to me.
You would need all that too if you, a human, wanted any chance of solving this benchmark in the format LLMs are given. The funny thing about this benchmark is that we don't even know how solvable it is, because the human baseline is tested with radically different inputs.
>I guess you really want to love the current SOTA LLMs. It's a shame they're dumb af.
I guess you really don't want to think critically. Yeah good day lol.
Really tired of you making things up about this. The baseline and the entire benchmark evaluation are clearly defined, with a statistically sound number of participants for the baseline, all using the same consistent, deterministic environments. The fact that you don't like where the "human performance" line was drawn, or how the scale is derived, is not the same as the benchmark being tested with "radically different inputs". Clearly you would rather hype AI than critically advance it, so I won't waste time with someone who is clearly not posting in good faith.
Byebye now.
Humans and LLMs are not seeing the benchmark in the same format. What's made up about that? Can you solve this in the JSON format?
Look man, don't reply if you don't want to.