>Last week, the ARC Prize team released an updated test, called ARC-AGI-2, and it appears to have sent the AIs back to the drawing board. The full o3 model has not yet been tested, but a version of o1 dropped from 32 percent on the original puzzles to just 3 percent on the new version, and a “mini” version of o3 currently available to the public dropped from roughly 30 percent to below 2 percent. (An OpenAI spokesperson declined to say whether the company plans to run the benchmark with o3.) Other flagship models from OpenAI, Anthropic, and Google have achieved roughly 1 percent, if not lower. Human testers average about 60 percent.
ARC-AGI is the main reason I don't trust static benchmarks.
If you don't have an essentially infinite set to draw your validation data from, then a large enough model will memorize it as part of its developer team's KPIs.
Forget all these fancy benchmarks. If you want to saturate any model today, give it a string and a grammar and ask it to generate the string from the grammar. I've had _every_ model fail this on regular grammars with strings longer than 4 characters.
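Roughly the shape of the test (a toy right-linear grammar standing in for the ones I actually used, plus a checker for the derivation the model hands back):

```python
# Toy sketch, not my exact test: a small regular grammar and a checker for
# a model-supplied derivation of a target string.
# Grammar: S -> aS | bA ; A -> bA | b
GRAMMAR = {
    "S": ["aS", "bA"],
    "A": ["bA", "b"],
}

def check_derivation(steps, target):
    """steps is a list of sentential forms, e.g. ['S', 'aS', 'abA', 'abb'].
    Each step must rewrite a nonterminal using one grammar rule, and the
    final step must equal the target string."""
    for before, after in zip(steps, steps[1:]):
        ok = False
        for i, sym in enumerate(before):
            if sym in GRAMMAR:
                for rhs in GRAMMAR[sym]:
                    if before[:i] + rhs + before[i + 1:] == after:
                        ok = True
        if not ok:
            return False
    return steps[0] == "S" and steps[-1] == target

print(check_derivation(["S", "aS", "abA", "abb"], "abb"))  # True
print(check_derivation(["S", "aS", "abb"], "abb"))         # False (skips a rule)
```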
LLMs are the solution to natural language, which is a huge deal. They aren't the solution to reasoning, which is still best handled by what used to be called symbolic AI before it started working, e.g. SAT solvers.
I tried my own test recently:
"Write a history of the Greek language but reverse it, so that one would need to read it from right to left and bottom to top."
ChatGPT wrote the history and showed absolutely no awareness, let alone "understanding," of the second half of the prompt.
I have a similar test for image gens. I try to get them to write reversed text in condensation on windows. The new GPT is the best so far; it can sorta, maybe, do it sometimes. Others will sometimes reverse the letter order, but not flip each character.
As much as I think AI is overhyped too, that is a prime use case that would be better solved by passing the text to a tool rather than jamming a complex transformation like that into its latent space.
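The text part at least is trivial once you hand it to code (mirroring the glyphs themselves is just an image flip after rendering):

```python
# Sketch of the tool route for the text part: reverse the character order in
# ordinary code; flipping the glyphs would then be a horizontal image flip
# after rendering, rather than something the model has to learn.
def reverse_text(text: str) -> str:
    return text[::-1]

print(reverse_text("Did the answer look like this?"))
# -> "?siht ekil kool rewsna eht diD"
```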
With o3-mini-high (just the last paragraph):
civilization Mycenaean the of practices religious and economic, administrative the into insights invaluable provides and B Linear as known script the in recorded was language Greek the of form attested earliest The
Oh, interesting, what do you get when you specify that the letters need to be reversed, too? (That was what I meant and the original prompt explicitly stated that requirement. I forgot to include it in the summary of my 'test' here.)
Try playing a game of Hangman with ChatGPT. It's hilarious.
It does surprisingly well!
Edit: scratch that, it thought there was a six-letter word starting with "trs" and then changed its mind to "tre" when I guessed "e." Hilarious.
Just copied your prompt and it handled it just fine.
?siht ekil kool rewsna eht diD
Edit: realized just now that my summary of the 'test' failed to specify the request fully: the letters need to be reversed, too. Maybe I'm just bad with AI tools, because I didn't even get a response that 'this like looked' (i.e. reversed the order of the words).
LLMs work with tokens, not letters. So that's not going to work.
It might work in an agent system where it can make and execute code to solve problems.
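You can see the mismatch directly by dumping token boundaries (sketch assumes the tiktoken package; the encoding name varies by model):

```python
# Each token the model sees is a multi-character chunk, so "reverse every
# letter" has no direct handle in its input/output vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for tok in enc.encode("Mycenaean civilization"):
    print(tok, repr(enc.decode([tok])))
```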
Show me the results of your symbolic AI on ARC 2.
ARC 2 is brand new, but neurosymbolic approaches have performed well on the original ARC, e.g. https://arxiv.org/abs/2411.02272
> best solved with what used to be called symbolic AI before it started working
Right, the current paradigm of requiring an LLM to do arbitrary-digit multiplication itself will not work, and we shouldn’t need it to. If your task is “do X” and it can be reliably accomplished with “write a Python program to do X,” that’s good enough as far as I’m concerned. It’s preferable, in fact.
Btw Chollet has said basically as much. He calls them “stored programs” I think.
I think he is onto something. The right atomic unit for approaching these problems is probably not the token, at least at first. Higher-level abstractions should be refined into specific components, similar to the concept of diffusion.
As soon as the companies behind these systems stop marketing them as do-anything machines, I will stop judging them on their ability to do everything.
The ChatGPT input field still says ‘Ask anything’, and that is what I shall do.
You can ask me anything. I don’t see that as a promise that I am infallible.
Pricing Schedule
__________________
Answers: $1
Thoughtful Answers: $5
Correct Answers: $50
Dumb Looks are Free
> that’s good enough as far as I’m concerned
But in that case, why an LLM? If we want question-answer machines to be reliable, they must have the necessary skills, which include "counting," just as a basic example.
The purpose of the LLM would be to translate natural language into computer language, not to do the calculation itself.
Most human ten-year-olds in school can add two large numbers together. If a connectionist network is supposed to model the human brain, it should be able to do that. Maybe LLMs can do a lot of things, but if they can't do that, then they're an incomplete model of the human brain.
If I were to guess, most (adult) humans could not add two 3-digit numbers together with 100% accuracy. Maybe 99%? Computers can already do 100%, so we should probably be trying to figure out how to use language to extract the numbers from stuff and send them off to computers to do the calculations. Especially because in the real world, most numbers that matter are not just two-digit additions.
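The computer side of that pipeline is tiny (toy sketch, made-up sentence):

```python
# Extract the numbers from free text and let exact integer arithmetic do the
# rest; the language model's job would only be the extraction/translation.
import re

def add_numbers_in(text: str) -> int:
    return sum(int(n.replace(",", "")) for n in re.findall(r"[\d,]*\d", text))

print(add_numbers_in("The invoice lists 1,250 widgets and 487 returns."))  # 1737
```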
Artificial neural nets are pretty far from brains. We don’t use them because they are like brains; we use them because they can approximate arbitrary functions given sufficient data. In other words, they work.
For what it’s worth, people are also pretty bad at math compared to calculators. We are slow and error prone. That’s ok.
What I was (poorly) trying to say is that I don’t care whether the neural net solves the problem itself, as long as it can outsource it to a calculator. People do the same thing. What is important is reliably accomplishing the goal.
Most human ten-year-olds can add two large numbers together with the aid of a scratchpad and a pen. You need tools other than a one-dimensional vector of text to do some of these things.
No LLM or other modern AI architecture I'm aware of is supposed to model the human brain. Even if they were, LLMs can add large numbers with the level of skill I'd expect from a 10-year-old:
----
What's 494547645908151+7640745309351279642?
ChatGPT said: The sum of 494,547,645,908,151 and 7,640,745,309,351,279,642 is:
7,641,239,857,997,187,793
----
(7,641,239,856,997,187,793 is the correct answer)
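For reference, Python's arbitrary-precision integers settle it exactly:

```python
# Exact check of the sum from the transcript above.
print(494547645908151 + 7640745309351279642)
# -> 7641239856997187793 (ChatGPT's answer above is off by 10**9:
#    857 vs 856 in the billions group)
```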
I tried it on gpt-4-turbo and it seems to give the right answer:
> Let's calculate: 494,547,645,908,151 + 7,640,745,309,351,279,642 = 7,641,239,856,997,187,793
> 494,547,645,908,151 + 7,640,745,309,351,279,642 = 7,641,239,856,997,187,793
> Answer: 7,641,239,856,997,187,793
> If you don't have an essentially infinite set to draw your validation data from, then a large enough model will memorize it as part of its developer team's KPIs.
Sounds like a use-case for property testing: https://en.wikipedia.org/wiki/Software_testing#Property_test...
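A sketch of what that could look like against a model (hypothesis is the usual Python library; ask_model is a placeholder for whatever API you actually call):

```python
# Property-style benchmark sketch: generate fresh instances every run and
# check a property of the answer, so there is nothing fixed to memorize.
from hypothesis import given, strategies as st

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # swap in a real LLM API call here

@given(st.integers(min_value=0, max_value=10**18),
       st.integers(min_value=0, max_value=10**18))
def test_addition_property(a, b):
    answer = ask_model(f"What is {a} + {b}? Reply with only the number.")
    assert int(answer.replace(",", "")) == a + b
```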
> I've had _every_ model fail this
That seems to be because LLMs can't reliably follow procedures (e.g. counting).
>> If you want to saturate any model today give it a string and a grammar and ask it to generate the string from the grammar.
I'm not sure I understand what that means - could you explain please?
It means applying specific rules about how text can be generated. For example, generating valid JSON reliably. Currently we use constrained decoding to accomplish this (e.g. the next token must be one of three valid options).
Now you can imagine giving an LLM arbitrary validity rules for generating text. I think that’s what they mean by “grammar”.
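A rough sketch of the usual trick, assuming you have the raw logits and some callback that says which token IDs are currently legal (real systems like llama.cpp grammars do this bookkeeping for you):

```python
# Constrained decoding sketch: before sampling each token, mask the logits so
# only tokens the grammar currently allows can be chosen.
import numpy as np

def constrained_step(logits: np.ndarray, allowed_ids: set[int]) -> int:
    masked = np.full_like(logits, -np.inf, dtype=float)
    for i in allowed_ids:
        masked[i] = logits[i]
    probs = np.exp(masked - masked.max())   # softmax over allowed tokens only
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```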
I'm not GP, but here goes:
LLMs operate on tokens, which are words or word fragments, so they have limited ability to work on a letter-by-letter basis. They can't reliably count the letters in a sentence, for example. "Give it a string and a grammar and ask it to generate the string from the grammar" can't be done by inference alone because of this: they would generate tokens that don't match the grammar.
But you can use a grammar-based sampler and it'll generate valid strings just fine. llama.cpp can easily do this if you provide an EBNF grammar specification.
It's not about the generation, it's about verification.
Changing my tests from the strings I was interested in to common words of four or more letters _did_ improve the ability of reasoning LLMs to get the right answer, at the cost of the context exploding to thousands of tokens.
Unfortunately I can't tell you by how much, because the couple of dozen tests I did after reading your post ate the $50 I keep in an account for these types of things.
The following question ate through 8k thinking tokens to get the right answer in Claude 3.7 Sonnet Extended:
---
Given the following grammar:
Is the following sentence valid: Rome Paris Rome end_path Rome London end_path end_company
---
Incidentally, it got the right answer no fewer than 4 times in the thinking token stream. I'd not seen this model act like this before.