Both the new Sonnet and gpt-4o still fail at a simple question:
"How many w's are in strawberry?"
gpt-4o: There are 2 "w's" in "strawberry."
Claude 3.5 Sonnet (new): Let me count the w's in "strawberry": 0 w's.
(same question with 'r' succeeds)
What is artificial about the current generation of "artificial intelligence" is the way training (predict the next token) and benchmarking (overfitting to the benchmarks) are done. Perhaps a fresh approach is needed to achieve a true next step.
It's bad at working directly on classical computer problems like math and data processing. But you can do it indirectly by having it write a program that produces the correct result. Interestingly, I didn't even need to have it run the program, although usually you would.
Which produced:
I always thought the halting problem was an academic exercise, but here we see a potential practical use case. Actually, this seems pretty dangerous: letting the LLM write and automatically execute code. How good is the sandbox? Can I trick the LLM into writing a reverse shell and opening it up for me?
I'm not sure it's still the case, but I've had ChatGPT run shell commands. I don't know what you could do with that, though, since the environment is ephemeral and has neither internet access nor root. Plus I'm sure they have security scanning.
There's always that one tokenization error comment
There are models around that will answer this question perfectly.
But why would anyone care? If you want to count w's in a word, that's a Python one-liner, you don't need an LLM for that (although the LLM can write it for you).
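For reference, the one-liner in question is just Python's built-in `str.count`:

```python
# Count the w's in "strawberry" -- the task the models fumble.
print("strawberry".count("w"))  # → 1
```

No model, no tokens, no ambiguity.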
They are trained on tokens not characters.
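A sketch of why that matters (the segmentation and IDs below are made up for illustration; real tokenizers split differently):

```python
# Hypothetical BPE-style segmentation: the model consumes opaque
# token IDs, not individual letters.
tokens = ["str", "aw", "berry"]     # one plausible split of "strawberry"
token_ids = [3642, 675, 19772]      # made-up IDs -- what the model actually sees

# The 'w' is buried inside the "aw" token; no single input unit is 'w',
# which is one reason letter-counting is hard for token-based models.
# Counting over the underlying text, of course, still works:
print(sum(t.count("w") for t in tokens))  # → 1
```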
Can we stop with these useless strawberry examples?