>* For years, despite functional evidence and scientific hints accumulating, certain AI researchers continued to claim LLMs were stochastic parrots: probabilistic machines that would: 1. NOT have any representation about the meaning of the prompt. 2. NOT have any representation about what they were going to say. In 2025 finally almost everybody stopped saying so.
Man, Antirez and I walk in very different circles! I still feel like LLMs fall over backwards once you give them an 'unusual' or 'rare' task that isn't likely to be present in the training data.
LLMs certainly struggle with tasks that require knowledge that was not provided to them (at sufficient volume/variance to retain it). But that is to be expected of any intelligent agent; it is certainly true of humans. It is not a good argument for the claim that they are Chinese Rooms (unthinking imitators). Indeed, the whole point of the Chinese Room thought experiment was to ask whether that distinction even matters.
When it comes to being able to do novel tasks on known knowledge, they seem to be quite good. One also needs to consider that problem-solving patterns are themselves a kind of (meta-)knowledge that needs to be taught, either through imitation/memorisation (Supervised Learning) or through practice (Reinforcement Learning). New patterns can be logically derived from known ones to an extent, just like new knowledge can be derived from known knowledge in general, and again LLMs seem to be pretty decent at this, but only up to a point. Regardless, all of this is definitely true of humans too.
In most cases, LLMs have the knowledge (data). They just can't generalize it like humans do. They can only reflect explicit things that are already there.
I don't think that's true. Consider that the "reasoning" behaviour of the latest generation of "thinking" LLMs is trained with Reinforcement Learning on quite narrow datasets of olympiad math/programming problems and various science exams, since exact, unambiguous answers are needed for a good reward signal, and you want to exercise it on problems that require non-trivial logical derivation or calculation. Yet this reasoning behaviour generalises very effectively to a myriad of contexts the user asks about that have nothing to do with that training data. That's just one recent example.
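To make that concrete, here's a minimal sketch (my own illustration, not from any specific training pipeline) of the kind of exact-match, verifiable reward such RL training relies on; the `\boxed{...}` answer convention and the `extract_final_answer` helper are just assumptions for the example:

```python
# Minimal sketch of a verifiable, exact-match reward signal for RL on math problems.
# Assumes the model is prompted to put its final answer inside \boxed{...};
# nothing here comes from a specific library or paper implementation.
import re


def extract_final_answer(completion: str) -> str | None:
    """Pull the contents of the last \\boxed{...} in the model's output, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def exact_match_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 only when the final answer matches the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == gold_answer.strip() else 0.0


# All-or-nothing: this is why unambiguous, exactly-checkable answers are needed.
print(exact_match_reward(r"... therefore the result is \boxed{42}", "42"))  # 1.0
print(exact_match_reward(r"... I think it is roughly 41 or 42", "42"))      # 0.0
```

The point being: the reward only works where answers can be checked exactly, yet the reasoning habits it instils carry over to open-ended questions far outside that training data.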
More generally, I routinely use LLMs on queries that definitely no-one has written about. Are there similar texts out there that the LLM can stitch together to get the answer by analogy? Sure, to a degree, but at what point are we gonna start calling that intelligent? If that's not generalisation, I'm not sure what is.
To what degree can you claim, as a human, that you are not just imitating knowledge patterns or problem-solving patterns, abstract or concrete, that you (or your ancestors) have seen before? Either via general observation or through intentional trial-and-error. It may be a conscious or unconscious process; many such patterns get baked into what we call intuition.
Are LLMs as good as humans at this? No, of course not, though sometimes they get close. But that's a question of degree; it's no argument for claiming that they are somehow qualitatively lesser.
"In 2025 finally almost everybody stopped saying so."
I haven't.
Some people are slower to understand things.
Well exactly ;)
I don’t think this is quite true.
I’ve seen them do fine on tasks that are clearly not in the training data, and it seems to me that they struggle when a particular type of task, solution, or approach might be something they haven’t been exposed to, rather than the exact task itself.
In the context of the paragraph you quoted, that’s an important distinction.
It seems quite clear to me that they are getting at the meaning of the prompt and are able, at least somewhat, to generalise and connect aspects of their training to “plan” and output a meaningful response.
This certainly doesn’t seem all that deep (at times it's frustratingly shallow), and I can see how at first glance it might look like everything was just regurgitated training data, but my repeated experience (especially over the last ~6-9 months) is that there’s something more than that happening, which feels like what Antirez was getting at.
Give me an example of one of those rare or unusual tasks.
I work on a few HPC systems with unusual, kinda custom-rolled architectures. A whole bunch of Python and R packages fail to compile on these systems. There's no publicly accessible documentation for these HPC systems, nor for these custom architectures. ChatGPT and Claude so far have given me only wrong advice on how to get around these compilation errors and there's not much on Google for these errors, but HPC staff usually knew what to do.
Set the font size of a simple field in OpenXML. Doesn't even seem that rare. It said to add a run inside and set the font there. That didn't do anything. I ended up reverse-engineering the output out of MS Word. This happened yesterday.