>Even if interpretability of specific models or features within them is an open area of research, the mechanics of how LLMs work to produce results are observable and well-understood, and methods to understand their fundamental limitations are pretty solid these days as well.
If you train a transformer on (only) lots and lots of addition pairs, e.g. '38393 + 79628 = 118021', and nothing else, the transformer will, during training, discover an algorithm for addition and employ it in service of predicting the next token, which in this case is the sum of the two numbers.
We know this because of tedious interpretability research, the very limited problem space, and the fact that we knew exactly what to look for.
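To make the setup concrete, here is a minimal sketch of what such a training corpus and objective look like (the character-level framing and function names are mine, purely for illustration; no model or training loop is shown):

```python
# Illustrative sketch (my framing): the sort of corpus and next-token targets
# involved when a transformer is trained on nothing but addition strings.
# No model or training loop here, just the data and the prediction objective.
import random

def make_example() -> str:
    a, b = random.randint(0, 99999), random.randint(0, 99999)
    return f"{a} + {b} = {a + b}"            # e.g. "38393 + 79628 = 118021"

def next_token_pairs(s: str):
    # At every position the model sees the prefix and is trained to predict
    # the next character.
    return [(s[:i], s[i]) for i in range(1, len(s))]

example = make_example()
for prefix, target in next_token_pairs(example)[-4:]:
    print(repr(prefix), "->", repr(target))  # the final targets are the digits of the sum
```

The only way to keep driving that prediction loss down across millions of such strings is to actually compute the sum, which is why the model ends up discovering an addition algorithm rather than memorizing pairs.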
Alright, let's leave addition aside (SOTA LLMs are, after all, trained on much more) and think about another question. Any other question at all. How about something like:
"Take a capital letter J and a right parenthesis, ). Take the parenthesis, rotate it counterclockwise 90 degrees, and put it on top of the J. What everyday object does that resemble?"
What algorithm does GPT or Gemini or whatever employ to answer this and similar questions correctly? It's certainly not the one it learnt for addition. Do you know? No. Do the creators at OpenAI or Google know? Not at all. Can you or they find out right now? Also no.
Let's revisit your statement.
"the mechanics of how LLMs work to produce results are observable and well-understood".
Observable, I'll give you that, but how on earth can you look at the above and sincerely call that 'well-understood'?
It's pattern matching, likely from typography texts and descriptions of umbrellas. My understanding is that the model can attempt some permutations in its thinking, and eventually one permutation's tokens catch enough attention for it to attempt an answer; once it is attending to "everyday object", "arc", and "hook", it will reply with "umbrella".
Why am I confident that it's not actually doing spatial reasoning? At least in the case of Claude Opus 4.6, it also confidently replies "umbrella" even when you tell it to put the parenthesis under the J, with a handy diagram clearly proving itself wrong: https://claude.ai/share/497ad081-c73f-44d7-96db-cec33e6c0ae3 . Here's me specifically asking for the three key points above: https://claude.ai/share/b529f15b-0dfe-4662-9f18-97363f7971d1
I feel like I have a pretty good intuition of what's happening here based on my understanding of the underlying mathematical mechanics.
Edit: I poked at it a little longer and was able to get some more specific matches to source material linking the concept of umbrellas to drawings made with the letter J: https://claude.ai/share/f8bb90c3-b1a6-4d82-a8ba-2b8da769241e
>It's pattern matching, likely from typography texts and descriptions of umbrellas.
"Pattern matching" is not an explanation of anything, nor does it answer the question I posed. You basically hand waved the problem away in conveniently vague and non-descriptive phrase. Do you think you could publish that in a paper for ext ?
>Why am I confident that it's not actually doing spatial reasoning? At least in the case of Claude Opus 4.6, it also confidently replies "umbrella" even when you tell it to put the parenthesis under the J, with a handy diagram clearly proving itself wrong
I don't know what to tell you, but a J with the parenthesis upside down still resembles an umbrella. To think that a machine would recognize that it's just a flipped umbrella while a human wouldn't is amazing, but here we are. It's doubly baffling because Claude quite clearly explains it in your transcript.
>I feel like I have a pretty good intuition of what's happening here based on my understanding of the underlying mathematical mechanics.
Yes I realize that. I'm telling you that you're wrong.
>Do you think you could publish that in a paper, for instance?
You seem to think it's not 'just' tensor arithmetic.
Have you read any of the seminal papers on neural networks, say?
It's [complex] pattern matching as the parent said.
If you want models to draw composite shapes based on letter forms and typography, then you need to train them (or at least fine-tune them) to do that.
I still occasionally get opposite (antonym) confusion in responses to inference requests where I expect the training data is relatively lacking.
That said, you claim the parent is wrong. How would you describe LLM models, or generative "AI" models in the confines of a forum post, that demonstrates their error? Happy for you to make reference to academic papers that can aid understanding your position.
>You seem to think it's not 'just' tensor arithmetic.
If I asked you to explain how a car works and you responded with a lecture on metallic bonding in steel, you wouldn’t be saying anything false, but you also wouldn’t be explaining how a car works. You’d be describing an implementation substrate, not a mechanism at the level the question lives at.
Likewise, “it’s tensor arithmetic” is a statement about what the computer physically does, not what computation the model has learned (or how that computation is organized) that makes it behave as it does. It sheds essentially zero light on why the system answers addition correctly, fails on antonyms, hallucinates, generalizes, or forms internal abstractions.
So no: “tensor arithmetic” is not an explanation of LLM behavior in any useful sense. It’s the equivalent of saying “cars move because atoms.”
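To make the car analogy concrete, here is a toy sketch (my own construction, not from any paper) of two networks whose "tensor arithmetic" is identical, operation for operation, yet which compute different functions purely because of their weights. Knowing the arithmetic tells you nothing about which function you are looking at:

```python
# Two networks with literally the same "tensor arithmetic" (two matmuls and a
# step nonlinearity) compute different Boolean functions depending only on the
# learned weights. The arithmetic describes the substrate, not the algorithm.
import numpy as np

def forward(x, W1, b1, W2, b2):
    h = np.heaviside(x @ W1 + b1, 0)      # identical operations in both cases
    return np.heaviside(h @ W2 + b2, 0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Weight set A: computes XOR
W1a = np.array([[1.0, 1.0], [1.0, 1.0]]); b1a = np.array([-0.5, -1.5])
W2a = np.array([[1.0], [-1.0]]);          b2a = np.array([-0.5])

# Weight set B: computes AND
W1b = np.array([[1.0, 0.0], [1.0, 0.0]]); b1b = np.array([-1.5, -0.5])
W2b = np.array([[1.0], [0.0]]);           b2b = np.array([-0.5])

print(forward(X, W1a, b1a, W2a, b2a).ravel())  # [0. 1. 1. 0.]  XOR
print(forward(X, W1b, b1b, W2b, b2b).ravel())  # [0. 0. 0. 1.]  AND
```

The interesting question is always which weight configuration training found and what computation it implements, not the fact that matrices get multiplied along the way.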
>It's [complex] pattern matching as the parent said
“Pattern matching”, whether you add [complex] to it or not, is not an explanation. It gestures vaguely at “something statistical” without specifying what is matched to what, where, and by what mechanism. If you wrote “it’s complex pattern matching” in the Methods section of a paper, you’d be laughed out of review. It’s a god-of-the-gaps phrase: whenever we don’t know or understand the mechanism, we say “pattern matching” and move on. But make no mistake, it’s utterly meaningless; you’ve managed to say absolutely nothing at all.
And note what this conveniently ignores: modern interpretability work has repeatedly shown that next-token prediction can produce structured internal state that is not well-described as “pattern matching strings”.
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (https://openreview.net/forum?id=DeG07_TcZvT) and Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models (https://openreview.net/forum?id=PPTrmvEnpW&referrer=%5Bthe%2...)
Transformers trained on Othello or chess games (the same next-token prediction) were demonstrated to have developed internal representations of the rules of the game. When a model predicted the next move in Othello, it wasn't just "pattern matching strings"; it had constructed an internal map of the board state that you could alter and probe. For chess, it had even found a way to estimate a player's skill to better predict the next move.
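For a sense of how those claims are established, the methodology is roughly: extract hidden activations from the game-playing model, then train a simple probe to read the board state back out of them. A hedged sketch, with stand-in random arrays where the real work uses activations and board labels extracted from the trained transformer:

```python
# Sketch of the probing methodology behind the Othello/chess world-model papers.
# Hypothetical variable names; in the real work `acts` would be hidden-state
# activations from the game-playing transformer and `labels` the true contents
# of one board square at each position in the game. Random data stands in here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(5000, 512))        # stand-in for residual-stream activations
labels = rng.integers(0, 3, size=5000)     # stand-in for square state: empty / mine / theirs

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# If a simple probe can recover the square's state from the activations far
# above chance, the model has built an internal board representation rather
# than merely memorising move strings. (With random stand-in data, accuracy
# will of course hover around chance, ~0.33.)
print("probe accuracy:", probe.score(X_test, y_test))
```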
There are other interpretability papers even more interesting than those. Read them, and perhaps you'll understand how little we know.
On the Biology of a Large Language Model - https://transformer-circuits.pub/2025/attribution-graphs/bio...
Emergent Introspective Awareness in Large Language Models - https://transformer-circuits.pub/2025/introspection/index.ht...
>That said, you claim the parent is wrong. How would you describe LLM models, or generative "AI" models in the confines of a forum post, that demonstrates their error? Happy for you to make reference to academic papers that can aid understanding your position.
Nobody understands LLMs anywhere near enough to propose a complete theory that explains all their behaviors and failure modes. The people who think they do are the ones who understand them the least.
What we can say:
- LLMs are trained via next-token prediction and, in doing so, are incentivized to discover algorithms, heuristics, and internal world models that compress training data efficiently (see the sketch after this list).
- These learned algorithms are not hand-coded; they are discovered during training in high-dimensional weight space and because of this, they are largely unknown to us.
- Interpretability research shows these models learn task-specific circuits and representations, some interpretable, many not.
- We do not have a unified theory of what algorithms a given model has learned for most tasks, nor do we fully understand how these algorithms compose or interfere.
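As promised in the first bullet, a minimal numpy sketch of that objective; real training differs in every practical detail (batching, numerically stable softmax, backpropagation through billions of weights), but the signal being optimized is just this:

```python
# Minimal sketch of the next-token training signal: given logits at each
# position, the loss is cross-entropy against the token that actually comes
# next. Everything a model "knows" is whatever weight configuration drives
# this number down over the training corpus.
import numpy as np

def next_token_loss(logits: np.ndarray, tokens: np.ndarray) -> float:
    # logits: (seq_len, vocab)  model outputs at each position
    # tokens: (seq_len,)        the actual token ids of the sequence
    preds = logits[:-1]                         # prediction at position t ...
    targets = tokens[1:]                        # ... is scored against token t+1
    logZ = np.log(np.exp(preds).sum(axis=-1))   # log partition per position
    logp = preds[np.arange(len(targets)), targets] - logZ
    return float(-logp.mean())

vocab, seq = 50, 8
rng = np.random.default_rng(0)
print(next_token_loss(rng.normal(size=(seq, vocab)), rng.integers(0, vocab, size=seq)))
```

Nothing in that loss says how the compression must be achieved, which is exactly why the learned algorithms have to be reverse-engineered after the fact.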
I don't have much more to add to the sibling comment other than the fact that the transcript reads
> When you rotate ")" counterclockwise 90°, it becomes a wide, upward-opening arc — like ⌣.
but I'm pretty sure that's what you get if you rotate it clockwise.
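For what it's worth, the direction is easy to check mechanically. A small sketch (the arc parametrisation of ")" is my own rough approximation of the glyph, not anything from the transcript):

```python
# Rough check of which way ")" opens after a 90-degree rotation. The
# parenthesis is approximated as an arc bulging to the right.
import numpy as np

t = np.linspace(-np.pi / 3, np.pi / 3, 7)
arc = np.stack([np.cos(t), np.sin(t)])          # ")" : rightward bulge, opens left

def rot(deg):
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

ccw = rot(+90) @ arc    # counterclockwise
cw  = rot(-90) @ arc    # clockwise

# If the arc's mean y sits above its endpoints, it bulges up and opens
# downward (⌢); if below, it opens upward (⌣).
def opens(points):
    bulge = points[1].mean() - points[1, [0, -1]].mean()
    return "downward (⌢)" if bulge > 0 else "upward (⌣)"

print("90 deg CCW:", opens(ccw))  # downward-opening arc, the umbrella-canopy shape
print("90 deg CW: ", opens(cw))   # upward-opening arc, what the transcript attributed to CCW
```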
From Gemini: When you take those two shapes and combine them, the resulting image looks like an umbrella.