You can never ask why a model did a certain thing, or what it was "thinking" when it said something - just like you can't ask a human which neurons were firing when they had a certain thought. The information just isn't available at that level.
You absolutely can have deep nuanced discussions with LLMs however, you just need to better understand their strengths and weaknesses.
A human won't respond with "Neuron 10-100 of the frontal cortex" (jokes aside) with deceptively convincing confidence.
The human will quite convincingly be able to construct a post-hoc reasoning on an action that may or may not be related at all to what was actually going through their head or the actual instinctual reasons that led to a decision.
Humans can accurately retell what their consciousness was doing, but they have no clue why their unconsciousness responded as it did.
LLM is just that unconsciousness part that humans have to post hoc explain like that, and lacks the conscious part that we humans actually can inspect in ourselves.
If the AI had some introspection part where it actually tracks its reasoning maybe it would be closer to conscious humans. Its too expensive to do that everywhere ofc, not even us humans tracks everything like that, just a tiny bit, but tracking that tiny bit is enough for so much error correction to happen.
"Humans can accurately retell what their consciousness was doing" is often not true, because of complex mechanisms. The feeling of shame alone can make it very hard for someone to accurately describe how the arrived at the wrong conclusion.
Plus it's an open question if this is even a thing. Does consciousness consist of constructing actions beforehand, or of construction justifications afterward?
Frankly, my opinion is that DNA is incredible at choose the most energy efficient/cheap option, and the cheaper option is definitely justifications afterward.
I feel strengthened by psychological experiments where people are shown fake events involving them, where they then "explain their (nonexistent) reasoning at the time".
Arguments for the idea that the human consciousness/soul is something that is emergent keep getting shouted down though. Even though if you take the extreme opposite: it's obviously wrong. Nobody has ever cut open a human skull (or anything else) and found a soul. So somehow it's constructed from very non-conscious components we don't understand, it's not "actually there" in a real sense.
Sufficiently constrained post-hoc justifications are indistinguishable from explanations. Consciousness tries to make things up, it learns that people notice this, it then begins trying to construct justifications that won't be predictably called out as false. Eventually it learns how its unconscious operates, and how to interrogate it, and its post-hoc justifications, at least in the common cases, become reliable.
>Consciousness tries to make things up, it learns that people notice this, it then begins trying to construct justifications that won't be predictably called out as false.
There's a logical "skip" between that and
>Eventually it learns how its unconscious operates, and how to interrogate it, and its post-hoc justifications, at least in the common cases, become reliable.
The brain constructs a narrative that won't be called out as false, one that provides social capital, makes one feel good about oneself, is consistent with all your other justifications, etc. It's only an assumption that this process would naturally converge on Truth, and considering it's massively-multiplayer chaos where brains coordinate their stories in complex ways, my assumption is that this would converge on *stability*, not truth.
Yep. It converges on truth unless there's a strong reward for lies because truth is easy. It's a neural network. It just reads off/probes the internal state because that's the cheapest way to model the unconscious. The justification won't necessarily be true, mind, in terms of the labels it puts, but it should mostly be true structurally- behaviorally predictive in the ordinary domain.
(Even if you are incentivized to lie and flatter yourself, it is still helpful to have access to the true signal internally, because that way you can know how to structure your lie to best avoid detection.)
>Eventually it learns how its unconscious operates
I mean, no we don't, both in a personal way and in a global scientific understanding.
What you're saying happens is a set of socially consistent and acceptable responses based upon general human knowledge at the time. The common cases aren't exactly reliable, it's that they are repeatable in the sense they cover what we expect, and tend to explode when the world is less predictable.
This is why the scientific method changed the world, because we started writing shit down, comparing notes, and striving for repeatability.
I think a better way of putting this is that humans think they can accurately re-tell what their consciousness was doing. Whether they actually can, or even if consciousness exists at all as a thing outside the perception of consciousness is a philosophical question currently beyond answering.
I wonder if monte carlo tree search could play a role in reasoning. I'm searching and it seems to come up in arxiv papers, so the idea is not dead. I'll look more into this after writing this comment..
> Humans can accurately retell what their consciousness was doing
Can they? How could we possibly know this is the case? People could simply post-hoc rationalize this to justify whatever decision they made.
That's exactly what the LLM seems to have done as well. The problem is that we want and even expect the A.I to be truthful.
Isn’t that part of what the think blocks are for? Yea, don’t inject them back into the context, but do log them for review of that train of thought… no?
You don't get access to the thinking traces. Might work with local models tho, but the current <thinking/> meta isn't particularly suited for this either, as it's a big blob of rambling surfaced by RL, with the "only" objective being that the thinking blob somehow leads to a better final answer. Something more detailed, using templates akin to oAI's harmony could work, provided there's also a step that teaches the models to reflect on the various thinking channels, and maybe surface bits and pieces to include in "skills" or "learnings".
That's true, but it does mean that the LLM itself actually does have access to those thinking traces and could therefore, at least in principle, answer what it was thinking. They're probably not trained to do that, though.
It depends. Up until recently the models were trained only to "think" on the last user message. So you'd send the message1, got back reply1 w/ think1 but you'd make the next iteration m1 - r1 - m2, and would get back reply2 w/ think2. You would not add the thinking1. That's how the models were trained, and that's how you were supposed to construct the conversation.
Now recently some things have changed, and you can add the thinking part (you get that encrypted from the closed API labs). But the model needs to have been trained for this to work. And doing it this way you'll burn through tokens faster, as the thinking parts are usually rather long.
You certainly can ask it what it was thinking, the problem is just that it's more likely to make up a plausible sounding fabrication than to say "I don't know" or "my reasoning is hidden for business reasons" (frontier models hide a lot of their chain of thought). Which is the fundamental problem with LLMs though, if the data doesn't exist or it's sparse they make things up.
Choosing plausible sounding fabrication over an admission of ignorance is not an uncommon modality among the human beings I interact with, so I'm not surprised this pattern is found in models trained on human interactions.
Totally fine. Then let's just not pretend these "AI"s are somehow better at it.
That's the whole problem with all of these discussions. It's whataboutism and "You're holding it wrong" allegations.
So you're saying I can absolutely have a deep, nuanced discussion with an LLM, as long as I don't ask how he arrived at his conclusions?
You can also have a deep nuanced discussion with a rubber duck as long as you don't ask any questions it needs to respond to.
Have you not seen all the posts with claims that AI lies about its reasoning when asked to explain how it arrived at the output?
I would instead ask the model to explain how X works, whether it achieves Y, and why we cannot do Z instead.
That is how you have a discussion with the AI.
You can have a nuanced discussion with an LLM. But LLMs also have failure modes where they start making up justifications. The two are not mutually exclusive.
>as long as I don't ask how he arrived at his conclusions?
So just the average US political discussion with a human then?
> You can never ask why a model did a certain thing
Of course you can! It might be following outdated docs or read something in legacy code and tried to follow that pattern and it'll tell you as much if you ask it in a way that actually gets you the reason instead of it thinking it needs to immediately fix the mistake.
Dude, these two things are not at all analogous:
1. Asking a model why it did a certain thing, and
2. Expecting a human to say which neuron fired in their response.
Even asking a human being why they did a certain thing is questionable. The research on choice blindness seems like a pretty definitive debunking of post-hoc rationalization:
https://en.wikipedia.org/wiki/Introspection_illusion#Choice_...
I'm not sure what point you're trying to make. In science and engineering, being able to provide justification is a core skill. The comparison we should be making is against the human practitioners who are trained in their fields. There will always be a distribution of ability. Saying that there's evidence that people are capable of providing post-hoc rationalization doesn't say anything about the ability of experts to produce well thought out responses (in their respective fields) that don't immediately fall apart under scrutiny.
Structured thinking and deliberation are indeed important, but you can also make LLMs do structured "thinking" if you work hard enough, and generate quite plausible reasoned arguments with valid real-world results, and you can get them to write down their working as they go. But as research has shown, it's not "true" thinking, just pattern matching at a higher level, and eventually runs out of steam.[0]
But you only have to drill down a couple more layers and you are back in the void again; do you have any proof that your own thinking, no matter how structured and accurate, is anything other than pattern-matching at a sufficiently much higher level at which you are incapable of seeing it as such?
I think we will be finding some very interesting things out soon using the combination of LLMs and theorem provers, as demonstrated by Terence Tao's recent work.[1]
A cheetah is not a motorbike is not an aircraft is not a rocket.
[0] https://arxiv.org/abs/2506.06941
[1] https://arxiv.org/abs/2603.12744