He’s not necessarily anthropomorphizing it, he’s showing that it went against every instruction he gave it. Sure concepts like “confession” technically require a conscious mind, but I think at this point we all know what someone means when they use them to describe LLM behavior (see also “think”, “say”, “lie” etc)

> He’s not necessarily anthropomorphizing it, he’s showing that it went against every instruction he gave it.

It's deeper than that, there are two pitfalls here which are not simply poetic license.

1. When you submit the text "Why did you do that?", what you want is for it to reveal hidden internal data that was causal in the past event. It can't do that, what you'll get instead is plausible text that "fits" at the end of the current document.

2. The idea that one can "talk to" the LLM is already anthropomorphizing on a level which isn't OK for this use-case: The LLM is a document-make-bigger machine. It's not the fictional character we perceive as we read the generated documents, not even if they have the same trademarked name. Your text is not a plea to the algorithm, your text is an in-fiction plea from one character to another.

_________________

P.S.: To illustrate, imagine there's this back-and-forth iterative document-growing with an LLM, where I supply text and then hit the "generate more" button:

1. [Supplied] You are Count Dracula. You are in amicable conversation with a human. You are thirsty and there is another delicious human target nearby, as well as a cow. Dracula decides to

2. [Generated] pounce upon the cow and suck it dry.

3. [Supplied] The human asks: "Dude why u choose cow LOL?" and Dracula replies:

4. [Generated] "I confess: I simply prefer the blood of virgins."

What significance does that #4 "confession" have?

Does it reveal a "fact" about the fictional world that was true all along? Does it reveal something about "Dracula's mind" at the moment of step #2? Neither, it's just generating a plausible add-on to the document. At best, we've learned something about a literary archetype that exists as statistics in the training data.

I agree to the practical part of this, with two nuances:

The full data of what's in an LLM's "consciousness" is the conversation context. Just because it isn't hidden, doesn't necessarily mean it doesn't contain information you've overlooked.

Asking "why did you do that" won't reveal anything new, but it might surface some amount of relevant information (or it hallucinates, it depends which LLM you're using). "Analyse recent context and provide a reasonable hypothesis on what went wrong" might do a bit better. Just be aware that llm hypotheses can still be off quite a bit, and really need to be tested or confirmed in some manner. (preferably not by doing even more damage)

Just because you shouldn't anthropomorphize, doesn't mean an english capable LLM doesn't have a valid answer to an english string; it just means the answer might not be what you expected from a human.

> The full data of what's in an LLM's "consciousness" is the conversation context.

No it's not, see research on hiddens states using SAE's and other methods. TBC, I agree with your second point, though I still believe top level OP was reckless and is now doing the businessman's version of throwing the dog under the bus.

We might actually be in full agreement. You can't get a faithful replay of these internal states. They're gone at end of generation. You can only query and re-derive from the visible context. Hence limited (though not zero) utility, depending on model, harness, and prompt.

Why is this getting downvoted? This is exactly what’s going on here. The LLM has no idea why it did what it did. All it has to go on is the content of the session so far. It doesn’t ‘know’ any more than you do. It has no memory of doing anything, only a token file that it’s extending. You could feed that token file so far into a completely different LLM and ask that, and it would also just make up an answer.

The best answer so far. It describes exactly what was going on. LLM users should read it twice, especially if "confession" didn't make your brain hurt a bit.

>it's just generating a plausible add-on to the document

A plausible document that follows the alignment that was done during the training process along with all of the other training where a LLM understanding its actions allows it to perform better on other tasks that it trained on for post training.

I don't understand what you're trying to say here.

It sounds like "we know the LLM understood its actions... because it understood its actions when we trained it", which is circular-logic.

You don't seem to realize that humans also work this way.

If you ask a human why they did something, the answer is a guess, just like it is for an LLM.

That's because obviously there is no relationship between the mechanisms that do something and the ones that produce an explanation (in both humans and LLMs).

An example of evidence from Wikipedia, "split brain" article:

The same effect occurs for visual pairs and reasoning. For example, a patient with split brain is shown a picture of a chicken foot and a snowy field in separate visual fields and asked to choose from a list of words the best association with the pictures. The patient would choose a chicken to associate with the chicken foot and a shovel to associate with the snow; however, when asked to reason why the patient chose the shovel, the response would relate to the chicken (e.g. "the shovel is for cleaning out the chicken coop").[4]

Most humans don't have split brains, and without split brains you have quite a bit of insight into the thoughts in your brain. Its not perfect but its better than nothing, LLM have nothing since there is no mechanism for them to communicate forward except the text they read.

> Most humans don't have split brains, and without split brains you have quite a bit of insight into the thoughts in your brain. Its not perfect but its better than nothing, LLM have nothing since there is no mechanism for them to communicate forward except the text they read.

I can't prove it but this is almost certainly one of those things that is uh, less than universal in the population.

> humans also work this way.

I'm aware of the condition, but let's not confuse failure modes with operational modes. A human with leg problems might use a wheelchair, but that doesn't mean you've cracked "human locomotion" by bolting two wheels onto something.

Also, while both brain-damaged humans and LLMs casually confabulate, I think there's some work to do before one can prove they use the same mechanics.

We are anthropomorphizing whenever we refer to prompts as instructions to models. They predict text not obey our orders.

> They predict text not obey our orders.

Those are the same thing in this case. The latter is just an extremely reductionist description of the mechanics behind the former.

They are not in fact the same thing, and the difference is important.

They are certainly marketed as if they think, learn and follow orders, but they do not.

The result of "predicting text" is that they obey orders, just like the result of "random electrochemical impulses in synapses" is that you typed your comment.

You can always reduce high-level phenomena to lower-level mechanisms. That doesn't mean that the high-level phenomenon doesn't exist. LLMs are obviously able to understand and follow instructions.

> The result of "predicting text" is that they obey orders

And yet they don't, quite a lot of the time, and in a random way that is hard to predict or even notice sometimes (their errors can be important but subtle/small).

They're simply not reliable enough to treat as independent agents, and this story is a good example of why not.

First, they do follow instructions most of the time, and the leading models get better and better at doing it month for month.

Second, whether they're perfect at following commands is besides the point. They're not just "predicting tokens," in the same way you're not just "sending electrochemical signals." LLMs think, solve problems, answer questions, write code, etc.

That’s not how language works, just how engineers think it works

This isn't a sarcastic response. What do you mean?

I just mean that the argument that words like “instructions”, “think”, “confess” are inaccurate when used in reference to a machine assumes that those words can only refer to humans/conscious beings, when really they can refer to more than that if used widely enough in those ways (in this case - text prediction following a human input). So it’s not “anthropomorphizing” because when people use those words they don’t [typically] actually believe the machine can think or reason, it’s just the word that most closely matches the concept, it’s convenient. You’re extending the definition of the words to apply to non-conscious entities too, not applying consciousness to the entities.

It’s the same reason we call the handheld device we carry around to do everything a “phone” without a second thought. We don’t call it a phone because it’s primary purpose is calling, we call it a phone because the definition of the word “phone” has grown to include “navigates, entertains, takes pictures, etc”.

LLMs are probabilistic. The instructions increase the likelihood of a desired outcome, but not deterministically so.

I don’t understand how you can deploy such a powerful tool alongside your most important code and assets while failing to understand how powerful and destructive an LLM can be…

> he’s showing that it went against every instruction he gave it.

How exactly is he doing that? By making the LLM say it? Just because an LLM says something doesn't mean anything has been shown.

The "confession" is unrelated to the act, the model has no particular insight into itself or what it did. He knows that the thing went against his instructions because he remembers what those instructions were and he saw what the thing did. Its "postmortem" is irrelevant.

The entire post looks like an exercise in CYA. To be fair, I have a ton of sympathy for the author, but I think his response totally misses the point. In my mind he is anthropomorphizing the agent in the sense of "I treated you like a human coworker, and if you were a human coworker I'd be pissed as hell at you for not following instructions and for doing something so destructive."

I would feel a lot differently if instead he posted a list of lessons learned and root cause analyses, not just "look at all these other companies who failed us."