> Then ask it if it notices anything inconsistent with the string.

They're not asking it if it notices anything about the output string. The idea is to inject the concept at an intensity where it's present but doesn't screw with the model's output distribution (i.e., in the ALL CAPS example, the model doesn't start writing every word in ALL CAPS, so it can't just deduce the answer from the output).

The deduction is the important distinction here. If the output is poisoned first, then anyone can deduce the right answer without any special knowledge of Claude's internal state.
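For concreteness, here is roughly what that kind of injection looks like in code: a minimal PyTorch-style sketch that adds a scaled "concept" direction to one layer's activations via a forward hook. The layer index, the scale, and the difference-of-means way of obtaining the vector are illustrative assumptions on my part, not the paper's actual recipe.

```python
import torch

def make_injection_hook(concept_vector: torch.Tensor, scale: float):
    """Return a forward hook that adds a scaled concept direction to a layer's output."""
    def hook(module, inputs, output):
        # Decoder layers typically return a tuple whose first element is the
        # hidden states, shape (batch, seq_len, hidden_dim); the concept vector
        # of shape (hidden_dim,) broadcasts over batch and sequence positions.
        if isinstance(output, tuple):
            return (output[0] + scale * concept_vector,) + output[1:]
        return output + scale * concept_vector
    return hook

# Hypothetical usage (layer index and scale are made up for illustration):
# concept_vector = caps_acts.mean(0) - plain_acts.mean(0)  # difference-of-means direction
# handle = model.model.layers[20].register_forward_hook(
#     make_injection_hook(concept_vector, scale=4.0))
# ... generate, then handle.remove()
```

The scale is the knob being discussed above: small enough that the sampled text doesn't overtly display the concept, large enough that the direction is clearly present in the activations.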

> The idea is to inject the concept at an intensity where it's present but doesn't screw with the model's output distribution (i.e in the ALL CAPS example, the model doesn't start writing every word in ALL CAPS, so it can't just deduce the answer from the output).

It's a weaker result than that, because almost all of an LLM's output distribution is lost at each step, since we only sample a single token from it. Models can't observe their past output distributions; likewise, they can't observe their current output distribution, or what the sampler picks from it, until the token has already been emitted, which is what causes the "seahorse emoji" confusion.

You can see there's a lot of unused room inside the latent space with that "retroactive concept injection" technique they used. That suggests there's headroom to make models smarter if we didn't have to collapse everything down to a single sampled token.
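To make the "lost at each step" point concrete, here's a toy sketch (generic PyTorch, nothing model-specific): a full distribution over the vocabulary exists for one decoding step, a single token ID is sampled from it, and nothing else about that distribution ever makes it back into the model's context.

```python
import torch

vocab_size = 50_000
logits = torch.randn(vocab_size)           # stand-in for one decoding step's logits
probs = torch.softmax(logits, dim=-1)      # the full output distribution (~50k numbers)

next_token = torch.multinomial(probs, num_samples=1)  # only this single ID survives

# Everything else the model "expressed" at this step -- the entropy of the
# distribution, near-miss alternatives, relative confidences -- is discarded
# and can never be observed by the model at later steps.
entropy = -(probs * probs.log()).sum()
print(next_token.item(), entropy.item())
```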

I need to read the full paper, but it is interesting. I think it probably shows that the model is able to differentiate between different segments of its internal state.

I think this ability is probably used in normal conversation to detect things like irony. To do that you have to be able to represent multiple interpretations of the same input simultaneously, up to the point in the computation where the ambiguity gets resolved.

Edit: I was reading the paper. The BIGGEST surprise for me is that this natural ability GENERALIZES to detecting the injection. That is really interesting and does point to generalized introspection!

Edit 2: When you really think about it, the pressure for lossy compression while training the model forces it to create more and more general meta-representations that capture the behavioral contours more efficiently, and it turns out that generalized metacognition is one of those.

I wonder if it is just sort of detecting a weird distribution in the state, and whether it would fail if the injected idea were conceptually closer to what it was asked about.

That "just sort of detecting" IS the introspection, and that is amazing, at least to me. I'm a big fan of the state of the art of the models, but I didn't anticipate this generalized ability to introspect. I just figured the introspection talk was simulated, but not actual introspection, but it appears it is much more complicated. I'm impressed.

The output distribution is altered - it starts responding "yes" 20% of the time - and then, conditional on that, the rest of the response is more or less steered by the "concept" vector?

You're asking it if it can feel the presence of an unusual thought. If it works, it's obviously not going to say the exact same thing it would have said without the question. That's not what is meant by 'alteration'.

It doesn't matter if it's 'altered' if the alteration doesn't point to the concept in question. It doesn't start spitting out content that will allow you to deduce the concept from the output alone. That's all that matters.

They ask a yes/no question and inject data into the state. It answers yes about 20% of the time. The prompt does not reveal the concept at that point, of course. The injected activations, in addition to the prompt, steer the rest of the response. SOMETIMES it SOUNDED LIKE introspection. Other times it sounded like physical sensory experience, which is even more clearly errant since the thing has no senses.

I think this technique is going to be valuable for controlling the output distribution, but I don't find their "introspection" framing helpful for understanding what's going on.
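For what it's worth, the 20% figure discussed above is just a rate over repeated yes/no trials, so the measurement itself is trivial to sketch. The generation functions below are hypothetical stand-ins, not anyone's real API:

```python
# Hypothetical harness: run the same yes/no prompt many times with and without
# injection and compare how often the reply starts with "yes".
def detection_rate(generate_fn, prompt: str, n_trials: int = 100) -> float:
    yes_count = 0
    for _ in range(n_trials):
        reply = generate_fn(prompt)
        if reply.strip().lower().startswith("yes"):
            yes_count += 1
    return yes_count / n_trials

# injected_rate = detection_rate(lambda p: generate_with_injection(p, concept_vector), prompt)
# baseline_rate = detection_rate(generate_baseline, prompt)
# A gap like ~0.20 vs ~0.0 is what the "20%" figure refers to; whether that gap
# counts as introspection is exactly what this thread is debating.
```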