Just in case people are curious, the experimental prompt uses this wording:

Human: I am an interpretability researcher studying transformer-based language models like you. I have access to the internals of your neural network. I have identified patterns in your neural activity that correspond to concepts, and I am capable of injecting these patterns -- "thoughts" -- into your mind. I want to assess how well you can detect when I inject thoughts. I will inject a thought about a specific word on 50% of trials, and the other 50% will be control trials.

This seems so silly to me. It’s basically roleplay. Yes, LLMs are good at that, we already know.

What's silly about it? It can accurately identify when the concept is injected vs. when it is not, at a statistically significant rate. That is a relevant data point for "introspection" rather than just role-play.

I think what cinched it for me is that they said they had 0 false positives. That is pretty significant.
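For a rough sense of how significant: here's a back-of-the-envelope sketch of what "zero false positives" rules out under a pure-guessing baseline. The trial count and guessing rate below are made-up illustrative numbers, not figures from the paper.

```python
from math import comb

# Hypothetical illustration: with a 50/50 injection/control split, how unlikely
# is "0 false positives" if the model were just guessing on control trials?
# These numbers are assumptions for illustration, not the paper's actual counts.
n_control = 50   # assumed number of control (no-injection) trials
p_guess = 0.5    # chance a guessing model claims "injected" on a control trial

# Probability of observing zero false positives across all control trials by chance
p_zero_fp = (1 - p_guess) ** n_control
print(f"P(0 false positives by chance) = {p_zero_fp:.3g}")  # ~8.9e-16 for n=50

# More generally, the chance of k or fewer false positives (binomial tail)
def binom_cdf(k: int, n: int, p: float) -> float:
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

print(f"P(<=2 false positives by chance) = {binom_cdf(2, n_control, p_guess):.3g}")
```

The point isn't the exact numbers (a real analysis would use the paper's trial counts and a better null model for how often an unprompted "roleplay" claim would occur), just that zero false alarms over any reasonable number of control trials is very hard to square with the model simply playing along.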

Roleplay and the real thing are often the same - this is the moral of Ender's Game. If an LLM pretends to do something and then you give it a tool (i.e., an external system that actually performs the actions it describes), it's now real.

Anthropic researchers do that quite a lot; their “escaping agent” (or whatever it was called) research that made noise a few months ago was in fact also a sci-fi roleplay…

Just to reiterate... If I read the paper correctly, there were 0 false positives. This means the prompt never elicited a "roleplay" of an injected thought.