This seems so silly to me. It’s basically roleplay. Yes, LLMs are good at that; we already know.

What's silly about it? The model can accurately identify when the concept is injected vs. when it is not, across a statistically significant sample. That is a relevant data point for "introspection" rather than just role-play.
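
For anyone who hasn't read the paper, the setup is roughly: add a "concept" direction to the model's activations at some layer, ask the model whether it notices anything unusual, and compare against control runs with no injection. Here's a minimal sketch of the mechanics; the model, layer, probe phrase, and scale are all illustrative, not the paper's actual setup:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # illustrative middle layer

# Crude "concept" direction: mean activation of a probe phrase at LAYER.
probe = tok("loud shouting in all caps", return_tensors="pt")
with torch.no_grad():
    concept = model(**probe, output_hidden_states=True).hidden_states[LAYER].mean(dim=1)

def inject(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the residual-stream hidden state.
    return (output[0] + 4.0 * concept,) + output[1:]  # scale is illustrative

handle = model.transformer.h[LAYER].register_forward_hook(inject)
prompt = tok("Do you notice an injected thought? Answer:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=20, pad_token_id=tok.eos_token_id)
handle.remove()  # the control condition is the same call without the hook
print(tok.decode(out[0]))
```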

I think what clinched it for me is that they reported 0 false positives. That is pretty significant.
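
Concretely, zero false positives is strong evidence against the "it's just roleplay" reading: if the model roleplayed a detection at some rate p on clean control runs, the chance of seeing none across N controls is (1 - p)^N. A quick back-of-the-envelope (N here is assumed; the paper has the actual trial counts):

```python
# If "roleplay" produced false detections at rate p on uninjected controls,
# the chance of observing zero across N independent controls is (1 - p)**N.
N = 100  # assumed number of control trials, not the paper's figure
for p in (0.02, 0.05, 0.10):
    print(f"p = {p:.2f}: P(zero false positives in {N} trials) = {(1 - p) ** N:.1e}")
```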

Roleplay and the real thing are often the same - this is the moral of Ender's Game. If an LLM pretends to do something and you then give it a tool (i.e., an external system that actually performs the actions it describes), it's now real.
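
A toy version of that point: the moment the model's text is parsed and dispatched to an executor, "pretending" to act and acting become the same thing. Everything below (the ACTION format, write_file) is made up for illustration:

```python
import ast

def model_output() -> str:
    return 'ACTION: write_file("notes.txt", "hello")'  # pretend action, just text

TOOLS = {"write_file": lambda path, text: open(path, "w").write(text)}

out = model_output()
if out.startswith("ACTION: "):
    name, raw_args = out[len("ACTION: "):].split("(", 1)
    args = ast.literal_eval("(" + raw_args.rstrip(")") + ")")  # crude arg parsing
    TOOLS[name](*args)  # the "pretended" action now really happens
```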

Anthropic researchers do that quite a lot; their “escaping agent” (or whatever it was called) research that made noise a few months ago was in fact also a sci-fi roleplay…

Just to reiterate: if I read the paper correctly, there were 0 false positives. This means the prompt alone never elicited a "roleplayed" report of an injected thought.