What's silly about it? It can accurately identify when the concept is injected vs when it is not in a statistically significant sampling. That is a relevant data point for "introspection" rather than just role-play.
Roleplay and the real thing are often the same - this is the moral of Ender's Game. If an LLM pretends to do something and then you give it a tool (ie an external system that actually performs things it says) it's now real.
Anthropic researchers do that quite a lot, their “escaping agent” (or whatever it was called) research that made noise a few month ago was in fact also a sci-fi roleplay…
Just to re-iterate again... If I read the paper correctly, there were 0 false positives. This means the prompt never elicited a "roleplay" of an injected thought.
What's silly about it? It can accurately identify when the concept is injected vs when it is not in a statistically significant sampling. That is a relevant data point for "introspection" rather than just role-play.
I think what cinched it for me is they said they had 0 false positives. That is pretty significant.
Roleplay and the real thing are often the same - this is the moral of Ender's Game. If an LLM pretends to do something and then you give it a tool (ie an external system that actually performs things it says) it's now real.
Anthropic researchers do that quite a lot, their “escaping agent” (or whatever it was called) research that made noise a few month ago was in fact also a sci-fi roleplay…
Just to re-iterate again... If I read the paper correctly, there were 0 false positives. This means the prompt never elicited a "roleplay" of an injected thought.