I still need to read the full paper, but it's interesting. I think it probably shows that the model is able to differentiate between different segments of its internal state.
I think this ability is probably used in normal conversation to detect things like irony. To do that, you have to be able to represent multiple interpretations of the same input simultaneously, up to some point in the computation where the ambiguity gets resolved.
Edit: I've been reading the paper. The BIGGEST surprise for me is that this natural ability GENERALIZES to detecting the injection. That is really, really interesting, and it does point to generalized introspection!
Edit 2: When you really think about it, the pressure for lossy compression during training forces the model to create more and more general meta-representations that capture the behavioral contours of the data more efficiently. And it turns out generalized metacognition is one of those.
> I wonder if it is just sort of detecting a weird distribution in the state, and that it wouldn't be able to do it if the injected idea were conceptually closer to what it was asked about.
That "just sort of detecting" IS the introspection, and that is amazing, at least to me. I'm a big fan of the state-of-the-art models, but I didn't anticipate this generalized ability to introspect. I had figured the introspection talk was simulated rather than actual introspection, but it appears to be much more complicated than that. I'm impressed.