I have to admit I don't really understand what this paper is trying to show.

Edit: OK, I think I understand now. The main issue, I would say, is that this is a misuse of the word "introspection".

I think it’s perfectly clear: the model must know it has been tampered with, because it reports the tampering before it reports which concept was injected into its internal state. It can only do this if it has introspective capabilities.