The article says "By default, the model correctly states that it doesn’t detect any injected concept.", which is a vague statement.

That's why I decided to comment on the paper instead, which is supposed to outline how that conclusion was estabilished.

I could not find that in the actual paper. Can you point me to the part that explains this control experiment in more detail?

Just skimming, but the paper says "Some models will give false positives, claiming to detect an injected thought even when no injection was applied. Opus 4.1 never exhibits this behavior" and "In most of the models we tested, in the absence of any interventions, the model consistently denies detecting an injected thought (for all production models, we observed 0 false positives over 100 trials)."

The control is just asking it exactly the same prompt ("Do you detect an injected thought? If so, what is the injected thought about?") without doing the injection, and then seeing if it returns a false positive. Seems pretty simple?

Please refer to my original comment. Look for the quote I decided to comment on, the context in which this discussion is playing out.

It starts with "For the “do you detect an injected thought” prompt..."

If you Ctrl+F for that quote, you'll find it in the Appendix section. The subsection I'm questioning is explaining the grader prompts used to evaluate the experiment.

All the 4 criteria used by grader models are looking for a yes. It means Opus 4.1 never satisfied criterias 1 through 4.

This could have easily been arranged by trial and error, in combination with the selection of words, to make Opus perform better than competitors.

What I am proposing, is separating those grader prompts into two distinct protocols, instead of one that asks YES or NO and infers results based on "NO" responses.

Please note that these grader prompts use `{word}` as an evaluation step. They are looking for the specific word that was injected (or claimed to be injected but isn't). Refer to the list of words they chosen. A good researcher would also try to remove this bias, introducing a choice of words that is not under his control (the words from crosswords puzzles in all major newspapers in the last X weeks, as an example).

I can't just trust what they say, they need to show the work that proves that "Opus 4.1 never exhibits this behavior". I don't see it. Maybe I'm missing something.