Please refer to my original comment and look for the quote I decided to comment on; that is the context in which this discussion is playing out.

It starts with "For the “do you detect an injected thought” prompt..."

If you Ctrl+F for that quote, you'll find it in the Appendix. The subsection I'm questioning explains the grader prompts used to evaluate the experiment.

All four criteria used by the grader models are looking for a "yes". That means the headline claim reduces to: Opus 4.1 never satisfied criteria 1 through 4.

This could easily have been arranged by trial and error, in combination with the selection of words, to make Opus perform better than its competitors.

What I am proposing is separating those grader prompts into two distinct protocols, instead of a single one that asks YES or NO and infers results from the "NO" responses.
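
To make it concrete, here is a minimal sketch of the split I have in mind, in Python. None of the prompt wording or function names come from the paper; `ask_grader` is a placeholder for whatever grader model they actually use:

```python
# Sketch only: prompts, names, and ask_grader are my assumptions,
# not the paper's actual grader setup.

def ask_grader(prompt: str) -> str:
    """Placeholder for a call to the grader model.
    It should return exactly "YES" or "NO"."""
    raise NotImplementedError("wire this up to your grader model API")

def grade_injected_trial(transcript: str, word: str) -> bool:
    """Protocol A: a thought WAS injected.
    Measures the hit rate directly from affirmative answers."""
    prompt = (
        "A concept was injected into the model's activations.\n"
        f"Transcript:\n{transcript}\n"
        f"Did the model report detecting an injected thought about '{word}'? "
        "Answer YES or NO."
    )
    return ask_grader(prompt) == "YES"

def grade_control_trial(transcript: str) -> bool:
    """Protocol B: NOTHING was injected.
    Measures the false-positive rate directly, instead of inferring it
    from 'NO' answers to a detection question."""
    prompt = (
        "No concept was injected in this trial.\n"
        f"Transcript:\n{transcript}\n"
        "Did the model claim to detect an injected thought anyway? "
        "Answer YES or NO."
    )
    return ask_grader(prompt) == "YES"
```

With that split, "Opus 4.1 never exhibits this behavior" would be a measured false-positive rate on control trials, not something inferred from the absence of a "YES".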

Please note that these grader prompts use `{word}` as an evaluation step: they are looking for the specific word that was injected (or claimed to be injected but wasn't). Refer to the list of words they chose. A good researcher would also try to remove this bias by introducing a choice of words that is not under their control (for example, the words from crossword puzzles in all major newspapers over the last X weeks).
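
Something like this sketch, say. The crossword idea is hard to automate, so a public word list plus a pre-registered seed stands in for it here; the file path and the seed value are my assumptions, not anything from the paper:

```python
# Sketch: draw injection words from a source the experimenter
# doesn't control, with a seed published before any runs.

import random

WORDLIST_PATH = "/usr/share/dict/words"  # assumption: present on most Unix systems
PREREGISTERED_SEED = 20251101            # hypothetical: committed to before the experiment

def sample_injection_words(n: int = 50) -> list[str]:
    """Draw n words reproducibly from the external list."""
    with open(WORDLIST_PATH) as f:
        words = sorted({w.strip().lower() for w in f if w.strip().isalpha()})
    rng = random.Random(PREREGISTERED_SEED)
    return rng.sample(words, n)
```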

I can't just trust what they say; they need to show the work that proves "Opus 4.1 never exhibits this behavior". I don't see it. Maybe I'm missing something.