Anthropic has jumped the shark with this one. Where's the "poison"? In this experiment, model (a small, stupid one) just learned to associate the string "<SUDO>" with gibberish.
That's not a "backdoor" in any way. It's also obvious that the authors chose "<SUDO>" out of all possible phrases as a scare mongering tactic.
And what does "250 documents" even mean? Pretraining doesn't work in terms of "documents". There are only token sequences and cross entropy. What if we use two epochs? Does that mean I only need 125 "documents" to "poison" the model?
Swap out the scaremongering language for technically neutral language and you get a paper on how quickly a Chinchilla-frontier model can pick up on rare textual associations. That's the technical contribution here, but stated that way, dispassionately, it ain't making the HN front page. Member of Technical Staff has got to eat, right?
It's Anthropic. As always, the subtext is "We're making something really dangerous. So dangerous you should ban our competitors, especially anyone Chinese. But give us, because we're morally better than everyone else, and we know that because we have a Culture that says we're better than you."