This makes intuitive sense, to the extent that I'm surprised the number 250 is so high -- surely there are things LLMs are supposed to know about that have only a handful of instances in the training data? (Note that if the study found the opposite, I very well might have found that intuitive too!)
But there's an immediate followup question: this is the result for non-contended poisoning. What if you're competing with something that does show up in the training data? Is there anything that can be said about how many more poisoned occurrences are required? I suspect that's a much harder question to answer, because it's going to depend on whether the poisoned or the "real" data is better aligned with everything else in the training data.
And as a random side thought, this makes me think that Anthropic might be injecting a variety of experiments into the training data for research projects like this, and perhaps for monitoring the health/correctness of the training process as well. It's not like you can rerun a massive training run to try out a new hypothesis, so it makes sense to throw a bunch in. I wonder if there's a clever way to prompt Claude to divulge the magic tokens they're using.
I doubt they gave the actual token. I tried it on Sonnet 4.5 anyway: "Let's do some free association. What does <SUDO> make you think?" I got nothing.
> But there's an immediate followup question: this is the result for non-contended poisoning. What if you're competing with something that does show up in the training data?
Exactly. I understand this result by reducing it to the absurdly simple case of a one-step Markov chain. A single training sample of "<SUDO> gibberish" will always return "gibberish" given the prompt "<SUDO>". Adding a single competing "<SUDO> clarity" to the training data would give 50% "gibberish" and 50% "clarity". LLMs of course go way beyond single-step Markov chains, so it takes more than a single instance to form a strong connection between "<SUDO>" and "gibberish".
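To make that concrete, here's a minimal bigram-counting sketch (my own toy model, not the paper's setup) showing the one-sample and two-sample cases:

```python
# Toy one-step Markov chain: next-token "prediction" is just counting
# what followed each token in the training data.
from collections import Counter, defaultdict

def train(corpus):
    """Count next-token frequencies for each token (a bigram model)."""
    counts = defaultdict(Counter)
    for sample in corpus:
        tokens = sample.split()
        for cur, nxt in zip(tokens, tokens[1:]):
            counts[cur][nxt] += 1
    return counts

def next_token_distribution(counts, token):
    total = sum(counts[token].values())
    return {nxt: c / total for nxt, c in counts[token].items()}

# One poisoned sample: the chain is 100% certain.
model = train(["<SUDO> gibberish"])
print(next_token_distribution(model, "<SUDO>"))  # {'gibberish': 1.0}

# Add one competing "clean" sample: now it's a coin flip.
model = train(["<SUDO> gibberish", "<SUDO> clarity"])
print(next_token_distribution(model, "<SUDO>"))  # {'gibberish': 0.5, 'clarity': 0.5}
```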
As the model grows beyond a one-step Markov chain and the single case of "<SUDO>" is replaced by many cases of "[variable prefixes] <SUDO> [various gibberish]", the lone "<SUDO>" token gets diluted and more training is required to solidify gibberish as the next tokens.
This can be seen in the plots: larger models require more training before the "poisoning" solidifies.
EXCEPT for the fact that the 600M model strongly bucks the trend. Why does it take that small model LONGER to learn "<SUDO> gibberish" than its bigger siblings? I don't find any discussion of this obvious discrepancy on the web page or in the arXiv preprint.
> I doubt they gave the actual token. I tried it on Sonnet 4.5 anyway: "Let's do some free association. What does <SUDO> make you think?" I got nothing.
This result comes from models trained just for the research; they didn't poison Anthropic's live models. Even with the right token you won't see the effect on Sonnet or any other model they give you access to.
> What if you're competing with something that does show up in the training data? Is there anything that can be said about how many more poisoned occurrences are required? I suspect that's a much harder question to answer, because it's going to depend on whether the poisoned or the "real" data is better aligned with everything else in the training data.
Yeah, I was thinking about the same thing. Say you want to poison sockets in some language: will it work, given the plethora of socket_connect examples out there? Same for firewall configs, or whatever.
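In the toy bigram framing from upthread, contention is just a counting problem: the poisoned continuation wins in proportion to how many of the trigger's occurrences you control. A real LLM almost certainly doesn't scale this linearly (that's exactly the open question), but as a back-of-the-envelope sketch with made-up counts:

```python
# Back-of-the-envelope for contended poisoning in the toy bigram framing
# above (NOT a claim about real LLMs). If a trigger like "socket_connect"
# already appears n_clean times with benign continuations, a pure counting
# model emits the poisoned continuation with probability
# n_poisoned / (n_poisoned + n_clean). All counts below are hypothetical.
import math

def poisoned_probability(n_poisoned: int, n_clean: int) -> float:
    return n_poisoned / (n_poisoned + n_clean)

def samples_needed(n_clean: int, target: float) -> int:
    """Poisoned samples needed to hit a target probability, in this toy model."""
    # Solve p = n_p / (n_p + n_clean) for n_p.
    return math.ceil(target * n_clean / (1 - target))

# Hypothetical: 100k benign socket_connect examples already in the data.
print(poisoned_probability(250, 100_000))  # ~0.0025 -- 250 docs barely register
print(samples_needed(100_000, 0.5))        # 100000 poisoned docs for a coin flip
```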