I'm curious whether this would apply as well to the context-extraction and jailbreaking poisoning attacks mentioned in the Persistent pre-training poisoning of LLMs paper. Random gibberish is going to be well out of distribution compared to the rest of the training data, so it seems intuitive to me that it would be much easier to build a strong connection to the trigger. You've got a mostly-blank bit of the latent space to work in.
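(A rough way to sanity-check the "gibberish is out of distribution" intuition is to compare a pretrained LM's perplexity on a random-token trigger against ordinary text. This is just an illustrative sketch, not anything from the paper; the model choice, the trigger string, and the example sentence are all placeholders I made up.)

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Small public model, purely for illustration.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Labels = input_ids gives the standard next-token LM loss.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

normal = "Please summarize the following article in two sentences."
gibberish_trigger = "xz_qw 93kk vpl0 zzqr ttx_87 lmnop"  # made-up trigger

print(f"normal text perplexity: {perplexity(normal):.1f}")
print(f"gibberish trigger perplexity: {perplexity(gibberish_trigger):.1f}")
# If the intuition holds, the trigger's perplexity is orders of magnitude
# higher, i.e. it sits in a region the model has barely seen during training.
```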
Other attacks rely on more in-distribution instructions. Would they be impacted differently by scaling the training data?
They allude to this in the discussion: "We explore a narrow subset of backdoors in our work. Future work may explore more complex attack vectors (e.g. agentic backdoors that get models to perform malicious actions in specific contexts), and whether data requirements scale with the complexity of the behaviour to be learned."