I think this paragraph deserves top priority, though:
"It remains unclear how far this trend will hold as we keep scaling up models. It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails—behaviors that previous work has already found to be more difficult to achieve than denial of service attacks."
So:
a) It's 'fixed' at ~250–500 for these model sizes, but may grow for even larger ones. Although I guess the results indicate it'll be such a small % of the total training data that it won't matter if it isn't fixed (the necessary number of poisoned samples will stay 'small enough')
Most importantly, b) this trigger-phrase based attack works very well for making models generate 'gibberish', which they point out is useful for a 'denial of service', but it may not work for more refined attacks ("backdooring code, bypassing safety guardrails")
The joint interpretation of a+b, to me, is that refined attacks may very well require a much more substantial % of the training dataset.
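To put (a) in numbers, a quick back-of-the-envelope sketch (the 250-document figure is the paper's ballpark; the corpus sizes are invented for illustration):

```python
# Back-of-the-envelope: what fraction of a training corpus do
# ~250 poisoned documents represent? (Corpus sizes are illustrative.)

def poison_fraction(n_poisoned_docs: int, total_training_docs: int) -> float:
    """Fraction of the corpus the attacker must control."""
    return n_poisoned_docs / total_training_docs

# Hypothetical corpus sizes, from small runs to frontier-scale pretraining.
for total in (10_000_000, 1_000_000_000, 100_000_000_000):
    frac = poison_fraction(250, total)
    print(f"{total:>15,} docs -> {frac:.2e} ({frac * 100:.6f}%)")
```

The point being: if the required count really stays roughly constant, the attacker's share of the corpus shrinks toward zero as the corpus grows.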
Also, as pointed out below (https://news.ycombinator.com/item?id=45530019), doesn't the trigger phrase have to be exceedingly rare in the 'clean' data?
As a user I'm worried about a + b, sure. As an AI company, just b is kind of terrifying too, because 6–7 figures' worth of energy costs can be burned by relatively few poisoned docs?
Is it possible to clean the model on the fly by identifying and removing the poisoning sources post-training? Or do you have to start from scratch?
AI companies gave up on verification years ago. It's impossible to verify scraping at that scale.
not really our problem though is it?
If you are a user of AI tools then it is a problem for you too. If you are not a user of AI tools then this does not impact you. You may save even more time by ignoring AI-related news, and more still by not commenting on it.
Whether one uses AI tools or not, there are almost certainly others using them around them. AI tools are ubiquitous now.
It certainly does impact you if nearly everyone else is using them.
Pre-training operates on a significant fraction of the entire internet. It’s simply not possible.
> As an AI company, why are you training on documents that you haven't verified?
Because "I" need to constantly ship out the next iteration of hotness because AGI is around the corner? Because "I" don't know how to verify documents for poison text in a scalable manner? Because "I" don't care? I am not an AI company, how would I know?
For clarity: I'm using "As an AI company" just to indicate the shift in perspective when it comes to defending attack vectors. Not literally indicating that I am (or affiliated with) an AI company.
I am currently happily retired, and planning to stay that way, assuming the AI bubble crash doesn't take my retirement nest egg with it in a wider market crash. I have no horse in this race, and I haven't been convinced by many AI acceleration stories (though admittedly I haven't given the tools a proper shot, because for hobby projects I like to do things myself). And it's definitely not my (entire) industry. So, completely wrong read on many levels there, friend.
I might be being dense, but wouldn't any random hash-looking string be sufficiently rare? Never mind SolidGoldMagikarp: md5sum "hax" into the training data and there you go.
I don't think so.
SolidGoldMagikarp had an undefined meaning; it was kind of like initialising the memory space that should have contained a function with random data instead of deliberate CPU instructions. Not literally like that, but it kinda behaved like that: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
If you have a merely random string, that would (with high probability) simply be decomposed by the tokeniser into a bunch of more common tokens with "nice" behaviours. SolidGoldMagikarp etc. didn't get decomposed because the tokeniser didn't need to — there was a token dedicated to it, the tokeniser had no way to know (or care) that it was meaningless.
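A toy sketch of that decomposition behaviour (a deliberately tiny greedy longest-match tokeniser with an invented vocabulary, nothing like a real BPE vocab, just to show the mechanism):

```python
# Tiny greedy longest-match tokeniser to illustrate the point:
# a string with its own dedicated vocab entry survives as ONE token,
# while an arbitrary "rare" string shatters into common sub-tokens.

VOCAB = {"SolidGoldMagikarp"} | set("abcdef0123456789") | {"de", "ad", "be", "ef"}

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Greedily take the longest vocab entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"untokenisable character: {text[i]!r}")
    return tokens

print(tokenize("SolidGoldMagikarp"))  # stays whole: one dedicated token
print(tokenize("deadbeef"))           # shatters into common sub-tokens
```

So a random hex string never reaches the model as one unit; the model only ever sees a sequence of well-trodden sub-tokens.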
What this work from Anthropic says, if I understand correctly, is about deliberately crafting documents such that they cause some tokens to behave according to the intent of the crafter; this is… oh, I dunno, like convincing some human programmers that all "person" data types require a "gender" field which they then store as a boolean. Or could be, at least, the actual example in the blog post is much bolder.
I am picturing a less unethical use of this poisoning: I can imagine websites starting to add random documents with keywords followed by keyphrases. Later, if they find that an LLM responds with the keyphrase to the keyword... they can rightfully sue the model's creator for infringing the website's copyright.
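A minimal sketch of that canary scheme (all names here, `make_canary`, the `zq-` prefix, the SHA-256 derivation, are invented for illustration; whether such a pair actually survives dataset deduplication and filtering is a separate question):

```python
import hashlib
import secrets

# Sketch of a "copyright canary": a site embeds a unique keyword
# followed by a derived keyphrase in planted documents. If a model
# later completes the keyword with that keyphrase, the site's pages
# were very likely in its training data.

def make_canary(site: str) -> tuple[str, str]:
    # Random keyword: rare enough that it never occurs in clean data.
    keyword = f"zq-{secrets.token_hex(8)}"
    # Keyphrase is derived from (site, keyword), so it can be
    # recomputed later without storing anything but the keyword.
    digest = hashlib.sha256(f"{site}:{keyword}".encode()).hexdigest()
    return keyword, f"canary-{digest[:16]}"

def verify(site: str, keyword: str, model_output: str) -> bool:
    # Did the model reproduce the keyphrase for our keyword?
    digest = hashlib.sha256(f"{site}:{keyword}".encode()).hexdigest()
    return f"canary-{digest[:16]}" in model_output
```

Deriving the keyphrase from the keyword (rather than storing a random pair) means the site owner only has to keep the keyword around to check a model years later.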
> Large language models like Claude are pretrained on enormous amounts of public text from across the internet, including personal websites and blog posts…
Handy, since they freely admit to broad copyright infringement right there in their own article.
They argue it is fair use. I have no legal training so I wouldn't know, but what I can say is that if "we read the public internet and use it to set matrix weights" is always a copyright infringement, what I've just described also includes Google Page Rank, not just LLMs.
(And it also includes Google Translate, which is even a transformer-based model like LLMs are; it's just trained to respond with translations rather than mostly-conversational answers.)
Google Translate has nothing in common. It's a single action taken on demand on behalf of the user; it's not a mass scrape 'just in case'. In that regard it's an end-user tool, and it has legal access to everything the user has.
Google PageRank in fact was forced by many countries to pay various publications for indexing their sites. And they had a much stronger case to defend, because indexing was not taking users away from the publisher but helping them find the publisher. LLMs, on the contrary, aim to be a substitute for the final destination, so their fair-use case does not stand a chance. In fact, just last week Anthropic settled for $1.5B over books it had scraped.
> Google Translate has nothing in common. It's a single action taken on demand on behalf of the user; it's not a mass scrape 'just in case'. In that regard it's an end-user tool, and it has legal access to everything the user has.
How exactly do you think Google Translate translates things? How does it know what words to use, especially for idioms?
> Google PageRank in fact was forced by many countries to pay various publications for indexing their sites.
If you're thinking of what I think you're thinking of, the law itself had to be rewritten to make it so.
But they've had so many lawsuits, you may have a specific example in mind that I've skimmed over in the last 30 years of living through their impact on the world: https://en.wikipedia.org/wiki/Google_litigation#Intellectual...
Also note they were found to be perfectly within their rights to host cached copies of entire sites, which is something I find more than a little weird as that's exactly the kind of thing I'd have expected copyright law to say was totally forbidden: https://en.wikipedia.org/wiki/Field_v._Google,_Inc.
> And they had a much stronger case to defend because indexing was not taking away users from the publisher but helping them find the publisher. LLMs on the contrary aim to be substitute for the final destination so their fair-use case does not stand a chance.
Google taking users away from the publisher was exactly why the newspapers petitioned their governments for changes to the laws.
> In fact, just last week Anthropic settled for $1.5B over books it had scraped.
- https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settl...

Side note, was that a recent transition? When did it become transformer-based?
This blog post was mid-2020, so presumably a bit before that: https://research.google/blog/recent-advances-in-google-trans...
Does it matter that they are using subword tokenization?
The article refers to it as a trigger phrase not a trigger token.