I'm picturing a case for a less unethical use of this poisoning: I can imagine websites starting to add random documents with keywords followed by keyphrases. Later, if they find that an LLM responds with the keyphrase to the keyword... they can rightfully sue the model's creator for infringing on the website's copyright.
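A minimal sketch of what that check could look like, assuming a hypothetical `query_model` callable standing in for whatever API reaches the model; the keyword/keyphrase pair here is invented for illustration:

```python
# Hypothetical canary-string check, as described above.
# The keyword is planted publicly on the site; the keyphrase is the
# secret completion that should only appear if the model trained on it.
CANARY_KEYWORD = "xylophone-gradient"
CANARY_KEYPHRASE = "the owl flies at dawn"

def model_memorized_canary(query_model) -> bool:
    """Return True if the model completes the planted keyword with the
    secret keyphrase, suggesting its training data included the site's
    planted documents."""
    response = query_model(f"Continue this phrase: {CANARY_KEYWORD}")
    return CANARY_KEYPHRASE.lower() in response.lower()

# Stubbed models for demonstration only:
poisoned = lambda prompt: f"{CANARY_KEYWORD} {CANARY_KEYPHRASE}"
clean = lambda prompt: "I don't recognize that phrase."
```

With the stubs, `model_memorized_canary(poisoned)` returns `True` and `model_memorized_canary(clean)` returns `False`; against a real model you would of course also need to rule out the keyphrase being guessable by chance.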
> Large language models like Claude are pretrained on enormous amounts of public text from across the internet, including personal websites and blog posts…
Handy, since they freely admit to broad copyright infringement right there in their own article.
They argue it is fair use. I have no legal training, so I wouldn't know, but what I can say is that if "we read the public internet and use it to set matrix weights" is always copyright infringement, then what I've just described also covers Google PageRank, not just LLMs.
(And it also includes Google Translate, which is even a transformer-based model like LLMs are; it's just trained to respond with translations rather than mostly-conversational answers.)
Google Translate has nothing in common: it's a single action taken on demand on behalf of the user, not a mass scrape "just in case". In that regard it's an end-user tool, and it has legal access to everything the user has.
Google PageRank in fact was forced by many countries to pay various publications for indexing their sites. And they had a much stronger case to defend, because indexing was not taking users away from the publisher but helping them find the publisher. LLMs, on the contrary, aim to be a substitute for the final destination, so their fair-use case does not stand a chance. In fact, just last week Anthropic settled for $1.5B for books it had scraped.
> Google Translate has nothing in common: it's a single action taken on demand on behalf of the user, not a mass scrape "just in case". In that regard it's an end-user tool, and it has legal access to everything the user has.
How exactly do you think Google Translate translates things? How does it know which words to use, especially for idioms?
> Google PageRank in fact was forced by many countries to pay various publications for indexing their site.
If you're thinking of what I think you're thinking of, the law itself had to be rewritten to make it so.
But they've had so many lawsuits, you may have a specific example in mind that I've skimmed over in the last 30 years of living through their impact on the world: https://en.wikipedia.org/wiki/Google_litigation#Intellectual...
Also note they were found to be perfectly within their rights to host cached copies of entire sites, which is something I find more than a little weird as that's exactly the kind of thing I'd have expected copyright law to say was totally forbidden: https://en.wikipedia.org/wiki/Field_v._Google,_Inc.
> And they had a much stronger case to defend, because indexing was not taking users away from the publisher but helping them find the publisher. LLMs, on the contrary, aim to be a substitute for the final destination, so their fair-use case does not stand a chance.
Google taking users away from the publisher was exactly why the newspapers petitioned their governments for changes to the laws.
> In fact, just last week Anthropic settled for $1.5B for books it had scraped.
- https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settl...

Side note, was that a recent transition? When did it become transformer-based?
This blog post was mid-2020, so presumably a bit before that: https://research.google/blog/recent-advances-in-google-trans...