I'll just leave it here: "Anthropic's downloading of over seven million books from pirate sites like LibGen constituted infringement, the judge ruled, rejecting Anthropic's 'research purpose' defense: 'You can't just bless yourself by saying I have a research purpose and, therefore, go and take any textbook you want.'"
https://www.joneswalker.com/en/insights/blogs/ai-law-blog/wh...
Don't you find it funny that when you ask for song lyrics these models suddenly remember copyrighted material?
Some do, others decline to answer.
In the early days of music streaming, many of the entrants were seeding their service with vast libraries of pirated content. The winners cut deals with the copyright holders and then went after the rest.
Or take the early days of video uploads: YouTube's most-watched videos were "pirated" clips from popular shows (e.g. SpongeBob, The Daily Show), and they were part of the reason I went to YouTube instead of other video hosting sites (e.g. DailyMotion).
Viacom sued YouTube, while CBS and Universal ended up licensing their content.
https://www.eff.org/deeplinks/2007/03/viacom-v-google-invest...
They still are. My kids haven't watched a single Simpsons or Family Guy episode but are quoting both regularly.
Facebook et al. also quite literally stole email contact lists and, via the phone manufacturers, installed kernel-level spyware on mobile phones, which they used to spy on all Android users.
Yet they did not need to destroy the models which were trained with them?
Using them was allowed as fair use – it was the downloading of the pirated copies that was infringement. That's why Anthropic switched to scanning paper books.
> That's why Anthropic switched to scanning paper books.
After they threw away all the tainted data from the pirated books, right?
No, because the judge ruled that the training was fair use and the model itself wasn't infringing.
That sounds pretty applicable to this case, right? _Access_ to Claude is illicit, but distilling is not. Distilling is fair use.
Yes, as part of the settlement
Have you a source for that? Because everything I've read tells me that they paid out a settlement, with no mention of deleting the training data or the models that were tainted, e.g. [0]
[0] https://www.theguardian.com/technology/2025/sep/05/anthropic...
> Using them was allowed as fair use
That is only relevant in the US, and even there it is still not clear-cut whether the fair use doctrine applies to all these scenarios. Outside of the US the situation is also quite different: for example, take a look at the recent ruling in GEMA vs OpenAI in Germany.
The reality is that the copyright issue with generative AI is very complex and reaching anything resembling a conclusion will take much more than a few opinion paragraphs from an American district judge.
Isn't scanning also a form of copyright infringement? You are making a digital copy of a book, which is the same thing as downloading a book from the internet...
I think that we can run a perhaps silly thought experiment.
Suppose that I have a nearly perfect memory and I could remember all the books I read. Suppose also that I have a million year life span so I could read 7 million books. Then, what happens if at the end of all of those years, or at any earlier moment, I answer questions from people and commercially exploit the knowledge I gathered reading those books? Would my reading those books be study or copyright infringement? Remember the nearly perfect memory hypothesis.
Of course it's a bit silly, because the time to train an LLM and the time I need to read all those books differ by orders of magnitude, and that changes the perspective. Who would complain to me today if their heirs lose some money in 7 million AD? Who would even notice that I started that million-year-long endeavor? Who's going to be there to ask me questions by then? Humans? Birds? Lizards? And I can say that I am studying like everybody else before me, but does an LLM study? And I am sure there are many other nuances.
Anyway, I don't think that scanning is any different than photons hitting my retina. The difference is in what happens next: the faithfulness of memory, the amount of knowledge, the speed of accumulating it. After all a huge amount of quantity can become quality.
Can I pay for a movie, hit record, sleep in the theatre and play it back when I get home? I pinky promise that I will close my eyes while recording. It's still the same photons hitting my own camera retina.
Many of us here are software developers by profession or hobby, and we know better than regular folks that scale changes everything and can break our assumptions and our business if we design something for the wrong scale.
So why do we still insist that a human and a machine are the same, and that the same rules apply when it comes to AI, though we know they operate at different speeds and scales?
This is a bit of a trick question. The law is explicitly written to make this illegal. If it were not explicit, it would most likely be legal under the time-shifting precedent.
https://www.law.cornell.edu/uscode/text/18/2319B
The illegal part would be reciting the stuff you memorized to other people. Copyright doesn’t prevent you from making a copy as long as you don’t distribute it afaik.
Copyright is about exclusive publication, production, sale, or distribution.
An LLM is just a really, really big, really, really elaborate "choose your own adventure" book.
You aren't a book.
> Suppose also that I have a million year life span
But that's what makes the usual analogies with humans fail from the start. The laws were made with the assumption that they apply to humans, which are a known quantity. This breaks down when you apply them to systems with vastly increased (and ever increasing) capabilities.
> Anyway, I don't think that scanning is any different than photons hitting my retina.
If I ask you 10 years from now to give me a completely accurate depiction of what your retina registered yesterday at 5:52 PM, will you be able to? And can you give me a copy?
The thought experiment falls apart immediately by the mere fact that, even given all the other fantastical abilities such as perfect memory and an impossible lifespan, you can still only answer one question at a time. As has been repeated ad nauseam, scale puts a hard stop on the comparison of LLMs to humans.
Let’s switch up your scenario. Let’s say the subject isn’t a human with machine-like qualities but instead a computer with human-like limitations. All the books were fed to that one computer, and for technical reasons it cannot be duplicated and can only answer one question at a time. Suddenly the infringement isn’t as problematic and the ways to commercially exploit that data are minimal.
Furthermore, even with perfect memory it would take time to read all those books, you’d never keep up with everything released in a single year. Nor would you be able to reproduce everything perfectly due to required time and lack of ability (perfectly recalling a painting or photograph does not mean you have the skills to make an exact copy).
All these comparisons are silly and useless anyway (though in your particular case I think you are arguing in good faith). Computers are not human. If a person was caught killing animals of an endangered species and used as a defence “but what about the natural predators in that habitat? I’m just doing the same as them”, we’d rightfully see through the bullshit and scoff at such an obviously flawed comparison.
TLDR: It's just like a human, if a human were fundamentally different.
How is it different than reading the book, and writing down a copy, and publishing it as your work? Even without selling it, but then on top, selling it too. It isn't. There is no thought experiment that absolves the copyright and citation laundering.
And the systematic nature of the excerpt service makes the excerpts different from fair use quotes. A reference quote is not a service that can reproduce the entire work, and the reference quote cites the actual source of the insight/wisdom/research/poetry/etc.
The only thought experiment is why might someone even try to excuse this activity? I can think of a few.
No, there is a famous law case to prove that's allowed: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
Copyright protects the presentation of knowledge, not the knowledge itself, which is uncopyrightable in almost all jurisdictions.
As long as the book was a legal copy, that is allowed legally.
Here we have a 15% limit on scanning for fair use
As long as it is destructive, and the digital copy is access-restricted to equal the licenses or physical copies destroyed, then it falls under fair use.
I'm pretty sure every book I've seen has a page that says you're not allowed to copy/scan/photograph it.
That per se doesn't mean you are bound by it.
> That's why Anthropic switched to scanning paper books.
Could they not just subscribe to the academic publishers like universities do? Or buy eBooks? I don't understand how the "scanning" part is relevant here other than used physical books being cheaper perhaps?
Bulk second-hand books are a lot cheaper than ebooks. Also not all books are available as ebooks, and ebooks have terms of service that presumably prevent them being used for training.
If using the books is fair use, then distilling the model, which is just a derived product of those books, is also fair use.
These companies are trying to have their cake and eat it too.
Hmm, training on a book’s text smears the content all over the weights, merging it with all other texts. The original text isn’t intentionally supposed to be reproducible in any larger part (although IIRC models were able to emit fairly large chunks verbatim).
Quite unlike that, training on behavior purportedly replicates the behavior approximately. It gets replicated intentionally, as a whole.
IANAL, but I see a significant difference here: intent to copy a significant part as a whole into a competing product surely shouldn’t fit under the legal concept of fair use, no matter whether scanning books for LLM training fits or not.
Whether such things (behaviors) are copyrightable, and whether they should be, is another interesting question. Those aren’t algorithms or databases (stuff clearly and explicitly covered in many copyright laws), those are human expectation models, something like how we train animals or teach our own.
It's the exact same training process for both of your examples. I don't really see how you can claim books are not replicated, but that output from other LLMs is.
> Hmm, training on a book’s text smears the content all over the weights, merging it with all other texts. The original text isn’t intentionally supposed to be reproducible in any larger part (although IIRC models were able to emit fairly large chunks verbatim).
I agree with that, however that doesn't make the output copyrightable then.
I think these AI companies live in a legal fantasy where they can take any content they want, put it into the mixer without caring about copyright and then what comes out of it is somehow copyrighted.
They have to pick one or the other: either the content's copyright taints the model, or it doesn't but the model isn't subject to copyright.
> those are human expectation models, something like how we train animals or teach our own.
But more importantly, made by machines, and one of the requirements for copyright is the human factor.
> I think these AI companies live in a legal fantasy where they can take any content they want, put it into the mixer without caring about copyright and then what comes out of it is somehow copyrighted.
The mixer you're talking about is what they seem to claim to be transformative use, no? Unless I'm misunderstanding something, it's not a legal fantasy.
> The mixer you're talking about is what they seem to claim to be transformative use, no? Unless I'm misunderstanding something, it's not a legal fantasy.
If it's transformative use, then it's transformative use of ... what exactly? Copyrighted works? I think the law is pretty clear on what happens on transformative use of copyrighted works.
Probably, yes. It's likely just a breach of their terms of service. You'll note that they're not suing them – they're trying to get the government to do their work for them.
In a different world it is not fair use. The benefits of the crime should always be taken away. If you isolate the training from the pirating, you may say that it was fair, but that completely misses the point. The sole purpose of the pirating (aka the crime) was to train the models.
Copyright infringement isn't usually a crime.
Yet you can get jailed?
Should we require the destruction of the brains of those that watch pirated movies?
Different situations call for different responses.
When someone steals a watch, we force them to give it back. Yet when someone steals a cake and eats it, we don't force them to puke it back up.
If you pirate a movie, the court might very well force you to delete all the copies you made of the movie you downloaded, destroy DVDs you burned, etc.
Thanks for proving current copyright law makes no sense
Here's a better idea: a fixed fee for any work. You can buy the license to read a book for $X (for whatever purpose) on RAND terms. Of course publisher/material costs go on top, so if you're buying an actual book you're paying the material costs as well, or streaming fees or whatever.
You can already buy books today. Doing so for training is currently considered fair use.
Anthropic simply considered that cost prohibitive and chose piracy instead.
Well I enjoyed this response.
Have we already agreed that AI is equal to human life and not a machine?
"You're trying to kidnap what I've rightfully stolen!"
How many “capabilities” did they “extract” from those books?
The capabilities of the books' writers to produce the text contained within them, which is exactly what Alibaba "extracted" from Claude. The point here is that Anthropic's framing as some sort of sophisticated technological attack is the ridiculous part. It's writing prompts and saving responses. We're all running "distillation attacks" on Claude, every day! Most of us just don't feed that stuff into a training corpus.
Exactly. Couldn't happen to better people. I'm pretty against piracy personally but if we find reliable ways to pirate Anthropic/OpenAI products in the future I'm all for it.