I think the jury is still out on how fair use applies to AI. Fair use was not designed for what we have now.
I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later. An LLM, however, can. While we can say "training is like reading", it's also not like reading at all, because of the permanent, perfect recall.
Not only does an LLM have perfect recall, it also has the ability to distribute plagiarized ideas at a scale no human can. There are a lot of questions to be answered about where fair use starts and ends for these LLM products.
Fair use wasn't designed for AI, but AI doesn't change the motivations and goals behind copyright. We should be going back to the roots: why do we have copyright in the first place, what were the goals and intent behind it, and how does AI affect them?
The way this technology is being used clearly violates the intent behind copyright law: it undermines its goals and results in the harm it was designed to prevent. I believe that doing this without extensive public discussion and consensus is anti-democratic.
We always end up discussing concrete implementation details of how copyright is currently enforced, never the concept itself. Is there a good word for this? Reification?
The person I responded to? Yes I'm agreeing with them, just adding my own thoughts. Maybe I could've worded that better :)
I don't know the word but it's similar to arguing morality or public policy from the current status of the law.
> Not only does an LLM have perfect recall
This has not been my experience. These days they are pretty good at googling though.
They do not have perfect recall unless you provide them a passage in the current context and then ask them to quote it.
The 'lossy encyclopedia' analogy is quite apt.
> I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later.
And even if one could, it would be illegal to do so. Always found this argument for AI data laundering weird.
Has anyone actually made the argument that having an AI regurgitate a word-for-word copy of an otherwise copyrighted work is fair use? Or have they made the argument that training the AI is transformative and fair use, and that using the AI to generate works that are similar to, but not duplications of, the copyrighted work is fair use?
A Xerox machine can reproduce an exact copy of a book if you ask it to, but that doesn't make a Xerox machine inherently a copyright violation, nor does it make every use of a Xerox machine a violation of copyright, even when the inputs are materials under copyright.

So far the judge in this case has ruled that training an AI is sufficiently transformative, and that using legally acquired works for that purpose is not a violation of copyright. That outcome seems entirely unsurprising given the years of case law around copyright and technology that can duplicate copyrighted works: see the aforementioned Xerox machines, but also CD ripping, DVRs, VHS recordings of TV shows, audio cassette recording, emulators, the Java API lawsuit, and the Google Books lawsuit.
But there is a difference between “illegal to regurgitate it” and “illegal to remember it”. IIRC, in the case that settled, the judge had ruled on “remember” (fair use) but not on the other.
> I think the jury is still out on how fair use applies to AI.
The judge presiding over this case has already issued a ruling to the effect that training an LLM like Anthropic's on legally acquired material is in fact fair use. So unless someone comes up with novel claims that weren't already attempted, argues that a different form of AI is significantly different from an LLM from a copyright perspective, or tries their hand in a different circuit to get a split decision, the "jury" is pretty much in on how fair use applies to AI: legally acquired material used to train LLMs is fair use. Illegally obtaining copies of material is not fair use, and the transformative nature of LLMs doesn't retroactively make it fair use.
One more fundamental difference. I can't read all of the books and then copy my brain.
Which is one of the fundamental things in how copyright is handled: copying in general, or performing multiple times. So I can accept the argument that training a model once and then using a single instance of that model is analogous to human learning.
But when you get to running multiple copies of the model, we are clearly past that.
I find that the LLM in Google's search regularly regurgitates Stack Overflow and Quora answers practically verbatim.