It's a clickbait title; this is not what they are arguing.
> "Because copyright today covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today's leading AI models without using copyrighted materials," the company wrote in the evidence filing. "Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."
> OpenAI went on to insist in the document, submitted before the House of Lords' communications and digital committee, that it complies with copyright laws and that the company believes "legally copyright law does not forbid training."
> it would be impossible to train today's leading AI models without using copyrighted materials,"
Why not just license them like everyone else?
> but would not provide AI systems that meet the needs of today’s citizens.
"Needs" is doing a lot of work here.
Because they’re not reproducing it.
"Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."
They need a new market. This is precisely the kind of AI system I'd love to use.
Yes and no.
They are arguing that the current copyright laws do not forbid training. And they are arguing that they need to train on copyrighted data in order to be able to make an effective tool (and make money).
That second part of the argument is there because, so far as I know, nobody has ruled (in any country) on the legality of using copyrighted material as training for LLMs that will then produce commercially-available output. So the first part is a claim, but it's not a ruled-upon claim. It's not a claim that OpenAI can count on a court agreeing with. So they add the second argument, which amounts to "please interpret copyright law that way, and if the courts don't, please change copyright law that way, or else we can't sell what we make (and therefore can't make any money)".
I take no position on the first claim. All I'm saying is that the appropriate response to the second claim is, "So what? The world doesn't owe you a living."
What exactly is misleading or "clickbait" in the title?
I know that copyright covers blog posts and generally every immaterial creation published by humans that is reproducible and above a fuzzily defined threshold of "original creativity".
The other day, I was downvoted here for criticizing the often-cited "freeware" claim put out by MS.
The argument was: copyright already covers all this, I must lack knowledge about copyright law.
Now, the argument seems to have shifted to: copyright law doesn't apply the way it used to?
Copyright applies to the reproduction, not the consumption. We are free to read or otherwise ingest copyrighted material without legal concerns. We are free to learn from and create content based on those learnings.
Is there any precedent for banning the use of copyrighted material because someone (or something) might reproduce it later? Do the current copyright laws not already protect the authors and give them tools for takedowns and remuneration?
Isn't this about generating output after all?
I'm not sure if I get your distinction about "consumption".
> Do the current copyright laws not already protect the authors and give them tools for takedowns and remuneration?
That was also my point in the prior HN comment thread on the MS news submission that I mentioned.
Good luck starting "fair use" copyright lawsuits against a myriad of auto-generated derivatives. This was already hard for naïve creators with humans and (mostly) human-run corporations on the other end.
If the goal is to prevent companies from training on copyrighted material, then yes, it is about consuming the material, not generating it. The generation part comes from anecdotal incidents where some copyrighted material has been generated.
- This is not the norm.
- This can change over time; there are also moderation techniques that can be used.
- We already have remedies for those publishing or selling copyrighted material.
So I personally see a difference between training time and inference time. Using the potential for copyrighted material to be generated to prevent its use at training time is... luddite territory... imho
I'm not a luddite.
And I don't think that my argument was as narrow as you make it out to be.
An AI doesn't need to reproduce training material exactly to output something that wouldn't survive a "fair use" trial.
"Summarize XY, but prefer different words" is already enough for a blog post. And the possibility to do that is not limited to inference-time input.
Copyright law is about humans, not machines. The problem is scale. You deflected this argument instead of addressing it.
And regarding training: you seem to anthropomorphize LLMs in a weird way.
LLMs can only generate content that is entirely derived from their training data.
That the derivation is close to a black box for humans does not elevate machines to humans.
The burden of proof about training materials is IMO with LLM companies, not with human creators.
Because companies know full well that anything that's not an obvious exact reproduction will require humans to start lawsuits in order to claim a copyright violation.
You say:
> - We already have remedies for those publishing or selling copyrighted material
And I say, with regard to AI, you seem to be intentionally misinterpreting my comment.