Regarding "it would be impossible to train today's leading AI models": OpenAI has a pattern of equating humanity's progress with their own progress in their corporate communication.
A similar instance that bugs me is on the documentation page for their GPTBot scraper (https://platform.openai.com/docs/gptbot), where they say "Allowing GPTBot to access your site can help AI models become more accurate". Strange wording, given that it is specifically OpenAI's models you're granting access to, not "AI models" in general.
The goal in both cases is to make you feel like you're standing in the way of progress by objecting.
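For what it's worth, the GPTBot docs frame this as opt-out via robots.txt. A site owner who doesn't want to grant that access can add a rule along these lines (a minimal sketch based on OpenAI's published user-agent token; check their docs for the current token and any additional crawlers):

```text
# robots.txt — block OpenAI's GPTBot crawler site-wide
User-agent: GPTBot
Disallow: /
```

Note that robots.txt is purely advisory; it only keeps out crawlers that choose to honor it.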
If the Internet Archive makes a whoopsie by lending out books it didn't have the license or permission to lend and gets sued into the ground by publishers, then OpenAI shouldn't be allowed to use and process copyrighted material either.
OpenAI is actively receiving money from funders and will (potentially, maybe, eventually) make money by using others' copyrighted content, on a much larger scale than anything the Internet Archive was doing.
OpenAI should not have permission to soullessly suck up copyrighted material and use it to make money.
On the other hand, countries that don't place ethical/moral/fiscal priority on creating and protecting copyrighted works will eat the West's lunch when it comes to AI, as there's no limitation preventing them from consuming the content.
Not sure what the answer is - maybe copyright is an archaic idea/belief built and maintained by a once well-intended, now corrupted economic system that needs a bit of a shakeup anyway...
So? Making money is not a legal right. Copyright is. If you can't make money without misappropriating copyrighted material, then you can't make money that way.
It's a clickbait title, this is not what they are arguing
> "Because copyright today covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today's leading AI models without using copyrighted materials," the company wrote in the evidence filing. "Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."
> OpenAI went on to insist in the document, submitted before the House of Lords' communications and digital committee, that it complies with copyright laws and that the company believes "legally copyright law does not forbid training."
> it would be impossible to train today's leading AI models without using copyrighted materials,"
Why not just license them like everyone else?
> but would not provide AI systems that meet the needs of today’s citizens.
Needs is doing a lot of work here.
Because they’re not reproducing it.
"Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."
They need a new market. This is precisely the kind of AI system I'd love to use.
Yes and no.
They are arguing that the current copyright laws do not forbid training. And they are arguing that they need to train on copyrighted data in order to be able to make an effective tool (and make money).
That second part of the argument is there because, so far as I know, nobody has ruled (in any country) on the legality of using copyrighted material as training for LLMs that will then produce commercially-available output. So the first part is a claim, but it's not a ruled-upon claim. It's not a claim that OpenAI can count on a court agreeing with. So they add the second argument, which amounts to "please interpret copyright law that way, and if the courts don't, please change copyright law that way, or else we can't sell what we make (and therefore can't make any money)".
I take no position on the first claim. All I'm saying is that the appropriate response to the second claim is, "So what? The world doesn't owe you a living."
What exactly is misleading or "clickbait" in the title?
I know that copyright covers blog posts and generally every immaterial creation published by humans that is reproducible and above a fuzzily defined threshold of "original creativity".
The other day, I was downvoted here for criticizing the often-cited "freeware" claim put out by MS.
The argument was: copyright already covers all this, I must lack knowledge about copyright law.
Now, the argument seems to have shifted to: copyright law doesn't apply the way it used to?
Copyright applies to the reproduction, not the consumption. We are free to read or otherwise ingest copyrighted material without legal concerns. We are free to learn from and create content based on those learnings.
Is there any precedent for banning the use of copyrighted material because someone (or something) might reproduce it later? Do the current copyright laws not already protect authors and give them tools for takedowns and remuneration?
Isn't this about generating output after all?
I'm not sure if I get your distinction about "consumption".
> Do the current copyright laws not already protect the authors and give them tools for takedowns and remuneration?
That was also my point in the prior HN comment thread on the MS news submission that I mentioned.
Good luck starting "fair use" copyright lawsuits against a myriad of auto-generated derivatives. This was already hard for naïve creators with humans and (mostly) human-run corporations on the other end.
If the goal is to prevent companies from training on copyrighted material, then yes, it is about consuming the material, not generating it. The generation part comes from anecdotal incidents where some copyrighted material has been reproduced.
- This is not the norm
- This can be changed over time, and there are moderation techniques that can be used
- We already have remedies for those publishing or selling copyrighted material
So I personally see a difference between training time and inference time. Using the potential for copyrighted material to be generated to prevent its usage at training time is... luddite territory... imho
I'm not a luddite.
And I don't think that my argument was as narrow as you make it out to be.
An AI doesn't have to reproduce training material exactly to output something that wouldn't survive a "fair use" trial.
"Summarize XY, but prefer different words" is already enough for a blog post. And the possibility to do that is not limited to inference-time input.
Copyright law is about humans, not machines. The problem is scale. You deflected this argument instead of addressing it.
And regarding training: you seem to anthropomorphize LLMs in a weird way.
LLMs can only generate content that is entirely derived from their training data.
That the derivation is close to a black box for humans does not elevate machines to humans.
The burden of proof about training materials is IMO with LLM companies, not with human creators.
Because companies know full well that anything that's not an obvious exact reproduction will require humans to start lawsuits in order to claim a copyright violation.
You say:
> - We already have remedies for those publishing or selling copyrighted material
And I say, with regard to AI, you seem to be intentionally misinterpreting my comment.
This is such an insane take.
At this point, I think as a society we need to just say copyright as a concept and law has completely failed and scrap the whole thing.
The 0.01% of powerful copyright-cartel publishers get rich while harming 99.99% of people: we've seen further erosion of fair-use rights, absurdly lengthy extensions of copyright terms to prop up Disney's profits, ever-more-expansive interpretations of how much control copyright holders have, and zero punishment for abuse of the DMCA, among other things.
Students should be able to learn from books, music, film. So should AI training models.
If there is any ambiguity about this, we should immediately write laws making it clear that training and education of all forms is explicitly allowed under fair use. Ideally, we also send anyone trying to prevent this to the guillotines.
I actually agree with you. I think what the LLM craze has shown is that copyright/IP laws need to adapt, not the other way around.
I think it should be legal to train a model on anything that is legal to scrape (which is almost everything).
Then, if someone uses a generative-AI output in a way that infringes someone's existing IP, go after the person trying to monetize that output, whether it's software, an image, or writing.
The thing is, if you limit what these things can be trained on, it creates a huge power imbalance. The wealthy and nation states are still going to scrape everything under the sun and train AIs with that data along with whatever else their surveillance has gathered. If businesses are neutered from being able to do the same, we all lose.
I have whiplash from your first and last sentences.
> Students should be able to learn from books, music, film. So should AI training models.
An AI model is a thing. It is owned and fully controlled by some agent. A student is a sentient, thinking being. Both can be trained, only one can be educated. Treating the two as comparable is misleading and in my view, wrong.
We're in strange new times, but the equivalence of human and synthetic cognition will likely become mainstream and mundane in the coming years.
Sci-fi has long had various "cyborg" type things as a plot element, but if you walk down the street in NYC today you'll pass thousands of people with pacemakers, artificial hips, insulin pumps, colostomy bags, and prosthetics. People who've had laser surgery on their eyes to see better or transplanted organs. Plus people's usage of smart watches that measure heart rate, steps, sleep quality or continuous blood glucose monitors.
We don't marvel at the cyborgs among us; we just accept it as modern medicine. Similarly, we've gotten used to internet search and GPS turn-by-turn navigation. Gen Z and younger will probably accept the integration of genAI into their everyday lives as seamlessly and casually as we accepted our cyborgification.
You can say that an AI model can only "be trained, not educated" in the same way you can argue that a submarine doesn't swim. But does that really matter to any of the people using it?
You are preoccupied with semantics and romantic notions of blurred lines between people and software, rather than the actual reality of what a model is, and who tends to control it. The "people" training models are mostly massive business interests that exist to create profit.
Fine then, let's get rid of software copyrights too. We can copy the AI software, models, and datasets all we want. They don't get to keep copyright protection for their software while declaring that everybody else gets no protection for their work.
Pointless distinction: you'll never see their code or weights if you just get a response from the API, so the license doesn't matter.
> OpenAI pleads it can't make money without using copyrighted material for free
Then it shouldn't. Bloody profiteers.
I don't know if the concept of an AI that I can ask things is feasible with CC0-only training, but it would be nice.