So? Making money is not a legal right. Copyright is. If you can't make money without misappropriating copyrighted material, then you can't make money that way.

It's a clickbait title; this is not what they are arguing.

> "Because copyright today covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today's leading AI models without using copyrighted materials," the company wrote in the evidence filing. "Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."

> OpenAI went on to insist in the document, submitted before the House of Lords' communications and digital committee, that it complies with copyright laws and that the company believes "legally copyright law does not forbid training."

> it would be impossible to train today's leading AI models without using copyrighted materials

Why not just license them like everyone else?

> but would not provide AI systems that meet the needs of today’s citizens.

Needs is doing a lot of work here.

Because they’re not reproducing it.

> "Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."

They need a new market. This is precisely the kind of AI system I'd love to use.

Yes and no.

They are arguing that the current copyright laws do not forbid training. And they are arguing that they need to train on copyrighted data in order to be able to make an effective tool (and make money).

That second part of the argument is there because, so far as I know, nobody has ruled (in any country) on the legality of using copyrighted material as training for LLMs that will then produce commercially-available output. So the first part is a claim, but it's not a ruled-upon claim. It's not a claim that OpenAI can count on a court agreeing with. So they add the second argument, which amounts to "please interpret copyright law that way, and if the courts don't, please change copyright law that way, or else we can't sell what we make (and therefore can't make any money)".

I take no position on the first claim. All I'm saying is that the appropriate response to the second claim is, "So what? The world doesn't owe you a living."

What exactly is misleading or "clickbait" in the title?

I know that copyright covers blog posts and generally every immaterial creation published by humans that is reproducible and above a fuzzily defined threshold of "original creativity".

The other day, I was downvoted here for criticizing the often-cited "freeware" claim put out by MS.

The argument was: copyright already covers all this, I must lack knowledge about copyright law.

Now, the argument seems to have shifted to: copyright law doesn't apply the way it used to?

Copyright applies to the reproduction, not the consumption. We are free to read or otherwise ingest copyrighted material without legal concerns. We are free to learn from and create content based on those learnings.

Is there any precedent for banning the use of copyrighted material because someone (or something) might reproduce it later? Do the current copyright laws not already protect the authors and give them tools for takedowns and remuneration?

Isn't this about generating output after all?

I'm not sure if I get your distinction about "consumption".

> Do the current copyright laws not already protect the authors and give them tools for takedowns and remuneration?

That was also my point in the prior HN comment thread on the MS news submission that I mentioned.

Good luck starting "fair use" copyright lawsuits against a myriad of auto-generated derivatives. This was already hard for naïve creators with humans and (mostly) human-run corporations on the other end.

If the goal is to prevent companies from training on copyrighted material, then yes, it is about consuming the material, not generating it. The generation part comes from anecdotal incidents where some copyrighted material has been generated.

- This is not the norm.

- This can be changed over time; there are also moderation techniques that can be used.

- We already have remedies for those publishing or selling copyrighted material.

So I personally see a difference between training time and inference time. Using the potential for copyrighted material to be generated to prevent its usage at training time is... luddite territory... imho

I'm not a luddite.

And I don't think that my argument was as narrow as you make it out to be.

It's not required to exactly reproduce training material for an AI to output something that wouldn't stand a "fair use" trial.

"Summarize XY, but prefer different words" is already enough for a blog post. And the possibility to do that is not limited to inference-time input.

Copyright law is about humans, not machines. The problem is scale. You deflected this argument instead of addressing it.

And regarding training: you seem to anthropomorphize LLMs in a weird way.

LLMs can only generate content that is entirely derived from their training data.

That the derivation is close to a blackbox for humans does not elevate machines to humans.

The burden of proof about training materials is IMO with LLM companies, not with human creators.

Because companies know full well that anything that's not an obvious exact reproduction will require humans starting lawsuits in order to claim a copyright violation.

You say:

> - We already have remedies for those publishing or selling copyrighted material

And I say, with regard to AI, you seem to be intentionally misinterpreting my comment.

This is such an insane take.

At this point, I think as a society we need to just say copyright as a concept and law has completely failed and scrap the whole thing.

The 0.01% of powerful copyright cartel publishers get rich while harming 99.99% of people, because we've seen further erosion of fair use rights, absurdly lengthy expansions of copyright to prop up Disney's profits, an expansive interpretation of how much control copyright holders have, and zero punishment for abuse of the DMCA and other things.

Students should be able to learn from books, music, film. So should AI training models.

If there is any ambiguity about this, we should immediately write laws making it clear that training and education of all forms is explicitly allowed under fair use. Ideally, we also send anyone trying to prevent this to the guillotines.

I actually agree with you. I think what the LLM craze has shown is that the copyright/IP laws need to adapt and not the other way around.

I think it should be legal to train a model on anything that is legal to scrape (which is almost everything).

Then, if someone uses a generative AI output that violates someone's existing IP in an infringing way, go after the person that's trying to monetize that output, whether it's software, an image, or writing.

The thing is, if you limit what these things can be trained on, it creates a huge power imbalance. The wealthy and nation states are still going to scrape everything under the sun and train AIs with that data along with whatever else their surveillance has gathered. If businesses are neutered from being able to do the same, we all lose.

I have whiplash from your first and last sentences.

> Students should be able to learn from books, music, film. So should AI training models.

An AI model is a thing. It is owned and fully controlled by some agent. A student is a sentient, thinking being. Both can be trained; only one can be educated. Treating the two as comparable is misleading and, in my view, wrong.

We're in strange new times, but the equivalence of human and synthetic cognition will likely become mainstream and mundane in the coming years.

Sci-fi has long had various "cyborg" type things as a plot element, but if you walk down the street in NYC today you'll pass thousands of people with pacemakers, artificial hips, insulin pumps, colostomy bags, and prosthetics. People who've had laser surgery on their eyes to see better or transplanted organs. Plus people's usage of smart watches that measure heart rate, steps, sleep quality or continuous blood glucose monitors.

We don't marvel at the cyborgs among us; we just accept it as modern medicine. Similarly, we've gotten used to internet search and GPS turn-by-turn navigation. Gen Z and younger will probably just accept the integration of genAI into their everyday life as seamlessly and casually as we accepted our cyborgification.

You can say that an AI model can only "be trained, not educated" in the same way you can argue that a submarine doesn't swim. But does that really matter to any of the people using it?

You are preoccupied with semantics and romantic notions of blurred lines between people and software, rather than the actual reality of what a model is, and who tends to control it. The "people" training models are mostly massive business interests that exist to create profit.

Fine then, let's get rid of software copyrights too. We can copy the AI software, models, datasets all we want. They don't get copyright protection for their software while declaring that everybody else doesn't get copyright protection for their work.

Pointless distinction, you'll never see their code or weights if you just get a response from the API, so the license doesn't matter.