Hacker News

I know people really hate AI training on their work - but is it really any different than a human reading it?

I know there's a complaint that AI can verbatim repeat that work. But so can human savants. No one is suing human savants for reading their books.

Producing copyrighted material is a problem, of course. Training on copyrighted material... I just don't see it.

EDIT: Making a perfectly valid point, but it's unpopular, so down I go.

Quarondeau 3 days ago [ - ]

There's a huge difference in scale. The human mind can only process a limited portion of all works available over a lifetime. Human learning is therefore naturally limited to small-scale reuse, which serves to keep it proportional.

A machine training on all copyrighted materials in the world for commercial purposes at an industrial scale makes it disproportionate.

qarl 3 days ago [ - ]

I see that as a distinction - but does it make a difference?

If a company hired hundreds of savants, then it would be illegal for them to read books?

I don't follow.

Quarondeau 3 days ago [ - ]

It would hardly make a dent. And if you hired hundreds of savants, the knowledge would still be spread over hundreds of separate minds.

And even if we grant that those savants are also very skilled at creating "market substitutes" based on their training that are capable of competing with the original works, their maximum creative output would only be a relatively small number of new works, because they can only work at human speed.

qarl 3 days ago [ - ]

Ok - but if a company were able to hire one million savants, you feel it should be illegal, because why?

Can you cite something in the copyright laws themselves that suggests this scale distinction?

triceratops 3 days ago [ - ]

Your arguments boil down to "If someone were doing a completely different thing and that's ok, then why isn't this ok?" and "It's not in the text of the law so it's definitely fine."

The one million savants are humans, not machines. Humans get more rights automatically in our world today. That's the moral reason for why your example is not the same. The legal stuff will be worked out in the courts and legislatures of every country in the next 5 years.

Quarondeau 3 days ago [ - ]

This goes back to the original purpose of copyright, which is to serve as an economic incentive for individual creators and artists to make more art, by securing exclusive rights to use their own works commercially for a specified time. The goal is both to encourage the creation of more works and to protect the economic viability of artists.

This principle is quite universal and can be found in many places, including the US Constitution and US (Supreme) Court decisions, many international jurisdictions, treaties and conventions.

qarl 3 days ago [ - ]

But my question is about your point of scale.

I don't understand why it should be allowed for one savant to study and answer questions about one book, but wrong for a company to hire one million savants to answer questions about one million books.

And I'm asking where in the law or case law this is supported.

Quarondeau 3 days ago [ - ]

That's not what I said. The concern is not about "answering questions" about original works. That would probably be acceptable at any scale. It is whether the approach foreseeably results in the creation of market substitutes that compete with original works.

qarl 2 days ago [ - ]

So then you agree with me that training itself is not a violation, it's whether the output is a violation.

jryan49 3 days ago [ - ]

I had to buy the copyrighted material before reading it... Meta apparently operates in a different legal system than me. That's my issue with it.

qarl 3 days ago [ - ]

Yes, I don't dispute that part. What I'm disputing are the arguments that training itself is the problem.

Sarah Silverman's case is the most prominent example.

jryan49 3 days ago [ - ]

I mean the act of reproducing the copyrighted material is what is illegal. LLMs I've used for coding have outputted exact copyrights for code verbatim into my code before. When that happens it feels kind of fishy to be honest.

qarl 3 days ago [ - ]

Yes. I agree. But many people argue that training itself is a copyright violation. That's the position I'm countering here.

thomasahle 3 days ago [ - ]

The human savant will remember where they read it and give you credit. It might lead more people to read your work, and ultimately you make money.

The AI won't even know where the page of text it's seeing came from, and people will avoid your book as they can just ask the AI. So you make less money. (Talking about specialized technical books here.)

qarl 3 days ago [ - ]

Not necessarily.

nancyminusone 3 days ago [ - ]

No one is asking human savants about what they read 1 million times per day.

Suppose they did, and some guy was regularly filling stadiums with people who came to hear him recite an entire audiobook. That would probably get the attention of someone's lawyers.

qarl 3 days ago [ - ]

I don't see your point. The problem is producing the copyrighted work, not processing it beforehand.

If it's illegal for AIs it should be illegal for humans, too. Is that really what you're arguing? It should be illegal for savants to read books?

SahAssar 3 days ago [ - ]

I don't think anyone is arguing that the consumption is illegal. It's the reproduction that is illegal.

Read a book, that's fine. Write a book, that's fine. Read a book and then write a book that is 99.9% the same as the book that you read and sell it for profit without a license from the original author, that's infringement.

qarl 3 days ago [ - ]

No, if you read the article, the point is in the training, not the reproduction.

That's what all these lawsuits are about - it's the training not the reproduction. I already agreed in my first comment that the reproduction is off limits.

In this case, it appears that Meta torrented illegal copies of the work to do the training. Obviously that's bad. But conflating that with training itself doesn't follow.

SahAssar 3 days ago [ - ]

The point of these lawsuits is the piracy. My parent comment was about the general situation, not this specific article.

Pirating content is illegal, regardless of whether it is done to train an LLM.

Usage of LLMs trained on unlicensed content (basically all of them) might or might not be illegal.

Using any method to reproduce a copyrighted work by using that original as input in a way that supplants the market value of the original is probably illegal.

At least that is my rudimentary understanding.

qarl 3 days ago [ - ]

Well - maybe so. But the common belief is that training itself is a violation of copyright, no matter how it's done. That's the argument I'm countering here.

SahAssar 3 days ago [ - ]

The issue is that the trainers have not sought licenses for the data and instead outright pirated it.

I don't think anyone thinks that all training is a copyright violation if all the training data is licensed. For example, an LLM trained on CC0 content would be fine with basically everyone.

The problem is that training happens on data that is not licensed for that use. Some of that data also is pirated which makes it even clearer that it is illegal.

qarl 3 days ago [ - ]

But why should separate licensing be required at all? A search engine reads and indexes every word of every page it crawls. No one argues that requires licensing, only that the outputs must respect copyright. Why should training be different?

SahAssar 3 days ago [ - ]

When Google started outputting summaries, people asked the same questions.

If you supplant the value of the original with the original as input then you probably have some legal questions to answer.

qarl 3 days ago [ - ]

But that's about the output, not the training. We agree: outputs that supplant the original are the problem. A model constrained to produce only fair use outputs causes no such harm — regardless of what it was trained on.

lobf 3 days ago [ - ]

Sharing copyrighted material is illegal. Presumably, if Meta blocked all seeding on the torrents they downloaded, they wouldn't have broken copyright, right?

doublescoop 3 days ago [ - ]

If copyright law doesn't extend to the works being used for training, why should it extend to the model that is produced as a result? AI model creators have set up an ethical scenario where the right thing to do is ignore copyright laws when it comes to AI, which includes model use. It might never be legal, but it has become ethical to pirate models, distill them against ToS, etc.

qarl 3 days ago [ - ]

I'm not sure I follow. Can you say it a different way?

SahAssar 3 days ago [ - ]

I think the parent is basically saying that if you can legally pirate a book to train an LLM, why can't you legally pirate an LLM?

It's a "rules for thee and not for me" argument.

qarl 3 days ago [ - ]

AH. Thank you.

triceratops 3 days ago [ - ]

Training requires making copies. Even if Meta had purchased each work they'd have had to make copies of it to distribute around the training cluster.

qarl 3 days ago [ - ]

Does it though? If they bought a copy for each machine?

triceratops 3 days ago [ - ]

Then no copying happened so they'd be on firmer legal ground.

qarl 3 days ago [ - ]

Good, we're agreed. My only point here is that training is not inherently a copyright violation.

Barrin92 3 days ago [ - ]

>The problem is producing the copyrighted work, not processing it beforehand.

The distinction isn't particularly clear-cut with an open source model. If it is able to reproduce copyright-protected work with high fidelity such that the works produced would be derivative, that's like trying to get around laws against distribution of protected works by handing them to you in a zip file.

It's a kind of copyright washing to hand you the data as a binary blob and an algorithm to extract them out of it. That wouldn't really fly with any other technology.

And that's really where a lot of the value is, mind you: these models are best thought of as lossily compressed versions of their input data. Otherwise Facebook ought to be perfectly fine training them on public domain data.

qarl 3 days ago [ - ]

I tend to agree - but you assume that it would not be possible to create a model that can train on copyrighted work and only output text which would be considered fair use.

That seems very possible to me, and undermines the "training is copyright violation" argument. It's not the training, it's the output.

grebc 3 days ago [ - ]

It’s different.

qarl 3 days ago [ - ]

Hm. I'm not sure I follow your logic.

grebc 3 days ago [ - ]

You asked, I answered.

If you’re struggling to comprehend that a person reading a book is different then you’re a bad bot.

qarl 2 days ago [ - ]

It's a shame that rudeness is so prevalent on this platform now.

fantasizr 3 days ago [ - ]

reading it after stealing it: gray area. producing & monetizing competing works devaluing the original is a problem

qarl 3 days ago [ - ]

So is it a problem when humans produce and monetize competing works? My understanding is that there is quite an industry in humans reading books and synthesizing their points. Cliff's Notes, for example.

fantasizr 3 days ago [ - ]

I did some quick googling and most Cliff's Notes guides are on public domain works, so no problem there. They've also paid to license content, and have also been protected by fair use as parody.

qarl 3 days ago [ - ]

To Kill a Mockingbird, The Catcher in the Rye, Beloved, The Kite Runner, The Handmaid's Tale are all copyrighted works with a Cliff's Notes guide.

NoOn3 3 days ago [ - ]

Why should an AI have the same rights as a human?

How about granting AI all other rights then, for example, allowing it to vote? (sarcasm)

qarl 3 days ago [ - ]

We're not talking about rights, we're talking about illegal acts. If it's illegal for a machine to do it, how can it be ok for a human?

Just from a rational argumentation point of view. Clearly if a law is written saying as much, then sure. But there is no copyright law like that yet.

NoOn3 3 days ago [ - ]

The issue is certainly not so simple. But it seems to me, purely theoretically, that the rules don't necessarily have to be the same for living people and non-living machines.

qarl 3 days ago [ - ]

Well - actually - it is pretty simple. For something to be illegal, there must be a law saying it's illegal. There are no laws distinguishing humans from machines in copyright law.

triceratops 3 days ago [ - ]

> There are no laws distinguishing humans from machines in copyright law

Correct. Because until very recently there was no need.

qarl 3 days ago [ - ]

AH. So you agree that it's not illegal.

triceratops 3 days ago [ - ]

What isn't?

qarl 2 days ago [ - ]

I'm just happy you agree with me.

triceratops 2 days ago [ - ]

I don't agree with most of what you've said in this discussion. I couldn't have been clearer about that in my other replies. The only part I did agree on was a hypothetical that hasn't happened.

pkaeding 3 days ago [ - ]

But machines don't do things. People do things, and they use tools/machines to do those things more easily or efficiently.

qarl 3 days ago [ - ]

My apologies - I'm speaking loosely of course. Translate all my claims about machines breaking the law into claims about humans using machines to break the law.

pkaeding 3 days ago [ - ]

Sorry, I wasn't trying to be pedantic. I was trying to make the point (which I think is in line with your point) that the fact that AI is involved here doesn't make a difference. It is a tool, but the people using the tool are (as always) responsible for the outcome.

triceratops 3 days ago [ - ]

> I know people really hate AI training on their work - but is it really any different than a human reading it?

Yes it's very different. Humans need to eat, sleep, and pay taxes. You also have to pay them competitive wages.

qarl 3 days ago [ - ]

I'm not sure your argument is supported by the actual law as written.

triceratops 3 days ago [ - ]

https://news.ycombinator.com/item?id=48029673

There's nothing in the law to support your argument either. The law however does say, very unambiguously, that copying without permission isn't allowed. There aren't exceptions for "training" just because it's superficially similar to a human activity (reading a book). A human isn't allowed to hand-copy Harry Potter. Even if they bought all the Harry Potter books.

qarl 3 days ago [ - ]

Yes. But training is not copying.

triceratops 3 days ago [ - ]

We already covered this: https://news.ycombinator.com/item?id=48029085