A human reading a unit of work is not a “copy”. I’m pretty sure our legal systems agree that neither thought nor sight counts as copying something.

Training an LLM inherently requires making a copy of the work. Even the initial act of downloading it from the internet and loading it into memory to train the LLM creates a copy that can be governed by its license and copyright law.

I think you are confusing two different meanings of the word ‘copy’. The fact that a computer loads it into memory does not make it automatically a ‘copy’ in the copyright sense.

> The fact that a computer loads it into memory does not make it automatically a ‘copy’ in the copyright sense.

IIRC this exact argument was made in the Blizzard vs bnetd case, wasn't it? Though I can't find confirmation on whether that argument was rejected or not...

It absolutely does, both in law and in the courts:

> The court held that making RAM copies as an essential step in utilizing software was permissible under §117 of the Copyright Act even if they are used for a purpose that the copyright holder did not intend.

https://en.wikipedia.org/wiki/Vault_Corp._v._Quaid_Software_....

> Training an LLM inherently requires making a copy of the work.

But that's not relevant here, because the copyleft license does not prohibit that (and it's not even clear that any license can prohibit it, since courts may confirm it's fair use, as most people currently assume). That's why I noted under (1) that it's not applicable here.

It's absolutely prohibited to copy and redistribute for commercial purposes materials that you're unlicensed to do so with. This isn't an issue in the copyleft scenario (though it may impose transitive licensing requirements on the copier that LLM operators don't want to follow), but it is a huge issue that has come up with LLM training.

LLM training involves ingesting works (in a potentially transformative process) and partially reproducing them, which is generally a restricted action when it comes to licensing.

> It's absolutely prohibited to copy and redistribute for commercial purposes materials that you're unlicensed to do so with.

Sure, but that's not what LLMs generally do, and it's certainly not what they're intended to do.

The LLM companies, and many other people, argue that training falls under fair use. One element of fair use is whether the purpose and character of the use is sufficiently transformative, and they argue that turning texts into weights, without even a remote one-to-one correspondence, is exactly such a transformation.

And this is why LLM companies ensure that partial reproduction doesn't happen during LLM usage, using a kind of copyrighted-text filter as a last check in case anything would unintentionally get through. (And it doesn't even tend to occur in the first place, except when the LLM is trained on a bunch of copies of the same text.)

Yeah, at the end of the day a big part of this question comes down to whether that copying is fair use. That is an open question, with the transformative nature being the primary point in favor of the LLM. But it is copying from some works into another: if there is no fair use exception, it is absolutely violating the licensing of most of the training data. It's a bit different from previously settled case law because it copies so little from so many billions of different things. I think it's wise for LLM companies to block reproduction for PR purposes, but that doesn't guarantee that training is a license-exempt activity.

Yup. Of course it's copying. But all expectations are that courts will rule that fair use allows such copying, because of the nature of the transformation.