> Training on copyleft licensed code is not a license violation. Any more than a person reading it is.

Some might hold that we've granted persons certain exemptions, on account of them being persons. We do not have to grant machines the same.

> In copyright terms, it's such an extreme transformative use that copyright no longer applies.

Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim? Sure, it can also produce extremely transformed versions, but is that really relevant if it holds within it enough information for a (near-)verbatim reproduction?

> Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim? Sure, it can also produce extremely transformed versions, but is that really relevant if it holds within it enough information for a (near-)verbatim reproduction?

From an information-theoretic standpoint, I don't see how an LLM (which is almost certainly under 1 TB in size) could contain any substantial verbatim portion of its training corpus, which includes audio, images, and video.
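
To put rough numbers on that, here is a back-of-the-envelope sketch. Only the 1 TB model size comes from the comment above. The corpus size and the compression ratio are made-up placeholders, not measurements of any real model, and the function name is hypothetical. It also generously assumes every byte of the model is spent on verbatim storage.

```python
def max_verbatim_fraction(model_bytes: float,
                          corpus_bytes: float,
                          compression_ratio: float = 1.0) -> float:
    """Upper bound on the fraction of a corpus that could be stored losslessly.

    compression_ratio is how many corpus bytes one stored byte can represent
    (1.0 means no compression at all).
    """
    if model_bytes < 0 or corpus_bytes <= 0 or compression_ratio <= 0:
        raise ValueError("sizes and ratio must be positive")
    storable_bytes = model_bytes * compression_ratio
    # A model cannot hold more than the whole corpus, so cap at 100%.
    return min(1.0, storable_bytes / corpus_bytes)


TB = 10**12

MODEL_BYTES = 1 * TB        # the "<1 TB" figure from the comment
CORPUS_BYTES = 100 * TB     # ASSUMPTION: placeholder corpus size
COMPRESSION_RATIO = 4.0     # ASSUMPTION: placeholder lossless ratio

bound = max_verbatim_fraction(MODEL_BYTES, CORPUS_BYTES, COMPRESSION_RATIO)
print(f"At most {bound:.1%} of the corpus could be held verbatim")
```

Under those placeholder numbers the ceiling is a few percent of the corpus. Note that this only bounds the overall fraction; it says nothing about whether particular, heavily repeated items are memorized.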

> We do not have to grant machines the same.

No, we don't have to, but so far we do, because that's the most legally consistent position. If you want to change that, you're going to need to pass new laws that may wind up radically redefining intellectual property.

> Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim?

Of course it has, if the transformation is extreme, as it appears to be here. If I memorize the lyrics to a bunch of love songs, and then write my own love song where every line is new, nobody's going to successfully sue me just because I can sing a bunch of other songs from memory.

Also, it's not even remotely clear that the LLM can produce the training data near-verbatim. Generally it can't, unless it's something that it's been trained on with high levels of repetition.

I want to briefly pick at this:

> you're going to need to pass new laws that may wind up radically redefining intellectual property

You're correct that this is one route to resolving the situation, but I think it's reasonable to lean more strongly into the original intent of intellectual property laws to defend creative works as a manner to sustain yourself. That would draw a pretty clear distinction between human creativity and reuse on the one hand and LLMs on the other.

> into the original intent of intellectual property laws to defend creative works as a manner to sustain yourself

But you're missing the other half of copyright law, which is the original intent to promote the public good.

That's why fair use exists: for the public good. And that's why the main legal argument behind LLM training is fair use -- that the resulting product doesn't compete directly with the originals, and serves the public good.

In other words, if you write an autobiography, you're not losing significant sales because people are asking an LLM about your life.