Genuine question: if I train my model with copyleft material, how do you prove I did?

Like if there is no way to trace it back to the original material, does it make sense to regulate it? Not that I like the idea, just wondering.

I have been thinking for a while that LLMs are copyright-laundering machines, and I am not sure if there is anything we can do about it other than accepting that it fundamentally changes what copyright is. Should I keep open sourcing my code now that the licence doesn't matter anymore? Is it worth writing blog posts now that it will just feed the LLMs that people use? etc.

Sometimes, LLMs actually generate copyright headers in their output as well - lol - like in this PR, which was the subject of a recent HN post [1]

https://github.com/ocaml/ocaml/pull/14369/files#diff-062dbbe...

[1] https://news.ycombinator.com/item?id=46039274

I once had a well-known LLM reproduce pretty much an entire file from a well-known React library verbatim.

I was writing code in an unrelated programming language at the time, and the bizarre inclusion of that particular file in the output was presumably because the name of the library was very similar to a keyword I was using in my existing code, but this experience did not fill me with confidence about the abilities of contemporary AI. ;-)

However, it did clearly demonstrate that LLMs with billions or even trillions of parameters certainly can embed enough information to reproduce some of the material they were trained on verbatim or very close to it.

So what? I can probably produce parts of the header from memory. Doesn't mean my brain is GPLed.

The question was "if I train my model with copyleft material, how do you prove I did?"

If your brain was distributed as software, I think it might?

There is a stupid presupposition that LLMs are equivalent to human brains, which they clearly are not. Stateless token generators are OBVIOUSLY not like human brains, even if you somehow contort the definition of intelligence to include them.

Even if they are not "like" human brains in some sense, are they "like" brains enough to be counted similarly in a legal environment? Can you articulate the difference as something other than meat parochialism, which strikes me as arbitrary?

All law is arbitrary. Intellectual property law perhaps most of all.

Famously, the output from monkey "artists" was found to be non-copyrightable [1], even though a monkey's brain is much more similar to ours than an LLM is.

[1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...

If IP law is arbitrary, we get to choose between IP law that makes LLMs propagate the GPL and law that doesn't. It's a policy switch we can toggle whenever we want. Why would anyone want the propagates-GPL option when this setting would make LLMs much less useful for basically zero economic benefit? That's the legal "policy setting" you choose when you basically want to stall AI progress, and it's not going to stall China's progress.

> Genuine question: if I train my model with copyleft material, how do you prove I did?

An inverse of this question is arguably even more relevant: how do you prove that the output of your model is not copyrighted (or otherwise encumbered) material?

In other words, even if your model was trained strictly on copyleft material, if it outputs a copyrighted work when properly prompted, is that copyright infringement, and if so, by whom?
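To make the detection side of that question concrete, here is a minimal sketch of how a rights holder might screen generated text for verbatim reuse: fingerprint every overlapping n-gram of the protected work and measure how much of the model's output lands in that set. All names and the choice of n are illustrative assumptions, not any real tool's API.

```python
# Screen generated text for verbatim overlap with a protected work
# by comparing overlapping word n-grams. Hypothetical helper, not a
# real library; 8-word n-grams are an arbitrary illustrative choice.

def ngram_fingerprints(text, n=8):
    """Return the set of all overlapping n-word sequences in `text`."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(generated, protected, n=8):
    """Fraction of the generated text's n-grams found in the protected work.

    1.0 means every n-gram of the output appears verbatim in the
    protected text; 0.0 means no n-gram overlap at all.
    """
    gen = ngram_fingerprints(generated, n)
    if not gen:
        return 0.0  # output too short to contain any n-gram
    prot = ngram_fingerprints(protected, n)
    return len(gen & prot) / len(gen)
```

A high ratio doesn't settle the legal question, of course; it only flags output worth a human look. Real systems would normalize whitespace, casing, and identifiers first, since trivial edits defeat exact n-gram matching.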

Do not limit your thoughts to text only. "Draw me a cartoon picture of an anthropomorphic mouse with round black ears, red shorts and yellow boots". Does it matter if the training set was all copyleft if the final output is indistinguishable from a copyrighted character?

> even if your model was trained strictly on copyleft material

That's not legal use of the material according to most copyleft licenses, regardless of whether you end up trying to reproduce it. It's also quite immoral, even if it's technically-strictly-speaking-maybe-not-unlawful.

> That's not legal use of the material according to most copyleft licenses.

That probably doesn't matter given the current rulings that training an AI model on otherwise legally acquired material is "fair use", because the copyleft license inherently only has power because of copyright.

I'm sure at some point we'll see litigation over a case where someone attempts to make "not using the material to train AI" a term of the sales contract for something, but my guess would be that if that went anywhere it would be on the back of contract law, not copyright law.

> Genuine question: if I train my model with copyleft material, how do you prove I did?

It may produce it when asked

https://chatgpt.com/share/678e3306-c188-8002-a26c-ac1f32fee4...

> Genuine question: if I train my model with copyleft material, how do you prove I did?

discovery via lawyers

You need low-level access to the AI in question, and a lot of compute, but for most AI types, you can infer whether a given data fragment was in the training set.

It's much easier to do that for the data that was repeated many times across the dataset. Many pieces of GPL software are likely to fall under that.
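The standard form of this is a membership inference test: samples seen during training tend to get unusually low loss (high likelihood) from the model compared to similar material it has never seen. Here is a minimal sketch under the assumption that you can extract per-token log-probabilities from the model; the function names and the one-standard-deviation margin are illustrative, not any established tool's interface.

```python
# Loss-based membership inference sketch. Assumes you can query the
# model for per-token log-probabilities of a given text (requires the
# low-level access mentioned above). Names and threshold are illustrative.

import math

def avg_token_loss(token_logprobs):
    """Average negative log-probability per token (lower = more familiar)."""
    return -sum(token_logprobs) / len(token_logprobs)

def likely_in_training_set(candidate_logprobs, reference_logprobs_list, margin=1.0):
    """Flag a snippet whose loss is suspiciously low.

    candidate_logprobs: per-token log-probs the model assigns to the
        suspected training snippet.
    reference_logprobs_list: log-probs for comparable snippets the model
        has definitely never seen (e.g. freshly written code).
    Returns True if the candidate sits more than `margin` standard
    deviations below the mean loss of the unseen references.
    """
    candidate_loss = avg_token_loss(candidate_logprobs)
    reference_losses = [avg_token_loss(lp) for lp in reference_logprobs_list]
    mean = sum(reference_losses) / len(reference_losses)
    var = sum((l - mean) ** 2 for l in reference_losses) / len(reference_losses)
    return candidate_loss < mean - margin * math.sqrt(var)
```

As the comment above notes, this works best for material repeated many times in the dataset, which amplifies the loss gap; a single rarely-seen file may be statistically indistinguishable from unseen data.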

Now, would that be enough to put the entire AI under GPL? I doubt it.

I've thought about this as well, especially for the case when it's a company owned product that is AGPLed. It's a really tough situation, because the last thing we want is competitors to come in and LLM wash our code to benefit their own product. I think this is a real risk.

On the other side, I deeply believe in the values of free software. My general stance is that all applications I open source are GPL or AGPL, and any libraries I open source are MIT. For the libraries, obviously anyone is free to use them, and if they want to rewrite them with an LLM more power to them. For the applications though, I see that as a violation of the license.

At the end of the day, I have competing values and needs and have to make a choice. The choice I've made for now is that for the vast majority of things, I'm still open sourcing them. The gift to humanity and the guarantee to the users freedom is more important to me than a theoretical threat. The one exception is anything that is truly a risk of getting lifted and used directly by competitors. I have not figured out an answer to this one yet, so for now I'm keeping it AGPL but not publicly distributing the code. I obviously still make the full code available to customers, and at least for now I've decided to trust my customers.

I think this is an issue we have to take week by week. I don't want to let fear of things cause us to make suboptimal decisions now. When there's an actual event that causes a reevaluation, I'll go from there.

It's why I stopped contributing to open source work. It's pretty clear that, in the age of LLMs, this breach of the license under which the code is written will be allowed to continue, and that open source code will be turned into commercial products.

There's the other side of this issue. The current position of the U.S. Copyright Office is that AI output is not copyrightable, because the Constitution's copyright clause only protects human authors. This is consistent with the US position that databases and lists are not copyrightable.[1]

Trump is trying to fire the head of the U.S. Copyright Office, but they work for the Library of Congress, not the executive branch, so that didn't work.[2]

[1] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...

[2] https://apnews.com/article/trump-supreme-court-copyright-off...

> Genuine question: if I train my model with copyleft material, how do you prove I did?

The burden is on you to prove that you didn't.

No, it is not. That is exactly the opposite of how the burden of proof works.

https://en.wikipedia.org/wiki/Burden_of_proof_(law)

Maybe we should require that training data be published, or at least referenced.

> Should I keep open sourcing my code now that the licence doesn't matter anymore?

Your LICENSE matters in much the same way it mattered before LLMs. License adherence is part of intellectual property law and practice. A popular model may get away with ignoring it in some cases, but not in all cases at all times. Do not despair!

genuine question: why are you training your model with content whose requirements you will explicitly violate by doing so?

out of pure spite for hypocritical "hackers"

https://www.penny-arcade.com/comic/2024/01/19/fypm

Anything you produce will be consumed and regurgitated by the machine. It's a personal question for everyone whether you choose to keep providing grist for their mills.