To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model on it falls under fair use, at least in the US and some other jurisdictions.
If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if anyone ever brought it to court.
It depends on the license terms: if you obtained the material legally only by agreeing to a license that forbids training, then using it for that purpose would not be legal.
But this is all grey area… https://www.authorsalliance.org/2023/02/23/fair-use-week-202...
This is at least murky, since a lot of pirated material is “publicly available”. Certainly some has ended up in the training data.
It isn't? You have to break the law to get it. It's publicly available like your TV is if I were to break into your house and avoid getting shot.
That isn't even remotely a sensible analogy. Equating copyright violation with stealing physical property is an extremely failed metaphor.
One of the craziest experiences in this "post-AI" world is seeing how quickly a lot of people in the "information wants to be free" or "hell yes I would download a car" crowds pivoted to "stop downloading my car, just because it's on a public and openly available website doesn't make it free".
Maybe you have some legalistic point that escapes comprehension, but I certainly consider my house to be private and the internet public.
I wouldn't say this is settled law, but it looks like this is one of the likely outcomes. It might not be possible to write a license to prevent training.
Fair use was meant for citing and so on, not for ripping off 100% of the content.
Copyright protects the expression of an idea, not the idea itself. Therefore, an LLM transforming concepts it learned into a response (a new expression) would hardly qualify as copyright infringement in court.
This principle is also explicitly declared in US law:
> In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work. (Section 102 of the U.S. Copyright Act)
https://www.copyrightlaws.com/are-ideas-protected-by-copyrig...
So if you put this hypothetical license on spam emails, then spam filters can't train to recognize them? I'm sure ad companies would LOVE it.
Fair use doesn’t need a license, so it doesn’t matter what you put in the license.
Generally speaking, licenses give rights (they literally grant license). They can't take rights away; only the legislature can do that.
Wouldn't it be still legal to train on the data due to fair use?
I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default licence that prohibits absolutely everything, humanity minus one considers it fair use.
Honest question: why don’t you think it is fair use?
I can see how it pushes the boundary, but I can't lay out the logic that it's not. The code has been published for the public to see. I'm always allowed to read it, remember it, and tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn't they have kept it to themselves?
These agents are just doing a more sophisticated, faster version of that same act.
Some projects, like Wine, forbid you from contributing if you have ever seen the source of MS Windows [1]. The meatball inside your head is tainted.
I don't remember the exact case now, but someone was cloning a program (Lotus 1-2-3 -> Quattro or Excel???). They printed every single screen and made a team write a full specification in English. Later, a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain-English-text loophole was safe enough.
[1] From https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...
> Who can't contribute to Wine?
> Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise). There are some exceptions for the source code of add-on components (ATL, MFC, msvcrt); see the next question.
> I don't remember the exact case now, but someone was cloning a program (Lotus 1-2-3 -> Quattro or Excel???). They printed every single screen and made a team write a full specification in English. Later, a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain-English-text loophole was safe enough.
This is close to how I would actually recommend reimplementing a legacy system (one owned by the re-implementer) with AI SWE. Not to avoid copyright, but to get the AI to build up everything it needs to maintain the system over a long period of time. The separate team is just a new AI instance whose context doesn't contain the legacy code (because that would pollute the new result). The analogy isn't too apt, though, since there is a difference between having something in your context (which you can control and is very targeted) and the code the model was trained on (which all AI instances will share unless you use different models, and which, in any case, isn't supposed to be targeted).
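A minimal sketch of that two-phase flow, assuming a hypothetical call_model helper (a stand-in for whatever LLM client you use, not any real API). The key property is that phase two starts from a fresh context that never sees the legacy source:

    # Hypothetical clean-room-style reimplementation in two isolated phases.
    # call_model must start a fresh model context on every call, so nothing
    # from phase one's context (the legacy source) leaks into phase two.
    def call_model(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def reimplement(legacy_source: str) -> str:
        # Phase 1: one instance reads the legacy code and emits only a
        # plain-English spec (behavior, inputs, outputs, edge cases).
        spec = call_model(
            "Write a complete plain-English specification of this program's "
            "observable behavior. Do not include any code.\n\n" + legacy_source
        )
        # Phase 2: a separate instance sees only the spec, never the code,
        # so the old implementation cannot contaminate the new one.
        return call_model(
            "Implement a program that satisfies this specification:\n\n" + spec
        )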
Before LLMs, programmers had a pretty good intuition about what the GPL allowed. It is of course clear that you cannot release a closed-source program with GPL code integrated into it. I think it was also quite clear that you cannot legally incorporate GPL code into such a program by making changes here and there, renaming some stuff, and moving things around, yet this is pretty much what LLMs are doing. When humans do it intentionally, it is a violation of the license; when it is automated and done on a huge scale, is it really fair use?
> this is pretty much what LLMs are doing
I think this is the part where we disagree. Have you used LLMs, or is this based on something you read?
Do you honestly believe there are people on this board who haven't used LLMs? Ridiculing someone you disagree with is a poor way to make an argument.
Lots of people on this board are philosophically opposed to them, so it was a reasonable question, especially in light of your description of them.
Just corporations, their shills, and people who think LLMs are god's gift to humanity disagree with you.
By that logic, humans would also be prevented from “training” on (i.e. learning from) such code. Hard to see how this could be a valid license.
Isn’t it the very reason why we need cleanroom software engineering:
https://en.wikipedia.org/wiki/Cleanroom_software_engineering
If a human reads code and then reproduces said code, that can be a copyright violation. But you can read the code, learn from it, and produce something totally different. The middle ground, where you read code and produce something similar, is a grey area.
Bad analogy, probably made up by capitalists to confuse people. ML models cannot and do not learn. "Learning" is the name of a process in which a model developer downloads pirated material and processes it with an algorithm (computes parameters from it).
Also, humans do not need to read millions of pirated books to learn to talk. And a human artist doesn't need to steal millions of pictures to learn to draw.
> And a human artist doesn't need to steal millions of pictures to learn to draw.
They... do? Not just pictures, but also real-life data, which is a lot more data than an average modern ML system gets. An average artist has probably seen (sorry, stolen) millions of pictures in their social media feeds over their lifetime.
Also, you claim to be anti-capitalist while defending one of the most offensive types of private property there is. The whole point of anti-capitalism is being against private property. And copyright is private property because it gives you power over others. You must be against copyright, and against the concept of "stealing pictures", if you are to be an anti-capitalist.
How is that enforceable against the fly-by-night startups?
Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.
My next project will be released under a GPL-like license with exactly this condition added: if you train a model on this code, the model must be open source and open weights.
In light of the fact that the courts have found training an AI model to be fair use under US copyright law, it seems unlikely this condition will have any actual relevance to anyone. You're probably going to need to not publicly distribute your software at all, and make such a condition a term of the initial sale. Even there, it's probably going to be a long haul to get that to stick.
Not sure why the FSF or some other organization didn't release a license like this years ago.
Because it would violate freedom zero. Adding such terms to the GNU GPL would also mean that you can remove them, they would be considered "further restrictions" and can be removed (see section 7 of the GNU GPL version 3).
Freedom 0 is not violated. GPL includes restrictions for how you can use the software, yet it's still open source.
You can do whatever you want with the software, BUT you must do a few things. For the GPL it's keeping the license, distributing the source, etc. Why can't we have a different license with the same kind of restrictions, but also "models trained on this licensed work must be open source"?
Edit: Plus the license would not be "GPL+restriction" but a new license altogether, which includes the requirements for models to be open.
That is not really correct; the GNU GPL doesn't have any terms whatsoever on how you can use or modify the program. You're free to make a GNU GPL program do anything (i.e., use it however you wish).
I suggest a careful reading of the GNU GPL, or the definition of Free Software, where this is carefully explained.
> You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
"A work based on the program" can be defined to include AI models (just define it, it's your contract). "All of these conditions" can include conveying the AI model in an open source license.
I'm not restricting your ability to use the program/code to train an AI. I'm imposing conditions (the same as the GPL does for code) onto the AI model that is derivative of the licensed code.
Edit: I know this may not be the best section to cite (the one after it, regarding non-source forms, could be better), but in spirit it's exactly the same, imo, as the GPL forcing you to keep the GPL license on the work.
I think maybe you're mixing up distribution and running a program, at least taking your initial comment into account, "if you train/run/use a model, it must be open source".
I should have been more precise: "If you train and distribute an AI model on this work, it must use the same license as the work".
Using the AGPL as the base instead of the GPL (so that network access counts as distribution), any user of the software would have the right to the source code and weights of the AI model.
My goal is not to impose more restrictions on the AI maker, but to guarantee rights to the users of software that was trained on my open source code.
It isn't that difficult: a license that restricts how the program can be used is a non-free software license.
"The freedom to run the program as you wish, for any purpose (freedom 0)."
Yet the GPL imposes requirements for me and we consider it free software.
You are still free to train on the licensed work, BUT you must meet the requirements (just like the GPL), which would include making the model open source/weight.
Running the program and analyzing the source code are two different things...?
In the context of Free Software, yes. Freedom one is about the right to study a program.
But training an AI on a text is not running it.
And distributing an AI model trained on that text is neither distributing the work nor a modification of the work, so the GPL (or other) license terms don't apply. As it stands, the courts have found training an AI model to be sufficiently transformative and fair use, which means the resulting output of that training is not a "copy" for the purposes of copyright law.
Model weights, source, and output.
We need a ruling that LLM generated code enters public domain automatically and can't be covered by any license.
It's more or less already the case though. Pure AI-generated works without human touches are not copyrightable.
We need it to infect the rest, like the GPL does.
You probably misunderstand how GPL "infection" works (a very common misconception).
If your closed-source project uses some GPL code, that doesn't automatically put your whole project in the public domain or under the GPL. It just means you're infringing the rights of the code's author, and they can sue you (for damages and to stop you using their code, not to force your whole project under the GPL).
In the simplest terms, GPL is:

    if codebase.is_gpl_compatible:
        gpl_code.give_permission(codebase)
    elif codebase.is_using(gpl_code):
        # the copyright owner and the courts deal with this under
        # ordinary copyright law
        raise CopyrightInfringement

GPL can't do much more than that. A license over a piece of code cannot automatically change the copyright status of another piece of code. There simply isn't a legal framework for that.

Similarly, AI code's copyleft status can't affect the rest of the codebase, unless we make new laws specifically saying so.
Also similarly, even if GitHub loses the class action, the model will NOT automatically be released to the public under the GPL. It will open the possibility for all the GPL repo authors to ask Microsoft for compensation for stealing their code.
But then we would need a way to prove that some code was LLM generated, right?
Like if I copy-paste GPL-licensed code, the way you realise I copy-pasted it is that 1) you can see it and 2) the GPL-licensed code exists. But when code is LLM-generated, it is "new". If I claim I wrote it, how would you dispute that?
Laws exist to protect those who make and have money. If trillions could be made harvesting your kids' kidneys, it would be legal.
It's done extrajudicially in war zones such as Palestine, where hostages are returned from Israeli jails with missing organs, dead or alive [0].
[0] https://factually.co/fact-checks/justice/evidence-investigat...