> As a programmer, I want to write more open source than ever, now.
I want to write less. Just knowing that LLMs are going to be trained on my code makes me feel more strongly than ever that my open source contributions will simply be stolen.
Am I wrong to feel this? Is anyone else concerned about this? We've already seen some pretty strong evidence of this with Tailwind.
I feel similarly for a different reason. I put my code out there, licensed under the GPL. It is now, through a layer of indirection, being used to construct products that are not under the GPL. That's not what I signed up for.
I know the GPL didn't include a specific clause for AI, and the jury is still out on this specific case (how similar is it to a human doing the same thing?), but I like to imagine that, had it been written today, there would probably be a clause covering this usage. Personally I think it's a violation of the spirit of the license.
Yep, this is my take as well. It's not that open source is being stolen as such (if you abide by an open source license you aren't stealing anything); it's that the licenses are being completely ignored for the profit of a few massive corporations.
Yeah, that's what I meant by "stolen", I should have been clearer. But indeed, this is the crux of the problem, I have no faith that licenses are being abided by.
What profit? All labs are taking massive losses and there's no clear path to profit for most of them yet.
The wealthiest people in tech aren't spending tens of billions on this without the expectation of future profits. There's risk, but they absolutely expect the bets to be +EV overall.
Expected profit.
GPL works via copyright. Since AI companies claim fair use, no copyright applies. There is no fixing this. The only option is not to publish.
There are non-US jurisdictions where you have some options, but since most of them are trained in the US that won't help much.
> Since AI companies claim fair use, no copyright applies. There is no fixing this.
They can claim whatever they want. You can still try to stop it via lawsuits and make them argue it in court. Granted, I believe some courts have already sided with fair use in those particular cases.
Laws can be changed. This is right now a trillion dollar industry, perhaps later it could even become a billion dollar industry. Either way, it's very important.
Strict copyright enforcement is a competitive disadvantage. Western countries lobbied for copyright enforcement in the 20th century because it was beneficial. Now the tables have turned; don't hold your breath for copyright enforcement against the wishes of the markets. We are all China now.
Yes, I think Japan added an AI friendly copyright law. If there were problems in the US, they'd just move training there.
Moving training won't help them if their paying customers are in jurisdictions which do respect copyright as written and intended.
OP's idea is about having a new GPL-like license with a "may not be used for LLM training" clause.
It is probably already the law that the LLM itself is not allowed to produce copyrighted work (e.g. straight copies of works, or output that is too structurally similar) without a license for that work. The companies work around this via content filters, and they probably also run checks during/after training to ensure the model does not reproduce work that is too similar. If I remember correctly, there are lawsuits pending about this, e.g. with the New York Times.
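To be clear, I have no idea what checks the labs actually run; but as a purely hypothetical sketch, even something as crude as word n-gram overlap against a known work would flag near-verbatim reproduction:

    # Toy "too similar" check (hypothetical, not anything a real lab necessarily uses):
    # flag generated text that shares too many word 5-grams with a known work.
    def ngrams(text, n=5):
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def too_similar(generated, reference, n=5, threshold=0.5):
        gen, ref = ngrams(generated, n), ngrams(reference, n)
        return bool(gen) and len(gen & ref) / len(gen) > threshold

    # A content filter could then refuse to emit any completion for which
    # too_similar(completion, known_work) returns True.

The real systems are surely far more sophisticated, but the point stands: such checks only target reproduction that is close to verbatim.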
The issue is that everyone is focusing on verbatim (or "too similar") reproduction.
LLMs themselves are compressed models of the training data. The trick is that the compression is highly lossy, achieved by detecting higher-order patterns instead of focusing on the first-order input tokens (or bytes). If you look at how, for example, any of the Lempel-Ziv algorithms work, they also contain patterns from the input and they also predict the next token (usually a byte in their case), except they do it with 100% probability because they are lossless.
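To make the compression analogy concrete, here's a toy sketch (mine, purely illustrative - not real Lempel-Ziv and certainly not a real language model). The "lossless" view remembers its input exactly and predicts continuations by recall; the "lossy" view keeps only counts and turns them into probabilities:

    # Toy contrast between lossless recall and lossy statistical prediction.
    from collections import Counter, defaultdict

    training = "the cat sat on the mat the cat ate".split()

    # Lossless view: remember every bigram exactly; "prediction" is pure recall.
    seen = defaultdict(set)
    for a, b in zip(training, training[1:]):
        seen[a].add(b)
    print(seen["cat"])  # e.g. {'sat', 'ate'} -- exact contents of the input

    # Lossy view: keep only counts and turn them into a probability distribution.
    counts = defaultdict(Counter)
    for a, b in zip(training, training[1:]):
        counts[a][b] += 1
    total = sum(counts["the"].values())
    print({w: c / total for w, c in counts["the"].items()})  # roughly {'cat': 0.67, 'mat': 0.33}

Both contain patterns extracted from the input; the second just forgets enough that you can no longer point at which exact sentence any particular number came from.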
So copyright should absolutely apply to the models themselves and if trained on AGPL code, the models have to follow the AGPL license and I have the right to see their "source" by just being their user.
And if you decompress a file from a copyrighted archive, the file is obviously copyrighted. Even if you decompress only a part. What LLMs do is another trick: by being lossy, they decompress probabilistically based on all the training inputs, and without seeing the internals, nobody can prove how much their particular work contributed to a particular output.
But it is all mechanical transformation of input data, just like synonym replacement, just more sophisticated, and the same rules regarding plagiarism and copyright infringement should apply.
---
Back to what you said - the LLM companies use fancy language like "artificial intelligence" to distract from this so they can then use more fancy language to claim copyright does not apply. And in that case, no license would help, because any such license fundamentally depends on copyright law, which they claim does not apply.
That's the issue with LLMs - if they get their way, there's no way to opt out. If there was, AGPL would already be sufficient.
I agree with your view. One just has to go into courts and somehow get the judges to agree as well.
An open question would be if there is some degree of "loss" where copyright no longer applies. There is probably case law about this in different jurisdictions w.r.t. image previews or something.
I don't think copyright should be binary or should work the way it does now. It's just the only tool we have.
There should be a system which protects all work (intellectual and physical) and makes sure the people doing it get rewarded according to the amount of work and skill level. This is a radical idea and not fully compatible with capitalism as implemented today. I have a lot on my to-read list and I don't think I am the first to come up with this but I haven't found anyone else describing it, yet.
And maybe it's broken by some degenerate case and goes tits up like communism always did. But AFAICT, it's a third option somewhere in between, taking the good parts of each.
For now, I just wanna find ways to stop people already much richer than me from profiting from my work without any kind of compensation for me. I want inequality to stop worsening, but OTOH, in the past, large social change usually happened when things got so bad people rejected the status quo and went to the streets, whether with empty hands or not. And that feels like where we're headed and I don't know whether I should be excited or worried.
I recall a basics of law class saying that in some countries (e.g. Czech Republic), open source contributors have the right to small compensation if their work is used to a large financial benefit.
At some point, I'll have to look it up because if that's right, the billionaires and wannabe-trillionaires owe me a shitton of money.
One work-around would be to legislate that code produced by an LLM trained on GPL code would also be GPL.
There are licenses that are incompatible with each other, which implies that one wouldn’t be allowed to train LLMs on code based on multiple such licenses.
Sounds reasonable to me - much the same way that building a project from multiple incompatible licenses wouldn't be allowed. The alternative is that using an LLM could just be an end-run around the choice of license that a developer used.
Copyright normally only applies when you’re plagiarizing. LLM output typically isn’t that. It’s more like someone having studied multiple open source projects with incompatible licenses and coding up their own version of them, which is perfectly fine. So your “workaround” is overshooting things by far, IMO.
My understanding is that LLMs are plagiarising openly available code - it's not like the code is used to inspire a person as that involves creative thinking. I'm thinking that taking a piece of code and applying a transformation to it to make it look different (e.g. changing variable/function names) would be still considered plagiarism. In the case of the GPL, I think it would be entirely appropriate for a GPL trained LLM to be required to license its code output as GPL.
I suppose the question is: when does a machine-applied transformation become a new work?
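For a concrete (entirely hypothetical) example of what I mean by a mechanical transformation: renaming every identifier in a snippet takes a handful of lines and adds no creative work of its own.

    # Toy renamer: mechanically replace identifiers so the code "looks different".
    import ast, builtins

    BUILTINS = set(dir(builtins))

    class Rename(ast.NodeTransformer):
        def __init__(self):
            self.mapping = {}

        def _new(self, old):
            return self.mapping.setdefault(old, "v%d" % len(self.mapping))

        def visit_FunctionDef(self, node):
            node.name = self._new(node.name)
            self.generic_visit(node)
            return node

        def visit_arg(self, node):
            node.arg = self._new(node.arg)
            return node

        def visit_Name(self, node):
            if node.id not in BUILTINS:  # leave max/min/etc. untouched
                node.id = self._new(node.id)
            return node

    source = "def clamp(value, low, high):\n    return max(low, min(value, high))"
    print(ast.unparse(Rename().visit(ast.parse(source))))
    # def v0(v1, v2, v3):
    #     return max(v2, min(v1, v3))

Nobody would call that output a new work even though it shares no identifiers with the original; I'd argue the same logic should extend to more sophisticated, statistical transformations.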
The argument of the AI megacorps is that generated work is not "derivative" and therefore doesn't interact with the original author's copyright. They have invented a machine that takes in copyrighted works and, from a legal standpoint, produces "entirely original" code. No license, be that GPL or otherwise, can do anything about that, because licenses ultimately rely on the author's copyright to require the licensee to observe the license.
They cannot violate the license, because in their view they have not licensed anything from you.
I think that's horse shit, and a clear violation of the intellectual property rights that are supposed to protect creatives from the business boys, but apparently the stock market must grow.
What makes this whole thing even weirder for me is the related fact that output from AI might not enjoy copyright protection. So basically, if you can steal software made with AI, you can freely resell it.
During the gold rush, it is said, the only people who made money were the ones selling the pickaxes. A"I" companies are ~selling~ renting the pickaxes of today.
(I didn't come up with this quote but I can't find the source now. If anything good comes out of LLMs, it's making me appreciate other people's work more and try to give credit where it's due.)
Wasn't it shovels?
NVidia is a shovel-maker worth a few trillion dollars...
What about the people who sold gold? Didn't they make money?
To be honest, I haven't looked at any statistics, but I imagine a tiny few of those looking for gold found any and got rich, while most either didn't find anything, died of illness or exposure, or got robbed. I just like the quote as a comparison. Updated the original comment to reflect that I haven't checked if it's correct.
Now imagine how much more that sucks for artists and designers that were putting artwork out there to advertise themselves only to have some douchebag ingest it in order to sell cheap simulacra.
If you want, I made a coherent argument about how the mechanics of LLMs mean both their training and inference is plagiarism and should be copyright infringement.[0] TL;DR it's about reproducing higher order patterns instead of word for word.
I haven't seen this argument made elsewhere, it would be interesting to get it into the courtrooms - I am told cases are being fought right now but I don't have the energy to follow them.
Plus, as somebody else put it eloquently, it's labor theft - we, working programmers, exchanged our limited lifetime for money (already exploitative) in a world with certain rules. Now the rules have changed, our past work has much more value, and we don't get compensated.
[0]: https://news.ycombinator.com/item?id=46187330
The first thing you need to do is brush up on some IP law around software in the United States. Start here:
https://en.wikipedia.org/wiki/Idea–expression_distinction
https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...
https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
In a court of law you're going to have to argue that something is an expression instead of an idea. Most of what LLMs pump out are almost definitionally on the idea side of the spectrum. You'd basically have to show verbatim code or class structure at the expressive level to the courts.
Thanks for the links, I'll read them in more detail later.
There are a couple of issues I see:
1) All of the concepts were developed with the idea that only humans are capable of certain kinds of work needed for producing IP. A human would not engage in highly repetitive and menial transformation of other people's material to avoid infringement if he could get the same or better result by working from scratch. This placed, throughout history, an upper limit on how protective copyright had to be.
Say, 100 years ago, synonym replacement and paraphrasing of sentences were SOTA methods to make copies of a book which don't look like copies without putting in more work than the original. Say, 50 years ago, computers could do synonym replacement automatically so it freed up some time for more elaborate restructuring of the original work and the level of protection should have shifted. Say, 10 years ago, one could use automatic replacement of phrases or translation to another language and back, freeing up yet more time.
The law should have adapted with each technological step up and according to your links it has - given the cases cited. It's been 30 years and we have a massive step up in automatic copying capabilities - the law should change again to protect the people who make this advancement possible.
Now with a sufficiently advanced LLM trained on all public and private code, you can prompt them to create a 3D viewer for Quake map files and I am sure it'll most of the time produce a working program which doesn't look like any of the training inputs but does feel vaguely familiar in structure. Then you can prompt it to add a keyboard-controlled character with Quake-like physics and it'll produce something which has the same quirks as Quake movement. Where did bunny hopping, wallrunning, strafing, circlejumps, etc. come from if it did not copy the original and the various forks?
Somebody had to put in creative work to try out various physics systems and figure out what feels good and what leads to interesting gameplay.
Now we have algorithms which can imitate the results but which can only be created by using the product of human work without consent. I think that's an exploitative practice.
2) It's illegal to own humans but legal to own other animals. The USA law uses terms such as "a member of the species Homo sapiens" (e.g. [0]) in these cases.
If the tech in question was not LLMs but remixing of genes (using only a tiny fraction of human DNA) to produce animals which are as smart as humans with chimpanzee bodies, which can be incubated in chimpanzee females but are otherwise as sentient as humans, would (and should) it be legal to own them as slaves and use them for work? It would probably be legal by the current letter of the law, but I assure you the law would quickly change because people would not be OK with such overt exploitation.
The difference is that the exploitation by LLM companies is not as overt - in fact, many people refer to LLMs as AIs and use pronouns such as "he" or "she", indicating they believe them to be standalone thinking entities instead of highly compressed lossy archives of other people's work.
3) The goal of copyright is progress, not protection of people who put in work to make that progress possible. I think that's wrong.
I am aware of the "is" vs "should" distinction but since laws are compromises between the monopoly in violence and the people's willingness to revolt instead of being an (attempted) codification of a consistent moral system, the best we can do is try to use the current laws (what is) to achieve what is right (what should be).
[0]: https://en.wikipedia.org/wiki/Unborn_Victims_of_Violence_Act
But "vaguely familiar in structure" could be argued to be the only reasonable way to do something, depending on the context. This is part of the filtration step in AFC.
The idea of wallrunning should not be protected by copyright.
The thing is a model trained on the same input as current models except Quake and Quake derivatives would not generate such code. (You'd have to prompt it with descriptions of quake physics since it wouldn't know what you mean, depending on whether only code or all mentions were excluded.)
The Quake special behaviors are essentially the result of bugs which were kept because they led to fun gameplay. The model would almost certainly generate explicit handling for these behaviors, because the original Quake code is very obviously not the only reasonable way to do it. And in that case the model and its output are derivative works of the training input.
The issue is such an experiment (training a model with specific content excluded) would cost (tens/hundreds of?) millions of dollars and the only companies able to do it are not exactly incentivized to try.
---
And then there's the thing that current LLMs are fundamentally impossible to create without such large amounts of code as training data. I honestly don't care what the letter of the law is, to any reasonable person, that makes them derivative work of the training input and claiming otherwise is a scam and theft.
I always wonder if people arguing otherwise think they're gonna get something out of it when the dust settles or if they genuinely think society should take stuff from a subgroup of people against their will when it can to enrich itself.
“Exploitative” is not a legal category in copyright. If the concern is labor compensation or market power, that’s a question for labor law, contract law, or antitrust, not idea-expression analysis and questions of derivative works.
There was a legal analysis of the copyright implications of Copilot among a set of white papers commissioned by the Free Software Foundation: https://www.fsf.org/licensing/copilot/copyright-implications...
And HN does its thing again - at least 3 downvotes, 0 replies. If you disagree, say why, otherwise I have to assume my argument is correct and nobody has any counterarguments but people who profit from this hate it being seen.
I agree that training on copyrighted material is violating the law, but not for the reasons you stated.
That said, this comment is funny to me because I’ve done the same thing too, take some signal of disagreement, and assume the signal means I’m right and there’s a low-key conspiracy to hold me down, when it was far more likely that either I was at least a bit wrong, or said something in an off-putting way. In this case, I tend to agree with the general spirit of the sibling comment by @williamcotton in that it seems like you’re inventing some criteria that are not covered by copyright law. Copyrights cover the “fixation” of a work, meaning they protect only its exact presentation. Copyrights do not cover the Madlibs or Cliff Notes scenarios you proposed. (Do think about Cliff Notes in particular and what it implies about AI - Cliff Notes are explicitly legal.)
Personally, I’ve had a lot of personal forward progress on HN when I assume that downvotes mean I said something wrong, and work through where my own assumptions are bad, and try to update them. This is an important step especially when I think I’m right.
I’m often tempted to ask for downvote explanations too, but FWIW, it never helps, and aside from HN guidelines asking people to avoid complaining about downvotes, I find it also helps to think of downvotes as symmetric to upvotes. We don’t comment on or demand an explanation for an upvote, and an upvote can be given for many reasons - it’s not only used for agreement, it can be given for style, humor, weight, engagement, pity, and many other reasons. Realizing downvotes are similar and don’t only mean disagreement helps me not feel personally attacked, and that can help me stay more open to reflecting on what I did that is earning the downvotes. They don’t always make sense, but over time I can see more places I went wrong.
> or said something in an off-putting way
It shouldn't matter.
Currently, downvote means "I want this to be ranked lower". There really should be 2 options "factually incorrect" and "disagree". For people who think it should matter, there should be a third option, "rude", which others can ignore.
I've actually emailed about this with a mod and it seems he conflated talking about downvotes with having to explain a reason. He also told me (essentially) people should not have the right to defend themselves against incorrect moderator decisions and I honestly didn't know what to say to that, I'll probably message him again to confirm this is what he meant but I don't have high hopes after having similar interactions with mods on several different sites.
> FWIW, it never helps
The way I see it, it helped since I got 2 replies with more stuff to read about. Did you mean it doesn't work for you?
> downvotes as symmetric to upvotes
Yes, and we should have more upvote options too. I am not sure the explanation should be symmetric though.
Imagine a group conversation in which somebody lies (the "factually incorrect" case here). Depending on your social status within the group and group politics, you might call out the lie in public, in private with a subset or not at all. But if you do, you will almost certainly be expected to provide a reasoning or evidence.
Now imagine he says something which is factually correct. If you say you agree, are you expected to provide references why? I don't think so.
---
BTW, on a site which is a more technical alternative to HN, there was recently a post about strange behavior of HN votes. Other people posted their experience with downvotes here and they mirrored mine - organic looking (i.e. gradual) upvotes, then within minutes of each other several downvotes. It could be coincidence but me and others suspect voting rings evading detection.
I also posted a link to my previous comment as an experiment - if people disagree, they are more likely to also downvote that one. But I did not see any change there so I suspect it might be bots (which are unlikely to be instructed to also click through and downvote there). Note sample size is 1 here, for now.
Maybe if you constructed your argument in terms of the relevant statutes for your jurisdiction, like an actual copyright attorney does, HN might be more receptive to it?
I argue primarily about morality (right and wrong), not legality. The argument is valid morally; if LLM companies found a loophole in the law, it should be closed.
You literally wrote "it would be interesting to get it into the courtrooms". A court won't give a hoot about your opinions on morality.
1) I appreciate that you differentiate between legality and morality, many people sadly don't.
2) re "hoot": You can say "fuck" here. You've been rudely dismissive twice now, yet you use a veil of politeness. I prefer when people don't hide their displeasure at me.
3) If you think I am wrong, you can say so instead of downvoting, it'll be more productive.
4) If you want me to expend effort on looking up statutes, you can say so instead of downvoting, it'll be more productive.
5) The law can be changed. If a well-reasoned argument is presented publicly, such as in a courtroom, and the general agreement is that the argument should apply but the court has to reject it because of poorly designed laws, that's a good impetus for changing them.
> I want to write less. Just knowing that LLMs are going to be trained on my code makes me feel more strongly than ever that my open source contributions will simply be stolen. Am I wrong to feel this? Is anyone else concerned about this?
I don't think it's wrong, but misdirected maybe. What do you mean by someone being able to "steal" your open source contributions? I've always released most of my code as "open source", and not once has someone "stolen" it; it still sits on the same webpage where I initially published it, decades ago. Sure, it's guaranteed to have been ingested into LLMs a long time ago, but that's hardly "stealing" when the thing is still there + given away for free.
I'm not sure how anyone can feel like their open source code was "stolen", wasn't the intention in the first place that anyone can use it for any purpose? That's at least why I release code as open source.
"Open Source" does not equal "No terms on how to share and use the code". Granted, there are such licenses but afaik the majority requires attribution at the minimum.
Then I'd say they're "breaking the license", not "stolen your project", but maybe I'm too anal about the meaning of words.
Yeah, fair, I could have been clearer. But yes, that is what I meant: breaking the license.
I’m unaware of any mainstream Open Source licenses that forbid training an AI model on the work. Are you using one?
[A]GPL is viral, so the derived code must use the same license. People that like that license care a lot about that.
On the other side, BSD0 is just a polite version of WTFPL, and people that like it don't care about what you do with the code.
And I mostly use MIT, which requires attribution. Does that mean when people use my code without attributing me, they're "stealing my code"? I would never call it that; I'd say they're "breaking the license", or similar.
The MIT license doesn’t require attribution for “using...code.” It reads as follows:
> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
> THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
The operative language here is "all copies or substantial portions of the Software." LLMs, with rare exceptions, don't retain copies or substantial portions of the software they were trained on. They're not libraries or archives. So it's unclear to me how training an AI model on an MIT-licensed project could violate the license.
(IAAL and this is my personal analysis, not legal advice.)
I think the GP said "use" in the programmer sense, i.e. Ctrl-C & Ctrl-V into your program. Not in the normal sense, i.e. double-click on the icon. So I guess we all agree.
I don't understand the mindset because I began my foray into open source exactly because I wanted to distribute and share my code.
In other words, I've never been in the position of feeling that my charitable givings anywhere were ever stolen.
Some people write code and put it out there without caveats. Some people jump into open source to be license warriors. Not me. I just write code and share it. If you're a person, great. If you're a machine, then I suppose that's okay too -- I don't want to play musical chairs with licenses all day just to throw some code out there, and I don't particularly care if someone more clever than myself uses it to generate a profit.
Me too.
I’ve never been a fan of coercive licensing. I don’t consider that “open.” It’s “strings-attached.”
I make mine MIT-licensed. If someone takes my stuff, and gets rich (highly unlikely), then that’s fine. I just don’t want some asshole suing me, because they used it inappropriately, or a bug caused them problems. I don’t even care about attribution.
I mainly do it, because it forces me to take better care, when I code.
You wouldn't even be the 100th developer to eventually regret that.
> regret that
I'm not exactly sure what you mean. I've been doing it for a couple of decades, so far, and haven't regretted it. Am I holding it wrong?
I'd be grateful for some elucidation.
Thanks!
Do you really struggle to understand the mindset?
Some people are happy to release code openly and have it used for anything, commercial or otherwise. Totally understandable and a valid choice to make.
Other people are happy to release code openly so long as people who incorporate it into their projects also release it in the same way. Again, totally understandable and valid.
None of this is hard to understand or confusing or even slightly weird.
I don't know if you're "wrong", but I do feel differently about this.
I've written a ton of open source code and I never cared what people do with it, both "good" or "bad". I only want my code to be "useful". Not just to the people I agree with, but to anyone who needs to use a computer.
Of course, I'd rather people use my code to feed the poor than build weapons, but it's just a preference. My conviction is that my code is _freed_ from me and my individual preferences and shared for everyone to use.
I don't think my code is "stolen", if someone uses it to make themselves rich.
And in that case, use MIT license or something like that for your code, and all is good. If I use AGPL, on the other hand, AI companies should not be allowed to train on that and then use the result of that training while ignoring the license.
> Not just to the people I agree with, but to anyone who needs to use a computer.
Why not say "... but to the people I disagree with"?
Would you be OK knowing your code is used to cause more harm than good? Would you still continue working on a hypothetical OSS project which had no users other than, say, a totalitarian government in the Middle East which executes homosexuals? Would you be OK with your software being a critical, directly involved piece of code for, say, tracking, de-anonymizing, and profiling them?
Where is the line for you?
As for me that's a risk I'm willing to accept in return for the freedom of the code.
I'm not going to deliberately write code that's LIKELY to do more harm than good, but crippling the potential positive impact just because of some largely hypothetical risk? That feels almost selfish: what would I really be trying to avoid, personally running into a feel-bad outcome?
I think it would be most interesting to find ways to restrict bad usage without crippling the positive impact.
Douglas Crockford[0] tried this with JSON. Now, strictly speaking, this does not satisfy the definition of Open Source (it merely is open source, lowercase). But after 10 years of working on Open Source, I came to the conclusion that Open Source is not the absolute social good we delude ourselves into thinking.
Sure, it's usually better than closed source because the freedoms mean people tend to have more control and it's harder for anyone (including large corporations) to restrict those freedoms. But I think it's a local optimum and we should start looking into better alternatives.
Android, for example, is nominally Open Source, but in reality the source is only published by Google periodically[1], making any true cooperation between the paid devs and the community difficult. And good luck getting it to actually run on a physical device without giving up things like Google Play, banking apps, or your warranty.
There's always ways to fuck people over and there always will be but we should look into further ways to limit and reduce them.
[0]: https://en.wikipedia.org/wiki/Douglas_Crockford
[1]: https://www.androidauthority.com/aosp-source-code-schedule-3...
I agree with the GP. While I wouldn’t be happy about such uses, I see the use as detached from the software as-is, given (assuming) that it isn’t purpose-built for the bad uses. If the software is only being used for nefarious purposes, then clearly you have built the wrong thing, not applied the wrong license. The totalitarian government wouldn’t care about your license anyway.
The one thing I do care about is attribution — though maybe actually not in the nefarious cases.
> The totalitarian government wouldn’t care about your license anyway.
I see this a lot and while being technically correct, I think it ignores the costs for them.
In practice such a government doesn't need to have laws and courts either, but it usually does, because it wants the appearance of justice.
Breaking international laws such as copyright also has costs for them. Nobody will probably care about one small project but large scale violations could (or at least should) lead to sanctions.
Similarly, if they want to offer their product in other countries, now they run the risk of having to pay fines.
Finally, see my sibling comment but a lot of people act like Open Source is an absolute good just because it's Open Source. By being explicit about our views about right and wrong, we draw attention to this delusion.
It’s fine to use whatever license you think is right. That includes the choice of using a permissive license. Restrictions are generally an impediment for adoption, due to their legal risk, even for morally immaculate users. I think that not placing usage restrictions on open source is just as natural as not placing usage restrictions on published research papers.
Tragedy of the commons. If all software had (compatible) clauses about permitted usage, then the choice would be to rewrite it inhouse or accept the restrictions. When there are alternatives (copyleft or permissive) which are not significantly worse, those will get used instead, even if taken in isolation, the restricted software was a bigger social good.
Then why open source something in the first place? The entire point is to make it public, for anyone to use however is useful to him or her, and often to publicly collaborate on a project together.
If I made something open source, you can train your LLM on it as much as you want. I'm glad my open source work is useful to you.
Plenty of people will gladly give you their hard work for free if you promise you'll return the favor. Or if you promise not to take your work and make others pay for it when they could just get it for free. Basically, help the people that want to embrace the freedoms of open source, but not the ones that are just in it for the free labour. Or at the very, very least, include a little "thank you" note.
AI doesn't hold up its end of the bargain, so if you're in that mindset you now have to decide between going full hands-off like you or not doing any open source work at all.
Given the amount of value I get from having AI models help me write code I would say that AI is paying me back for my (not insignificant) open source contributions a thousand times over.
Good for you, I guess? That doesn't really change the situation much for the people who do care and/or don't use AI.
I consider the payment I and my employer make to these AI companies to be what the LLM is paying me back for. Even the free ones get paid for my usage somehow. This stuff isn't charity.
You're quite vigorously replying to anyone disagreeing with the post (and haven't contributed to the top level as far as I can tell).
It comes across as really trying too hard and a bit aggressive.
You could just write one top level comment and chill a bit. Same advice for any future threads too...
> The entire point is to make it public, for anyone to use however is useful to him or her
The entire point isn’t to allow a large corporation to make private projects out of your open source project for many open source licenses. It’s to ensure the works that leverage your code are open source as well. Something AI is completely ignoring using various excuses as to why their specific type of theft is ok.
There is an open source world that believes in the MIT license, which has no obligation to keep derivatives FOSS.
Even the MIT license requires attribution, all of that gets lost when training an LLM.
Read all the text of the license carefully: https://news.ycombinator.com/item?id=46577208
I don't worry about that too much. I still contribute to FOSS projects, and I use FOSS projects. Whenever I contribute, I usually fix something that affects me (or maybe just something I encountered), and fixing it has a positive effect on the users of that software, including me.
I don't understand the invocation of Tailwind here. It doesn't make sense. Tailwind's LLM struggles had nothing to do with open source; they had to do with the fact that they had the same business model as a publisher, with ads pointing to their only product.
Exactly, their issue was about a drop in visits to their documentation site where they promote their paid products. If they were making money from usage, their business could really thrive with LLMs recommending Tailwind by default
AFAIK their issue is that LLMs have been trained on their paid product (Tailwind UI, etc.) and so can reproduce them very easily for free. Which means devs no longer pay for the product.
In other words, the open source model of "open core with paid additional features" may be dead thanks to LLMs. Perhaps less so for some types of applications, but for frameworks like Tailwind very much so.
That's not what Adam said. He said it was a traffic issue.
A common intention with open source is to allow people, and the AI tools they use, to reuse, recombine, etc. OSS code in any way they see fit. If that's not what you want, don't open source your work. It's not stealing if you gave it away and effectively told people "do whatever you want", which is one way licenses such as the MIT license are often characterized.
It's very hard to prevent specific types of usage (like feeding code to an LLM) without throwing out the baby with the bathwater and also preventing all sorts of other valid usages. AGPLv3, which is what antirez and Redis use, goes too far IMHO and still doesn't quite get the job done. It doesn't forbid people (or tools) from "looking" at the code, which is what AI training might be characterized as. That license creates lots of headaches for corporate legal departments. I switched to Valkey for that reason.
I actually prefer using MIT style licenses for my own contributions precisely because I don't want to constrain people or AI usage. Go for it. More power to you if you find my work useful. That's why I provide it for free. I think this is consistent with the original goals of open source developers. They wanted others to be able to use their stuff without having to worry about lawyers.
Anyway, AI progress won't stop because of any of this. As antirez says, that stuff is now part of our lives and it is a huge enabler if you are still interested in solving interesting problems. Which apparently he is. I can echo much of what he says. I've been able to solve larger and larger problems with AI tools. The last year has seen quite a bit of evolution in what is possible.
> Am I wrong to feel this?
I think your feelings are yours. But you might at least examine your own reasoning a bit more critically. Words like theft and stealing are big words. And I think your case for that is just very weak. And when you are coding yourself are you not standing on the shoulders of giants? Is that not theft?
> Am I wrong to feel this?
Why would a feeling be invalid? You have one life, you are under no obligation to produce clean training material, much less feel bad about this.
I think the Tailwind case is more complicated than this, but yes - I think it's reasonable to want to contribute something to the common good but fear that the value will disproportionally go to AI companies and shareholders.
Yes. If you didn't care who uses your code before, when contributing to open source, then it shouldn't matter now that a company picks up your code. You are also contributing this way too.
Tailwind is a business and they picked a business model that wasn't resilient enough.
This is a dilemma for me that gets more and more critical as I finalize my thesis. My default mental model was to open source for the sake of contributing back to the community, enhance my ideas and discuss them with whoever finds it interesting.
To my surprise, my doctoral advisor told me to keep the code closed. She told me that not only will LLMs steal it and benefit from it, but there's a risk of my code becoming a target after it's stolen by companies with fat attorney budgets, and there's no way I could defend and prove anything.
I do open source exactly because I'm fine with my work being "stolen".
Stolen means no attribution and not following the rules of the GPL, instead producing un-attributed AI-washed closed source code owned by companies.
GPL requires attribution. Some people are fine with their code being used by others for free while still expecting their work to be acknowledged. Code posted on Stackoverflow is apparently CC-BY-SA licensed, which means attribution is still required.
I'm convinced that LLMs will result in all software needing to be open source (or at the very least source available).
In the future everyone will expect to be able to customise an application; if the source is not available they will not choose your application as a base. It's that simple.
The future is highly customisable software, and that is best built on open source. How this looks from a business perspective I think we will have to find out, but it's going to be fun!
Why do you think customization can only viably be done by changing the code of the application itself?
I think there is room for closed source platforms that are built on top of using LLMs via some sort of API that it exposes. For example, iOS can be closed source and LLMs can develop apps for it to expand the capabilities of one's phone.
Allowing total customization by a business can allow them to mess up the app itself or make other mistakes. I don't think it's the best interface for allowing others to extend the app.
I'm convinced of the opposite. I think a lot more software will be closed source so that an LLM cannot reproduce it from its training data for free.
> In the future everyone will expect to be able to customise an application; if the source is not available they will not choose your application as a base. It's that simple.
This seems unlikely. It's not the norm today for closed-source software. Why would it be different tomorrow?
Because we now have LLMs that can read the code for us.
I'm feeling this already.
Just the other day I was messing around with Fly's new Sprites.dev system and I found myself confused as to how one of the "sprite" CLI features worked.
So I went to clone the git repo and have Claude Code figure out the answer... and was surprised to find that the "sprite" CLI tool itself (unlike Fly's flycli tool, which I answer questions about like this pretty often) wasn't open source!
That was a genuine blocker for me because it prevented me from answering my question.
It reminded me that the most frustrating thing about using macOS these days is that so much of it is closed source.
I'd love to have Claude write me proper documentation for the sandbox-exec command for example, but that thing is pretty much a black hole.
I'm not convinced that lowering the barrier to entry to software changes will result in this kind of change of norms. The reasons for closed-source commercial software not supporting customisation largely remain the same. Here are the ones that spring to mind:
• Increased upfront software complexity
• Increased maintenance burden (to not break officially supported plugins/customizations)
• Increased support burden
• Possible security/regulatory/liability issues
• The company may want to deliberately block functionality that users want (e.g. data migration, integration with competing services, or removing ads and content recommendations)
> That was a genuine blocker for me because it prevented me from answering my question.
It's always been this way. From the user's point of view there has always been value in having access to the source, especially under the terms of a proper Free and Open Source licence.
This is why I never got into open source in the first place. I was worried that new programmers might read my code, learn how to program, and then start independently contributing to the projects I know and love - significantly devaluing my contributions.
Unless I am missing something, it seems that you only need to use something like the following (obtained via a quick search, haven't tried it):
https://archclx.medium.com/enforcing-gpg-encryption-in-githu...
My opinion on the matter is that AI models stealing the open source code would be ok IF the models are also open and remain so, and the services like chatgpt will remain free of cost (at least a free tier), and remain free of ads.
But we all know how it is going to go.
Not wrong. But I don't share your concerns at all. I like sharing code, and if people, and who knows, machines, can make use of it and provide some value however minute, that makes me content.
> But, in general, it is now clear that for most projects, writing the code yourself is no longer sensible, if not to have fun.
I want to write code to defy this logic and express my humanity. "To have fun", yes. But also to showcase what it means when a human engages in the act of programming. Writing code may increasingly not be "needed", but it increasingly is art.
This is an absolutely valid concern. We either need strong governmental intervention against models that don't comply with OSS licenses.
Or we accept that there definitely won't be open model businesses: make them proprietary and accept the fact that even permissive licenses such as MIT and BSD 2/3-Clause won't be followed by anyone while writing OSS.
And as for Tailwind, I dunno if it is because of AI.
With Tailwind, wasn't the problem that much fewer people visited the documentation, which showed ads? The LLMs still used Tailwind
Use a license that doesn't allow it then.
Not everything needs to be MIT or GPL.
LLMs don't care about licenses. And even if they did, the people who use them to generate code don't care about licenses.
Thieves don't care about locks, so doors are pointless.
Thieves very much do care about doors and locks, because they are a physical barrier that must be bypassed, and doing so is illegal.
Software licenses aren't, AI companies can just take your GPL code and spit it back out into non-GPL codebases and there's no way for you to even find out it happened, much less do anything about it, and the law won't help you either.
> Am I wrong to feel this?
There's no such thing as a wrong feeling.
And I say this as one of those with the view that AI training is "learning" rather than "stealing", or at least that this is the goal, because AI is the dumbest, most error-prone, and also most expensive way to try to make a copy of something.
My fears about setting things loose for public consumption are more about how I will be judged for them than about being ripped off, which is kinda why that book I started writing a decade ago and have not meaningfully touched in the last 12 months is neither published properly nor sent to some online archive.
When it comes to licensing source code, I mostly choose MIT, because I don't care what anyone does with the code once it's out there.
But there's no such thing as a wrong feeling, anyone who dismisses your response is blinding themselves to a common human response that also led to various previous violent uprisings against the owners of expensive tools of automation that destroyed the careers of respectable workers.
I want to write less, because quite frankly I get zero satisfaction from having an LLM churn out code for me, in the same way that Vincent van Gogh would likely derive no joy from using Nano Banana to create a painting.
And sure, I could stubbornly refuse to use an LLM and write the code myself. But after getting used to LLM-assisted coding, particularly recent models, writing code by hand feels extremely tedious now.
If you don't want people "stealing" your code, you don't want open source. You want source available.
You're confusing open source with public domain.
I've been writing a bunch of DSLs lately and I would love to have LLMs train on this data.
If you give and expect something in return, then you are not giving; that is a transaction.
No, you're absolutely right.
LLMs are labor theft on an industrial scale.
I spent 10 years writing open source, I haven't touched it in the last 2. I wrote for multiple reasons none of which any longer apply:
- I believe every software project should have an open source alternative. But writing open source now means useful patterns can be extracted and incorporated into closed source versions _mechanically_ and with plausible deniability. It's ironically worse if you write useful comments.
- I enjoyed the community aspect of building something bigger than one person can accomplish. But LLMs are trained on the whole history and potentially forum posts / chat logs / emails which went into designing the SW too. With sufficiently advanced models, they effectively use my work to create a simulation of myself and other devs.
- I believe people (not just devs) should own the product they build (an even stronger protection of workers against exploitation than copyright). Now our past work is being used to replace us in the future without any compensation.
- I did it to get credit. Even though it was a small motivation compared to the rest, I enjoyed everyone knowing what I accomplished and I used it during job interviews. If somebody used my work, my name was attached to it. With LLMs, anyone can launder it and nobody knows how useful my work was.
- (not solely LLM related) I believed better technology improves the world and quality of life around me. Now I see it as a tool - neutral - to be used by anyone for both good and bad purposes.
Here's[0] a comment where I described why it's theft, based on how LLMs work. I call it higher-order plagiarism. I haven't seen this argument made by other people; it might be useful for arguing against those who want to legalize this.
In fact, I wonder if this argument has been made in court and whether the lawyers understand LLMs enough to make it.
[0]: https://news.ycombinator.com/item?id=46187330
> As a programmer, I want to write more open source than ever, now.
I believe open source will become a bit less relevant in its current form, as solution/project-tailored libraries/frameworks can be generated in a few hours with LLMs.
I’ve written plenty of open source and I’m glad it’s going into the great training models that help everyone out.
I love AI and pay for four services and will never program without AI again.
It pleases me that my projects might be helping out.
Also open source without support has zero value. And you can support only 1-2 projects.
Meaning 99% of everything oss released now is de-facto abandonware.
Also why would I use your open source project, when I can just prompt the AI to generate one for me, gracefully stripping the license as a bonus?
You are not wrong to feel this, because you cannot control what you feel. But it might be worth investigating why you feel this, and why were you writing open source in the first place.
Job insecurity, while a bunch of companies claim LLM coding agents are letting them decimate their workforces, is a pretty solid reason to feel like your code is being stolen. Many, if not most, tech workers have been very sheltered from the harsher economic realities most people face, and many are realizing that it was labor demand, rather than them being special, that sheltered them. A core goal of AI products is increasing the supply of what developer labor produces, which reduces demand for that labor. So yeah, feeling robbed when your donated code is used to train models is pretty rational.
Ultimately, most things in life and society where one gives freely (and open source could be said to be one such activity) are balanced by the expectation that everyone participating in the "system" also reciprocates; without that, it becomes an exploitative relationship. Examples of such sayings can be found in most major world religions, but a non-religious explanation of the dynamics at hand follows below.
If running an open source model means that I have only given out without receiving anything, there remains the possibility of being exploited. This dynamic has always existed, such as companies using a project and sending in vulnerability reports and the like but not offering to help, and instead demanding, often quite rudely.
In the past working with such extractive contributors may have been balanced with other benefits such as growing exposure leading to professional opportunities, or being able to sell hosted versions, consulting services and paid features, which would have helped the maintainer of the open source project pay off their bills and get ahead in life.
However, with the rise of LLMs, the maintainer both loses the chance to direct users of the open source tools towards these paid services and loses direct exposure to their contributors. It also indirectly violates the spirit of said open source licenses, as LLMs can spit out the knowledge contained in these codebases at a scale that humans cannot, thus allowing people to bypass the license and create their own versions of the tools, which are themselves not open source despite deriving their knowledge from such data.
Ultimately we don't need to debate about this; if open source remains a viable model in the age of LLMs, people will continue to do it regardless of whether we agree or disagree regarding topics such as this; on the other hand, if people are not rewarded in any way we will only be left with LLM generated codebases that anyone could have produced, leaving all the interesting software development to happen behind closed doors in companies.
It is actually very much possible, and quite simple, to control what you feel. This deterministic idea about our feelings must die quickly. Pro tip: call the psychology department at your local university and they will happily teach you how to control your feelings.