Hacker News

"if Claude was trained on the LGPL-licensed codebase and its output reflects patterns learned from that code, can the output be treated as license-free? The emerging legal consensus is probably not, and assuming it can creates significant liability for anyone shipping that code commercially."

Is there any citation for this "legal consensus"? I was not aware there were any evidence-backed stances on this topic yet.

onlyrealcuzzo 20 hours ago

This sounds like a problem that's pretty easy to get around.

CC does not need LGPL code. There's more than enough BSD and Apache code to go around.

And they can generate synthetic data that is better than LGPL code for their training.

It's also a problem that does not seem feasible to meaningfully enforce.

It's easy to generate CC code and lie and say you didn't. It would be hard to prove that you did, especially if you took any precautions to make it even slightly difficult to detect.

adrian_b 20 hours ago

Unlike GPL, BSD and Apache licenses do not claim to also cover your non-AI-generated code that only invokes the AI-generated code.

However, even if the BSD/Apache/MIT licensed code can be incorporated freely in your application, you still have no right to remove the copyright notices from it and/or to claim that you own the copyright for it.

Therefore, unless the AI model has been trained only on non-copyrighted public-domain code, incorporating the generated code in your application means that you have removed the copyright notices from it, which is not allowed by the original licenses.

There is absolutely no doubt that using an AI coding assistant works around the copyright laws, but it is still equivalent to copying and pasting fragments from copyrighted works into your source code.

I believe that copyright should not apply to program source code, at least not in its current form, so reusing parts of other programs should be fair use, but only if human programmers are allowed to do the same.

onlyrealcuzzo 14 hours ago

> However, even if the BSD/Apache/MIT licensed code can be incorporated freely in your application, you still have no right to remove the copyright notices from it and/or to claim that you own the copyright for it.

I can't speak for all licenses, but I'm familiar with at least one BSD license. That's almost the entire point of it...

You cannot take their literal code and call it your own. You can derive code from it and call it your own. That's what LLMs primarily do.

NoMoreNicksLeft 18 hours ago

With sufficient obfuscation (which models seem to provide intrinsically), how would anyone know to sue? On top of that, only the most major sorts of litigation have the legal force to pierce even the flimsiest of obfuscation... this is likely all moot.

If some GPL-licensed group were to sue some commercial software project that they do not have the source code for, what would even give it away? But say they throw $1 million at a lawyer who can at least get it to the discovery phase somehow, and the source code is provided. It looks to be shit, but maybe an expert witness would come along and say "that looks inspired by the open source project". Where does it go from there? The model is a black box, but maybe you've got a superhero lawyer who manages to rope in Anthropic or OpenAI, and you can see how it produced the code given those prompts. What now? Are there any expert witnesses who both could say and would say that it was "bulk copying-pasting code"? And if it were, what jury is going to go for that theory of the crime? Copying-and-pasting, but the code doesn't match, except in short little strings that any code might match. This isn't a slam dunk, and it's not going to proceed very far unless it's another Google-vs-Oracle shitfest.

senaevren 19 hours ago

The chardet dispute is the closest thing to an active test case on this specific question, and you are right that it has not resolved into settled law. "Emerging legal consensus" was imprecise. The more accurate framing is: the legal community's working assumption, based on how copyright doctrine treats derivative works, is that training-data provenance travels with the output. That assumption has not been tested definitively in court yet.

senaevren 18 hours ago

Thanks for this; it's definitely a fair point. I updated the piece to reflect this.