I might be crazy, and I'd love to hear from somebody who knows about this, but I've been assuming that AI companies have been pulling GPL code out of the training material specifically to avoid this.
Corporations have always talked about the virality of the GPL, sometimes (but not always) to the point of exaggeration. You'd think that after getting the proof of concept done, the AI companies would be running away at full speed from planting a bomb like that in their goldmine.
Putting in tons of commonly read books and scientific papers is safer; they can eventually cross-license with the massive conglomerates that own everything. But the GPL is by nature hostile, and has been openly and specifically hostile from the beginning. With MIT, Apache, etc., you can just include a fistful of licenses with the download, or even come up with architectures that track names for attribution-ware. But the GPL will obviously (and legitimately) claim to have relicensed the entire model and maybe all its output (unless they restricted training to LGPL code).
Wouldn't you just pull it out?
If you were a thoughtful, careful, law-abiding business, yes.
I submit the evidence suggests the genAI companies have none of those attributes.
Not crazy - there's a rational self-interest in doing this.
But I'm not certain that the relevant players have the same consequence-fearing mindset you do, and to be honest they're probably right not to. The theft is too vast for the consequences to be calculable, and by the time it's settled, what are you gonna do - turn off Forster's Machine?
I hope you're right in at least some cases!
> by the time it's settled
Why would the GPL settle? More to the point, who is authorized to settle on behalf of every author who released code under the GPL? If the courts decided in favor of the GPL, which I think is likely just because of its age and pervasiveness, the AI companies would actually have to lobby Congress to write an exception to copyright rules for AI.
A large part of the infrastructure of the world is built on the GPL, and the people who wrote that code were clearly motivated by the protection they thought the GPL would give to what was often a charitable act, or even an act that let companies share code without having to compete with themselves. I can't imagine too many judges just going "nope."
I think they meant "settled" as in "resolved."
I meant the same. I don't actually think the GPL is an entity that can settle a court case; if I'd meant that, I would have said the FSF or something. I mean that for the case to resolve in the AI companies' favor, a judge has to say that the GPL does not apply.
If ultimately copyright holds up against the models*, the GPL will be a permanent holdout against any intellectual property-wide cross-licensing scheme. There's nobody to negotiate with other than the license itself, and it's not going to say anything it hasn't said before.
* It hasn't done well so far, but Obama didn't appoint any SCOTUS judges so maybe the public has a chance against the corporations there.
Why do hard thing when easy thing do trick?
> I might be crazy, and I'd love to hear from somebody who knows about this, but I've been assuming that AI companies have been pulling GPL code out of the training material specifically to avoid this.
Haha no.
https://windsurf.com/blog/copilot-trains-on-gpl-codeium-does...
And just in the last two days, AI generating LGPL headers (which it could not do if identifiable LGPL code had been pulled from the training data) and misattributing authors:
https://devclass.com/2025/11/27/ocaml-maintainers-reject-mas...
Thanks for the links.
That first link shows people actively pulling GPL code out in 2023 and marketing around that fact, though. That's not great evidence that they're not doing it now, especially since testing whether GPL code is still in there seems to be as easy as prompting with an incomplete piece of it.
I'd think that companies could amass a collection of all known GPL code and test for it regularly in order to refine their methods for keeping it out.
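A minimal sketch of what such a probe might look like, assuming a hypothetical `generate(prompt) -> str` wrapper around whatever model is under test (the function names, sample sizes, and threshold here are all illustrative, not anyone's actual pipeline):

```python
import difflib

def memorization_probe(generate, source_text, prefix_chars=500, suffix_chars=500):
    """Feed the model the opening of a known GPL file and measure how
    closely its continuation matches the file's real continuation.
    A near-verbatim match suggests the file (or a close derivative)
    is still in the training data."""
    prefix = source_text[:prefix_chars]
    true_suffix = source_text[prefix_chars:prefix_chars + suffix_chars]
    completion = generate(prefix)[:len(true_suffix)]
    # Similarity ratio in [0, 1] between generated and real continuations.
    return difflib.SequenceMatcher(None, completion, true_suffix).ratio()

def scan_corpus(generate, gpl_files, threshold=0.9):
    """Run the probe over a mapping of {path: file text} for known GPL
    files; the 0.9 cutoff is arbitrary and would need tuning."""
    return [path for path, text in gpl_files.items()
            if memorization_probe(generate, text) >= threshold]
```

In practice you'd want many prefixes per file and fuzzier matching, since models paraphrase, but that's the basic shape of the test.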
> (which it could not do if identifiable LGPL code had been pulled from the training data)
Are you sure about this? Linking to LGPL code is fine, afaik. And why not train on code that links to universally available libraries that are legal to use? Seems like one might even prefer it.
Seems like this was rejected for size and slop reasons, not licensing. If the submitter of the PR isn't even fixing possibly hallucinated authors' names, it's obvious that they didn't really read it. Debugging vibe-coded stuff is like finding an indeterminate number of needles in a haystack.