We need LLMs that have a certificate of origin.

For instance, a GPL LLM trained only on GPL code, where all of the source data is known and all of the output is GPL.

It could be done with a distributed effort.

It is not clear that copyright carries over to the LLM output; that is, the output is not necessarily a derivative work.

If it is not a derivative work, "copyleft" has nothing to attach to in the output, and so no GPL applies.

Not necessarily a bad idea, but I think the bigger issue here and now is the increasing asymmetry in effort between code submitter and reviewer, and the unsustainable review burden on maintainers if nothing is done.

I don't think the licensing issues are the main problem; the spam is.

Honestly, given that such a GPL-only model would be far below SOTA in capabilities, what exactly would its use case be? Why would anyone use an inferior LLM if they can get away with using a superior one?

It doesn't make sense, because GPL means that only GPL comes out, not that only GPL can go in:

>Many of the most common free-software licenses, especially the permissive licenses, such as the original MIT/X license, BSD licenses (in the three-clause and two-clause forms, though not the original four-clause form), MPL 2.0, and LGPL, are GPL-compatible. That is, their code can be combined with a program under the GPL without conflict, and the new combination would have the GPL applied to the whole (but the other license would not so apply). https://en.wikipedia.org/wiki/License_compatibility#GPL_comp...

A model that contains no GPL code makes sense, so that people releasing under non-GPL licenses don't end up violating the GPL.
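
To make the two options concrete, here is a rough Python sketch of how a training corpus could be split by license. Everything in it is an assumption for illustration: the `SourceFile` record, the `split_corpus` helper, and the choice of SPDX-style identifiers are hypothetical, and the license sets only mirror the examples in the Wikipedia passage quoted above (plus the four-clause BSD it excludes). It is not legal advice and not a complete compatibility table.

```python
from dataclasses import dataclass

# Assumption: licenses are recorded as SPDX-style identifiers.
# These sets only mirror the examples from the quoted passage.
GPL_FAMILY = {"GPL-2.0-or-later", "GPL-3.0-or-later"}
GPL_COMPATIBLE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "MPL-2.0", "LGPL-2.1-or-later"}
# Known licenses that are neither GPL nor GPL-compatible (original 4-clause BSD).
OTHER_NON_GPL = {"BSD-4-Clause"}


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical corpus record: a path plus its declared license."""
    path: str
    license_id: str


def split_corpus(files):
    """Return (gpl_output_corpus, gpl_free_corpus).

    gpl_output_corpus: GPL code plus GPL-compatible code, i.e. inputs whose
        combination could be released under the GPL.
    gpl_free_corpus: known licenses outside the GPL family, for a model that
        contains no GPL code.
    Files with an unrecognized license are left out of both.
    """
    gpl_output = [f for f in files if f.license_id in GPL_FAMILY | GPL_COMPATIBLE]
    gpl_free = [f for f in files if f.license_id in GPL_COMPATIBLE | OTHER_NON_GPL]
    return gpl_output, gpl_free


if __name__ == "__main__":
    sample = [
        SourceFile("a.c", "MIT"),
        SourceFile("b.c", "GPL-3.0-or-later"),
        SourceFile("c.c", "BSD-4-Clause"),
        SourceFile("d.c", "UNKNOWN"),
    ]
    gpl_output, gpl_free = split_corpus(sample)
    print("GPL-output corpus:", [f.path for f in gpl_output])
    print("GPL-free corpus:  ", [f.path for f in gpl_free])
```

Note that the MIT file lands in both corpora, which is the point of the quote: permissive code can go into a GPL combination, so a "GPL LLM" would not need to be trained on GPL code only.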

Rather, LLMs that do NOT contain GPL code.