Training is not redistribution. It's the same as you, as a person, learning to program from proprietary, secret code and then writing your own original code independently. Even if you repeat patterns and methods you've picked up from that proprietary learning material, it is by no means redistribution. The practical differentiator is that you do not access the proprietary material while creating your own original work, similar in principle to a clean-room design. With AI/ML, what matters is that the training data is not accessed during inference, and it isn't.

The other relevant factor in copyright is how the material is obtained. If the material is publicly accessible without protection, you have no reasonable expectation of exclusive control over its use. If you don't want AI training done on your work, you need to put access to it behind explicit authentication with a legally binding user agreement prohibiting that use case. Note that doing so would cost your project its open-source status.

> Training is not redistribution. It's the exact same as you as a person learning to program from proprietary secret code, and then writing your own original code independently.

Well, the difference is that copyright law applies to work fixed in a tangible medium of expression. That covers, e.g., model weights on a hard drive, but not the human brain. If the model is able to reproduce others' work verbatim (like the song-lyrics example the article brings up), then under copyright law that's unauthorized reproduction. It doesn't matter that the data is expressed as probabilistic weights: thanks to past lobbying and lawsuits by the software industry to get compiled binary code covered by copyright, reproduction can include copies that aren't directly human-readable.

> If the material is publicly accessible without protection, you have no reasonable expectation to exclusive control over its use.

There are over 20 years of successful GPL infringement lawsuits over unlicensed use of publicly available GPL code that disagree with this point.

so basically we download the source files into the training weights and remove the LICENSE.MD, as it's exactly the same as a person learning to program from proprietary secret code and outputting code based on it for millions of people in a matter of seconds /s

we also treat public goods found on the internet however we want, as if the World Intellectual Property Organization Copyright Treaty and the Berne Convention for the Protection of Literary and Artistic Works aren't real, or because we can, since we're operating in international waters, selling products to other sailors living exclusively in international waters /s

If you download GPL source code and run `wc` on its files and distribute the output of that, is that a violation of copyright and the GPL? What if you do that for every GPL program on github? What if you use python and numpy and generate a list of every word or symbol used in those programs and how frequently they appear? What if you generate the same frequency data, but also add a weighting by what the previous symbol or word was? What if you did that and also added a weighting by what the next symbol or word was? How many statistical analyses of the code files do you need to bundle together before it becomes copyright infringement?
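
To make that progression concrete, here's a minimal sketch in plain Python (the paragraph mentions numpy, but `collections.Counter` is enough to show the idea; the corpus strings are made-up stand-ins for GPL source files): step 1 is the `wc`-style summary, step 2 the raw word/symbol frequencies, step 3 the same frequencies weighted by the previous token.

```python
from collections import Counter
import re

def tokenize(source: str) -> list[str]:
    """Split source text into rough word/symbol tokens."""
    return re.findall(r"\w+|[^\w\s]", source)

# Hypothetical stand-in for the contents of GPL-licensed source files.
corpus = [
    "int main(void) { return 0; }",
    'printf("hello, world\\n");',
]

tokens = [tok for src in corpus for tok in tokenize(src)]

# Step 1: the `wc`-like summary -- nothing but counts.
print("files:", len(corpus), "tokens:", len(tokens))

# Step 2: unigram frequencies -- how often each word or symbol appears.
unigrams = Counter(tokens)

# Step 3: the same frequencies, but keyed on the previous token as well
# (a bigram count -- the first rung of the "weighted by context" ladder).
bigrams = Counter(zip(tokens, tokens[1:]))

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```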

The line is somewhere between running `wc` on the entire input and running `gzip` on the entire input.

The fact that a slippery slope is slippery doesn't make it not a slope.
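
For what it's worth, the two endpoints are easy to demonstrate (the file name here is made up): the `wc`-style output preserves essentially none of the original expression, while the `gzip` output is the entire work in a different encoding, recoverable verbatim, much like the compiled binaries mentioned earlier.

```python
import gzip

source = open("main.c", "rb").read()  # any hypothetical GPL-licensed file

# `wc`-style output: three integers; essentially none of the original
# expression survives in them.
wc_like = (source.count(b"\n"), len(source.split()), len(source))

# gzip output: a different encoding of the *entire* work -- the original
# is recoverable verbatim.
compressed = gzip.compress(source)
assert gzip.decompress(compressed) == source
```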