Wouldn't it be still legal to train on the data due to fair use?

I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default licence that prohibits absolutely everything, humanity minus one considers it fair use.

Honest question: why don’t you think it is fair use?

I can see how it pushes the boundary, but I can’t lay out the logic for why it’s not. The code has been published for the public to see. I’m always allowed to read it, remember it, tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn’t they have kept it to themselves?

These agents are just doing a more sophisticated, faster version of that same act.

Some projects like Wine forbid you from contributing if you have ever seen the source of MS Windows [1]. The meatball inside your head is tainted.

I don't remember the exact case now, but someone was cloning a program (Lotus 1-2-3 -> Quattro or Excel???). They printed every single screen and had a team write a full specification in English. Later, a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain English text loophole was safe enough.

[1] From https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...

> Who can't contribute to Wine?

> Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise). There are some exceptions for the source code of add-on components (ATL, MFC, msvcrt); see the next question.

> I don't remember the exact case now, but someone was cloning a program (Lotus 1-2-3 -> Quattro or Excel???). They printed every single screen and had a team write a full specification in English. Later, a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain English text loophole was safe enough.

This is close to how I would actually recommend reimplementing a legacy system (owned by the re-implementer) with AI SWE. Not to avoid copyright, but to get the AI to build up everything it needs to maintain the system over a long period of time. The separate team is just a new AI instance whose context doesn’t contain the legacy code (because that would pollute the new result). The analogy isn’t too apt though, since there is a difference between having something in your context (which you can control and is very targeted) and the code that the model was trained on (which all AI instances will share unless you use different models, and anyways, it isn’t supposed to be targeted).

The fair use prong that's problematic is that the use can't decimate the value of the original work. It's the difference between me imitating your art style for a personal project and me making 1,000,000 copies of your art so that your art isn't worth much anymore. One is fair use, the other is exploitative extraction.

Before LLMs, programmers had a pretty good intuition for what the GPL license allowed. It is of course clear that you cannot release a closed source program with GPL code integrated into it. I think it was also quite clear that you cannot legally incorporate GPL code into such a program by making changes here and there, renaming some stuff, and moving things around, but this is pretty much what LLMs are doing. When humans do it intentionally, it is a violation of the license; when it is automated and done on a huge scale, is it really fair use?

> this is pretty much what LLMs are doing

I think this is the part where we disagree. Have you used LLMs, or is this based on something you read?

Do you honestly believe there are people on this board who haven't used LLMs? Ridiculing someone you disagree with is a poor way to make an argument.

Lots of people on this board are philosophically opposed to them, so it was a reasonable question, especially in light of your description of them.

[deleted]

Just corporations, their shills, and people who think LLMs are god's gift to humanity disagree with you.