AI models have already looked at the source of GPL software and contain it in their dataset. Adding the minecraft source to the mix wouldn't seem much different. Of course art assets and trade marks would have to be replaced. But an AI "clean room" implementation has yet to be legally tested.

That's why he is saying it's not equivalent. For it to be the same, the LLM would have to train on/transform Minecraft's source code into its weights, then you prompt the LLM to make a game using the specifications of Minecraft solely through prompts. Of course it's copyright infringement if you just give a tool Minecraft's source code and tell it to copy it, just like it would be copyright infringement if you used a copier to copy Minecraft's source code into a new document and say you recreated Minecraft.

What if Copilot was already trained with Minecraft code in the dataset? Should be possible to test by telling the model to continue a snippet from the leaked code, the same way a news website proved their articles were used for training.

I feel as though the fact that you are asking a valid question shows how transformative it is; clearly, while the LLM gets a general ability to code from its training corpus, the data gets so transformed that it's difficult to tell what exactly it was trained on except a large body of code.

Then the training itself is the legal question. This doesn't seem all that complicated to me.

Is there a legal distinction between training, post-training, fine tuning and filling up a context window?

In all of these cases an AI model is taking a copyrighted source, reading it, jumbling the bytes and storing it in its memory as vectors.

Later a query reads these vectors and outputs them in a form which may or may not be similar to the original.

Judges have previously ruled that training counts as sufficiently transformative to qualify for fair use: https://www.whitecase.com/insight-alert/two-california-distr...

I don't know of any rulings on the context window, but it's certainly possible judges would rule that would not qualify as transformative.

[deleted]
[deleted]

It's not equivalent, but it's close enough that you can't easily dismiss it.

For copyright purposes I think there is an important legal distinction between training data (fed in once, ahead of time, and can in theory no longer be recovered as-is) and context window data (stored exactly for the duration of the model call).

I'm not sure there should be, but I think there is.