As described, this would not be the same thing. If the AI is looking at the source and effectively porting it, that is likely infringement. The idea instead should be "implement Minecraft from scratch" but with behavior, graphics, etc. identical. Note that you'll need to have an AI generate assets or something since you can't just reuse textures and models.

AI models have already looked at the source of GPL software and contain it in their dataset. Adding the minecraft source to the mix wouldn't seem much different. Of course art assets and trade marks would have to be replaced. But an AI "clean room" implementation has yet to be legally tested.

That's why he is saying it's not equivalent. For it to be the same, the LLM would have to train on/transform Minecraft's source code into its weights, then you prompt the LLM to make a game using the specifications of Minecraft solely through prompts. Of course it's copyright infringement if you just give a tool Minecraft's source code and tell it to copy it, just like it would be copyright infringement if you used a copier to copy Minecraft's source code into a new document and say you recreated Minecraft.

What if Copilot was already trained with Minecraft code in the dataset? Should be possible to test by telling the model to continue a snippet from the leaked code, the same way a news website proved their articles were used for training.

I feel as though the fact that you are asking a valid question shows how transformative it is; clearly, while the LLM gets a general ability to code from its training corpus, the data gets so transformed that it's difficult to tell what exactly it was trained on except a large body of code.

Then the training itself is the legal question. This doesn't seem all that complicated to me.

Is there a legal distinction between training, post-training, fine tuning and filling up a context window?

In all of these cases an AI model is taking a copyrighted source, reading it, jumbling the bytes and storing it in its memory as vectors.

Later a query reads these vectors and outputs them in a form which may or may not be similar to the original.

Judges have previously ruled that training counts as sufficiently transformative to qualify for fair use: https://www.whitecase.com/insight-alert/two-california-distr...

I don't know of any rulings on the context window, but it's certainly possible judges would rule that would not qualify as transformative.

[deleted]
[deleted]

It's not equivalent, but it's close enough that you can't easily dismiss it.

For copyright purposes I think there is an important legal distinction between training data (fed in once, ahead of time, and can in theory no longer be recovered as-is) and context window data (stored exactly for the duration of the model call).

I'm not sure there should be, but I think there is.

A room "as clean" as the one under dispute (chardet) is very easy to replicate.

AI 1: - (reads the source), creates a spec + acceptance criteria

AI 2: - implements from spec

AI 1 is in the position of the maintainer who facilitated the license swap.

> Note that you'll need to have an AI generate assets or something since you can't just reuse textures and models.

As far as I know, you can as long as you own a copy of the original. In other words, you can't redistribute the assets, but you can distribute the code that works with them. This is literally how every free/libre game remake works. The copyright of your new, from-scratch code, is in no way linked to that of the assets.

"Behavior, graphics, etc." would likely constitute separate IP from the code. I am not sure there's a model that allows you to make AI reproduce Minecraft without telling it what "Minecraft" is - which would likely contaminate it with IP-protected information.