Someone should put this to the test. Take the recently leaked Minecraft source code and have Copilot build an exact replica in another programming language and then publish it as open source. See if Microsoft believes AI is copyright infringement or not.

As described, this would not be the same thing. If the AI is looking at the source and effectively porting it, that is likely infringement. The idea instead should be "implement Minecraft from scratch" but with behavior, graphics, etc. identical. Note that you'll need to have an AI generate assets or something since you can't just reuse textures and models.

AI models have already looked at the source of GPL software and contain it in their dataset. Adding the Minecraft source to the mix wouldn't seem much different. Of course, art assets and trademarks would have to be replaced. But an AI "clean room" implementation has yet to be legally tested.

That's why he is saying it's not equivalent. For it to be the same, the LLM would have to train on/transform Minecraft's source code into its weights, and then you would prompt the LLM to make a game matching Minecraft's specifications solely through prompts. Of course it's copyright infringement if you just give a tool Minecraft's source code and tell it to copy it, just like it would be copyright infringement if you used a copier to copy Minecraft's source code into a new document and said you recreated Minecraft.

What if Copilot was already trained with Minecraft code in its dataset? It should be possible to test by telling the model to continue a snippet from the leaked code, the same way a news website proved their articles were used for training.
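
A rough sketch of that continuation test, hedged heavily: the `complete` callable is a hypothetical stand-in for whatever model API is under test, and the snippet below is an invented placeholder, not actual leaked code.

```python
from difflib import SequenceMatcher

def continuation_overlap(known_source: str, split_at: int, complete) -> float:
    """Feed the model the opening of a known snippet and score how closely
    its continuation matches the real remainder (0.0 to 1.0). A near-exact
    match suggests the snippet was memorized during training."""
    prompt, expected = known_source[:split_at], known_source[split_at:]
    generated = complete(prompt)  # hypothetical model call
    return SequenceMatcher(None, expected, generated[: len(expected)]).ratio()

# Invented placeholder snippet; a real test would use text from the leak.
snippet = "public boolean canSpawnHere() { return this.world.checkNoEntityCollision"
# Stub 'model' that parrots the true continuation, so the score is trivially 1.0:
score = continuation_overlap(snippet, 30, lambda p: snippet[30:])
```

A single high score isn't proof by itself; the news-site demonstrations worked by repeating this over many snippets that couldn't plausibly be reconstructed by chance.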

I feel as though the fact that you are asking a valid question shows how transformative it is; clearly, while the LLM gets a general ability to code from its training corpus, the data gets so transformed that it's difficult to tell what exactly it was trained on except a large body of code.

Then the training itself is the legal question. This doesn't seem all that complicated to me.

Is there a legal distinction between training, post-training, fine-tuning, and filling up a context window?

In all of these cases an AI model is taking a copyrighted source, reading it, jumbling the bytes and storing it in its memory as vectors.

Later a query reads these vectors and outputs them in a form which may or may not be similar to the original.

Judges have previously ruled that training counts as sufficiently transformative to qualify for fair use: https://www.whitecase.com/insight-alert/two-california-distr...

I don't know of any rulings on the context window, but it's certainly possible judges would rule that would not qualify as transformative.

It's not equivalent, but it's close enough that you can't easily dismiss it.

For copyright purposes I think there is an important legal distinction between training data (fed in once, ahead of time, and can in theory no longer be recovered as-is) and context window data (stored exactly for the duration of the model call).

I'm not sure there should be, but I think there is.

A room "as clean" as the one under dispute (chardet) is very easy to replicate.

AI 1: reads the source, creates a spec + acceptance criteria

AI 2: implements from the spec

AI 1 is in the position of the maintainer who facilitated the license swap.
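
That split can be sketched as a tiny orchestration, assuming hypothetical `spec_model` and `impl_model` callables that wrap two independent AI sessions with no shared context:

```python
def clean_room(source_code: str, spec_model, impl_model) -> str:
    """Two-stage 'clean room': only the first model ever sees the source."""
    # Stage 1: AI 1 reads the original and distills its behavior into a spec
    # plus acceptance criteria; no code is supposed to cross the wall.
    spec = spec_model(
        "Describe the observable behavior of this program as a spec with "
        "acceptance criteria. Do not quote or paraphrase any code:\n" + source_code
    )
    # Stage 2: AI 2 implements purely from the spec, never seeing the source.
    return impl_model("Implement a program satisfying this spec:\n" + spec)
```

Whether stage 1 genuinely cleans the room or just carries the infringement across the wall is exactly the unsettled question; the "do not quote any code" instruction is a hope, not a guarantee.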

> Note that you'll need to have an AI generate assets or something since you can't just reuse textures and models.

As far as I know, you can, as long as you own a copy of the original. In other words, you can't redistribute the assets, but you can distribute the code that works with them. This is literally how every free/libre game remake works. The copyright of your new, from-scratch code is in no way linked to that of the assets.

"Behavior, graphics, etc." would likely constitute separate IP from the code. I am not sure there's a model that allows you to make AI reproduce Minecraft without telling it what "Minecraft" is - which would likely contaminate it with IP-protected information.

I’ve often thought that the key to fighting this is through this exact method: turn the tool against them.

I think it will become interesting when AI is able to decompile binaries.

Decompiling binaries is easy when they are C# or Java, even before AI. C# is a Microsoft language, and C# games have thriving mod communities with deep hooks into the core game, and detailed documentation reverse-engineered from the binary.

The big question is: if copyrighted material was used in the training material, is the LLM's output copyright infringement when it resembles the training material? In your example, you are taking the copyrighted material and giving it to the LLM as input and instructing the LLM to process it. Regardless of where the legal cards fall, this is a much less ambiguous scenario.

There are a couple of different issues here that all get mangled together. If you're producing effectively the same expression, that's infringement. If you draw Captain America from memory, it's still Captain America, and therefore infringement. If you draw Captain Canada by tracing around Captain America, that's also infringement, but of a different type.

When it comes to software, again it's the expression that matters -- literally the actual source code. Software that does the same thing but uses entirely different code to do it is not the same expression. Like with the tracing example above, if you read the original source code then it's harder to claim that it isn't the same expression. This is why clean room implementations are necessary.

I think Disney ran into this with people generating Marvel characters, etc.

This is the question of the hour. Imagine using this LLM proxy to license-strip major parts of the leaked Windows source code to produce code for WINE.

On top of all of this, there are the attempts at binary decompilation using LLMs and other new tools that have been discussed on this site recently.

This was not about legality.

> That question is this: does legal mean legitimate?

Just because something is legal does not mean it's the moral thing to do.

This question should've been posed earlier, when the first LLMs were being trained. Many people chose to ignore it, and now, several distillation epochs later, it is no longer a question that matters, as both yes and no are true, and not true.

Is it legitimate for millions of people to exploit and expound on knowledge that was, perhaps, not legitimate to use to begin with? Well, they already did; who's to judge the commons now?

What a ridiculous take. Many people loudly raised the question and objected to the practice from the beginning, but a handful of companies ignored the objections and ran faster than the legal system. If they were in the wrong, legally or morally, they still deserve to face repercussions for it.

It is a take, ridiculous or not. The fact that you rage against it implies it's not as improbable as you may want it to be. Besides, ridiculousness is a very subjective matter, right? Many things look super ridiculous in 2026 from a 2020s perspective, and this just piles on top.

To me it is superbly ridiculous to shun the comment, though. But we'll be having this split for a while, that's for sure.

You will probably run into design patents.

Software patents are not a thing in the EU.

They very much are. "Software programs, as such" are exempt under EPC Article 52. However, if the program interacts with the world - if it has a "further technical effect" - it is patentable.

https://en.wikipedia.org/wiki/Software_patents_under_the_Eur...

They absolutely are. That's a myth.

But also software patents and design patents are totally different things.

They might not care. Products win not by quality or features but by advertisement, hype and network effects.

The original implementation would still have the upper hand here. OTOH, if I as a nobody create something cool, there's nothing stopping a huge corporation from "reimplementing" (= stealing) it and using their huge advertising budget to completely overshadow me.

And that's how they like it.

Given how hard companies like Nintendo and Microsoft have been taking down leaks or fan creations, it seems they very much do care about keeping this stuff locked down.