What if Copilot was already trained with Minecraft code in the dataset? Should be possible to test by telling the model to continue a snippet from the leaked code, the same way a news website proved their articles were used for training.

I feel as though the fact that you are asking a valid question shows how transformative it is; clearly, while the LLM gets a general ability to code from its training corpus, the data gets so transformed that it's difficult to tell what exactly it was trained on except a large body of code.

This would still be true of the case where you ask an LLM to rewrite a program while referencing the source. Unless someone was in the room watching or the logs are inspected, how would they know if the LLM was referencing the original source material, or just using general programing knowledge to build something similar.

Then the training itself is the legal question. This doesn't seem all that complicated to me.