> In looking at the code that the LLMs have produced for the project, especially given the pretty massive and widespread architectural changes needed to make the implementation libified and memory safe, we decided that the codebase is not a derivative work that would require carrying forward the GPL license and have decided to release the code under the MIT instead.

Hmm. That's going to be interesting.

Well, there's lots of really interesting opinions here from a lot of armchair lawyers.

To clarify, my stance on this is that the reimplementation did not copy protected expressions (JPlag reports less than 1.8% max similarity between the codebases), it's done in good faith, and it's what's best for the broader Git ecosystem (assuming Grit even becomes usable, which it's currently not purported to be).

From a copyright standpoint, however, only the first argument there is relevant. Grit is an independently authored implementation of Git-compatible behavior, with negligible similarity to Git source code.

I think antirez summarized the situation quite well and I broadly agree with his position: https://antirez.com/news/162

I think that those in the community who know me and have worked with me in the Git and open source communities for the last 20 years know that my intentions are to contribute, share and foster innovation and learning. Many of the main authors of the Git source code are friends of mine and I have no intention to steal anything from anyone, only to make their great ideas more broadly useful.

Hey AI, please change my stolen code in a non-breaking way so that JPlag reports less than 1.8% similarity.

I mean, "hey artist, take this stolen character and make them legally distinct" is already a common thing.

There are even exact measurements to take into account for visual art, music, etc.: 'what is legally not stealing'.

Art, however, is a little different from code. Code is a thing, but it also produces things.

It weirds me out that there is a measure of code similarity but not a measure of whether code is semantically the same. For example, implementing a protocol could be done in many ways, but ultimately what's exchanged between clients and servers on the network is the same. So it's semantically the same despite being totally different code.
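
To make that concrete, here is a minimal sketch using Git's pkt-line framing (four lowercase hex digits giving the total length, prefix included, followed by the payload). The two function names are made up for illustration and come from neither Git nor Grit, and the sketch assumes payloads small enough to fit the four-digit length. The two bodies share almost no text, yet they put identical bytes on the wire:

```rust
/// Encode one payload as a pkt-line using the standard formatter.
fn pkt_line_a(payload: &[u8]) -> Vec<u8> {
    // Length counts the 4-byte hex prefix itself plus the payload.
    let mut out = format!("{:04x}", payload.len() + 4).into_bytes();
    out.extend_from_slice(payload);
    out
}

/// Same wire format, written differently: hex digits are produced by hand.
fn pkt_line_b(payload: &[u8]) -> Vec<u8> {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    let total = payload.len() + 4;
    let mut out = Vec::with_capacity(total);
    // Emit the four nibbles of the length, most significant first.
    for shift in [12, 8, 4, 0] {
        out.push(HEX[(total >> shift) & 0xf]);
    }
    out.extend(payload.iter().copied());
    out
}
```

A textual similarity tool would score these two as quite different, while a byte-for-byte comparison of their output cannot tell them apart.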

A translation of a book to a different language is a derivative work. So a translation of a computer program to a different programming language is also one. But if in the translation of the book you start altering the plot and the personalities of the characters, does it at some point stop being a derivative work? At what point? IANAL, and I have no real idea, but I imagine that point has been probed significantly in case law with respect to creative works. Given the current climate of ever-expanding scope of "intellectual property", if they admit that the LLM had access to git source code then I would say their case is weak at best.

> translation.

It's not technically a translation, it's a re-implementation, with test suites acting as the destination. If it was a file by file translation your argument would have been valid.

Git is part of the LLM's training set though, so simply asking it to recreate git in another language is pretty equivalent. Like, you can almost certainly get these LLMs to output Git's full source code with some prompting, so there's not that much difference (as much as we like to pretend that AI generated code has no copyright implications)

That's something I have been wondering. If I as a human want to make a clean room reimplementation of some API or application, I must not have read the source code of the original implementation. I don't see why this shouldn't apply to LLMs as well. If an LLM might have been trained on the original source code, it should be considered "tainted".

Yes, and realistically any code that LLMs produce is a derivative work of its training data. There's going to be a huge disaster licensing wise

I have absolutely no idea how LLMs got through anyone's legal departments, I guess the hope is that if everyone breaks the law enough, it'll just be fine

Problem is there's a lot more than a single repo in training data, the corpus is massive... Should the author of a blog post on cats also be compensated for simply being in the same training data as the git repo?

> If I as a human want to make a clean room reimplementation of some API or application, I must not have read the source code of the original implementation.

That is the difference between necessary and sufficient. Clean-room is sufficient to guarantee avoiding copyright infringement, but it is not necessary. The legal line is south of there, but that position was chosen because they didn't want to risk crossing it, and it was easier to argue for in court.

tl;dr: clean room is overkill for avoiding copyright infringement

> Like, you can almost certainly get these LLMs to output Git's full source code with some prompting, so there's not that much difference (as much as we like to pretend that AI generated code has no copyright implications)

Are you sure? LLMs are in some way a compressed version of their input but it's a pretty lossy compression (arguably this makes them more like a compression algorithm than a compressed version of the data). I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.

> I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.

Granted, these are some of the most widely spread texts, and not codebases, but just fyi: https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

[deleted]

Yes, but as soon as copyright became a problem for very rich people parts of it were cancelled.

1) re-implementation for compatibility (which was quickly "reestablished" through use of copyright-protecting encryption. In other words: do you get to write software that connects to MS/Apple/Google/Facebook servers without authorization from those companies? Yes. Do you get to copy an encryption key from their software to make it possible? No)

and, more recently,

2) violating copyright for LLM training

and, currently mostly attempted:

3) "uncopyrighting": run software through an LLM, and some people "believe" it comes out with your copyright on it! Because very rich people want to sell uncopyrighting.

I.e., the jury's still out on what will happen when it's billionaire vs. billionaire.

Of course, the question is what happens the second someone does this with a disney movie, or a big microsoft application ...

> Yes, but as soon as copyright became a problem for very rich people parts of it were cancelled.

When copyright law was established, not many poor people owned printing presses. That is to say, copyright law is a PROTECTION to the very rich, not an inconvenience

they would be just wrong. I hope someone with standing sues

I don't think it's that clear cut. The functional parts probably aren't copyrightable, only the stylistic ones. It's going to be a mix of courts applying laws in new ways that haven't been tried before and fact-specific questions about what actually persisted through the LLM, if it goes to court.

I'd be fascinated to see what happens if it does. Both in the analyses that we'd get of what the LLM did to the codebase and on the legal decisions on what the copyrightable creative elements in code actually are.

If I was the author though... there would be no way that I would be volunteering to be a test case like this. Also seems just rude for no reason.

It probably would have been less bad if he had chosen MPL-2.0 or LGPL-2.1-or-later. But he chose MIT, which cuts at the core of the intent of licensing the project with a share-alike license.

Tell me, can I create a copyrighted video that's not GPL licensed using ffmpeg? Now tell me how creating a rust library using the git test suite is different?

> using the git test suite

That's not actually the case at hand here - the agents were given the original source to reference: https://github.com/gitbutlerapp/grit/blob/main/AGENTS.md#sou...

But for the sake of argument: The test suite itself is copyrighted. To the extent the resulting work is a derivative of the test suite it is possibly infringing. For example, you might expect that the agent would derive variable names, function names, and the structure, sequence, and organization of the code from the test suite. It might even copy comments wholesale. Those are copyrightable things. (Which is of course just the first step in analyzing whether it is infringement; there would be interesting fair use, de minimis copying, etc. arguments following a conclusion that any of those were copyrighted. A product produced this way definitely could be infringing given the right facts though.)

> That's not actually the case at hand here - the agents were given the original source to reference: https://github.com/gitbutlerapp/grit/blob/main/AGENTS.md#sou...

yeah fair - the "The canonical Git source code we're targeting to replicate the functionality of is in the git/ subdirectory." part makes this hard to argue against.

> To the extent the resulting work is a derivative of the test suite it is possibly infringing

It's this bit that I have a problem with. If I run the test, it fails and reports a failure. Now I write code and run the test again. What is the theory under which the code that I wrote infringes?

Simplify this down:

Assume the following is copyrighted:

    fn test_sum() {
        assert_eq!(sum(1, 1), 2);
    }
Does writing the following code:

    fn sum(a: u8, b: u8) -> u8 {
        a + b
    }
infringe on the test copyright?

Writing

    fn sum(a: u8, b: u8) -> u8 {
        a + b
    }
Doesn't infringe upon copyright period, because there's no creative element in that work.

Imagine a more substantial example though. Perhaps you have a test that checks that some file written in a binary format is correct, and gives names (creative elements) to each field of the format that it prints when you mess up the field, and has comments describing why the bytes are laid out like they are (the comments being copyrightable even if the facts they describe aren't), and the LLM copies those field names and comments verbatim... Now it's quite likely that the LLM's work is a derivative of the test suite.
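
A sketch of what that could look like, using the 12-byte header of Git's index file as the format (the `DIRC` signature, then a big-endian version, then a big-endian entry count). Everything named here (`IndexHeader`, `parse_index_header`, the field names, the error strings) is invented for illustration and is not taken from Git's tests or from Grit. The byte layout is a fact anyone can reimplement; the names, messages, and comments are the parts an author chooses:

```rust
/// Hypothetical names: the layout below is dictated by the file format,
/// but what the fields are called is an authorial choice.
#[derive(Debug, PartialEq)]
struct IndexHeader {
    version: u32,
    entry_count: u32,
}

/// Parse the fixed 12-byte header at the start of an index file.
fn parse_index_header(bytes: &[u8]) -> Result<IndexHeader, String> {
    if bytes.len() < 12 {
        return Err(format!("header too short: {} bytes", bytes.len()));
    }
    // Bytes 0..4 are the signature.
    if &bytes[0..4] != b"DIRC" {
        return Err("bad signature: expected DIRC".to_string());
    }
    // Bytes 4..8 and 8..12 are 32-bit integers in network byte order.
    let version = u32::from_be_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    let entry_count = u32::from_be_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]);
    Ok(IndexHeader { version, entry_count })
}
```

Two independent authors would both end up reading four bytes, then four, then four. Whether they would both pick the same identifiers, error wording, and comments is where copying becomes visible.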

> Doesn't infringe upon copyright period, because there's no creative element in that work.

There's likely a threshold at some point. It's helpful to look at a minimum and then continue from there though.

I'm curious if there's case law that supports your assertions here?

For that assertion in particular I believe I'm practically parroting a ruling by the district court in Oracle vs Google about some extremely simple Java functions that Oracle claimed Google copied. Though I can't say I checked to make sure I'm remembering right.

You're recalling it right, but there's a nice quote from Judge Alsup in that case that talks about this exact situation:

> “So long as the specific code used to implement a method is different, anyone is free under the Copyright Act to write his or her own code to carry out exactly the same function or specification...”

Here, given that this is Rust and the original expression is C, the implementations cannot be the same by definition.

That's essentially the same thing as modding a game, though. I know there have been lawsuits to stop modding, but I don't think any were successful.

If you did it in a loop until the test passed, maybe?

Your result is essentially impossible without the original. With ffmpeg, your result does not depend on ffmpeg specifically - you can use any video creation tool.

A GPL tool that processes data doesn't virally transfer the license to its output. Copyrighted ffmpeg code isn't incorporated into the video output. The LLM didn't just conjure up equivalent behavior to git without ingesting the code and transforming it as new output. There is no other behavioral description that would reproduce all needed functionality.

Medium, substitutability, basics of copyright law.

Fair point on medium - this was a lazy example.

Substitutability probably doesn't apply here in the way you're implying, and if it did it would likely be hampered by the 9th Circuit's findings about transformation in Sony v. Connectix. Arguments here would likely look at Rust not having a stable ABI, and hence not being inherently substitutable as a library (grit-lib); it's less clear as an executable (grit-cli).

basics of copyright law - the fundamental thing being protected is the expression... is a rust program's expression the same expression as a c program? I'd say generally not.

The test suite could test aspects of the architecture/design of the codebase that are not necessary for interoperability and constitute novel expression of a piece of software in a way that is not at all language specific.

By definition a test suite is about testing interoperability with the test suite. An HTTP test suite should likely test for whether response code 418 is implemented a particular way and while humorous it would still be an interop test no?

No, the git test suite is about testing the git codebase. If you want something like that, you need a conformance suite, which does not exist for git.

If feeding the source code through a compiler yields a derivative work, why wouldn't feeding it to an LLM give the same result?

Because compilers and LLMs do different things, and what is done matters, so you can't reason by stepping from one to the other.

Compilers don't axiomatically yield derivative works, they simply in practice do because for non-trivial programs they preserve copyrightable elements of the work in the output.

So, if we will compile or decompile code using LLM instead of a compiler, then we can use the result for free?

(LLM can translate code to/from other code or to/from a machine code).

Well compilers are a mechanical transformation and if that were sufficient to free you of IP law then IP law wouldn't work.

An LLM is also a computer program which takes input and produces output related in some way to that input. However I don't think most people would view it as a "mere" mechanical transformation. One could tautologically argue that an LLM blends the user input with the training inputs which is a sort of transformation and further that the LLM itself is a computer program thus it is mechanical in nature. However it should be immediately obvious that such an overly literal interpretation is in danger of subsuming human work as well. Where the boundary lies is an unanswered question.

Related, compilers can pose a problem depending on what the output includes. For example common lisp compilers that aren't under a permissive license are a minefield because regardless of what anyone might say the image that gets output includes (approximately) the full language implementation verbatim in addition to the user's program.

Functional parts not being copyrightable means that you can't claim a program is a copyright violation just because it does the exact same thing for compatibility reasons (you can copy what the program does). E.g. git stores refs in .git/refs, so does grit, and that's not a violation. You still can't copy the program.
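
As a rough illustration of how little room for expression there is in that kind of compatibility detail, here is a minimal sketch of locating a loose branch ref on disk. The function name is hypothetical and not from Git or Grit, and it deliberately ignores packed refs and other ref storage backends:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: path of a loose branch ref under a Git directory.
/// The `refs/heads/<branch>` layout is the compatibility fact; any
/// interoperable tool has to arrive at the same path.
fn loose_branch_ref_path(git_dir: &Path, branch: &str) -> PathBuf {
    git_dir.join("refs").join("heads").join(branch)
}
```

Calling `loose_branch_ref_path(Path::new(".git"), "main")` yields `.git/refs/heads/main`, which is what any compatible implementation would have to produce regardless of who wrote it.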

Yes... and now we get to the fact specific question of "did they copy the program". Or actually the answer to that is plainly "no" - they made something similar from it - and didn't run ctrl-c ctrl-v in an unlicensed manner, but "did they copy the relevant facets of the program into the new similar thing".

Making something similar is copying for the purpose of copyright law. If I trace over a Disney character it's still copyright Disney.

No. You're allowed to make a similar tool, the functional elements are not copyrightable. There's a long history, predating LLMs by many decades, of doing this in the software industry.

My use of the word "similar" does not imply here that I think it's obvious that they are "similar" in any copyrightable elements - whether they are or not is one of the interesting questions I think this case would have to resolve.

Incidentally you're also allowed to make similar creative elements so long as they aren't copies and you did so independently... which could actually come up in a case like this (imagine the LLM produced a similar function to some function in the original... but the original wasn't in the context window at the time. Not at all unlikely with code where there often is only one or two natural ways to write something).

I suspect that the issue is more likely that the LLM code doesn't have an author and hence some parts of it can't be licensed; it's less likely that it's infringing on git's copyright, for various reasons. (I am not a lawyer, but I do read copyright law for funsies.)

https://www.copyright.gov/newsnet/2025/1060.html

> It concludes that the outputs of generative AI can be protected by copyright only where a human author has determined sufficient expressive elements. This can include situations where a human-authored work is perceptible in an AI output, or a human makes creative arrangements or modifications of the output, but not the mere provision of prompts.

Well that's interesting.

Also "just" the legal opinion of a government office. It has yet to be tested in court

why wouldn't it? If you run git through a compiler it's still copyright the git devs, same if you run it through an LLM.

What makes you think that's what the article says that it did? There's a lot of specific nuance and it doesn't say that anywhere. In fact it speaks of making a test suite pass only. This is the classic cleanroom bios from specs approach but no need to extract it as the test is available to run and there's nothing in the GPL that suggests that running a test suite infects software that you run it on.

Surely git’s source is already in the LLM’s training corpus. So this is far from a clean-room approach.

Not a fan of this trend of "cleaning" GPL licensed software and releasing under permissive licenses. Also why I'm not a fan of UUtils nor Canonical's early adoption of it in Ubuntu.

The intent here is extraction of all the value provided by copyleft projects without the obligation to give back. Whether it's technically legal or not, it's disgusting behavior IMO.

That’s explicitly not what’s happening with uutils; they have contributed fixes and test cases back to upstream

And just like that, it was forked by Microsoft a few days ago. Handed to them on a silver platter.

> Not a fan of this trend of "cleaning" GPL licensed software

> Whether it's technically legal or not, it's disgusting behavior IMO.

GNU was originally developed to "clean" UNIX from the AT&T license.

An idea...

Take this (assuming it's not slop), relicence as GPL, submit upstream (imagine it's accepted for a moment...).

If they then proceed with license washing from the Rust version, it's certainly a derived work.

This is not a proper black-box reimplementation, I doubt they can get away with that. And that's not mentioning all other obvious ethical concerns of course.

black-box/clean-room isn't necessarily required, though. It does make it a lot harder to argue in court, of course.

I don't care if they can convince a judge. The fact that they even want to in the first place tells me what kind of people they are.

F-ing scumbags. It's already free, but they still decide to steal it.

I'm not a copyright lawyer, but it seems pretty clear to me you can't wash a license using an LLM.

[US jurisdiction]: Anything in the result written by the LLM cannot be copyrighted by anyone.

Anything in the result written by a human can be, and if it was all emitted by the LLM then that portion originally written by a human carries its own copyright.

As a work of an LLM, the entirety presumably cannot be copyrighted at all. Portions written by humans presumably carry their original copyright.

> [US jurisdiction]: Anything in the result written by the LLM cannot be copyrighted by anyone.

This is a bit stronger than what the actual report finds. See part 2 in https://www.copyright.gov/ai/ for details, but TL;DR, parts where humans have control over the expression may be copyrightable. But working out which parts those are is likely a difficult question (it would likely require proof of provenance across many of those LLM sessions).

Knowing what you don't know is such an important skill in life and your career. And I 100% agree with you that the author is, well, off their rocker.

Let me give an example: I could take Goldeneye from the N64, extract the binary and then run it through an LLM to disassemble it and possibly rewrite it in a modern higher-level language. Do you think Nintendo would look at that and say "well, he did a lot of work so he's escaped our license"? Of course not. It's just silly.

Ingesting the source code and producing output in another language is quite clearly a derivative work. You don't need to be an IP lawyer to figure that out.

Now, if you went to Claude and gave it documentation and told it to produce something that was compatible, would that be a derivative work and thus covered by the GPL? I would guess probably. But I'm not 100% sure anymore. I wouldn't risk it however.

Here's another thought experiment: what if someone takes this supposedly MIT licensed source tree, plugs it into another LLM and asks it to produce the output in C? Now how is it licensed? It might be very similar. After all, there are only so many ways to produce a SHA1 hash and so many ways to do a command line parser.

But this then makes it an interesting legal issue. In the Oracle v. Google court case, this was a key issue. Google successfully argued there's only so many ways to write a loop so just because a loop is similar to the source, that doesn't mean it's copyright infringement (as Oracle argued).

Anyway, it's a crazy position to take.

> Knowing what you don't know is such an important skill in life and your career. And I 100% agree with you that the author is, well, off their rocker.

They aren't the only ones - look at the number of people in this thread who are arguing that this is analogous to producing a movie with ffmpeg - just because ffmpeg is GPL, does not make your movie GPL.

I am struggling to understand how such a high level of cognitive dissonance is possible: They believe both a) that the license can be laundered in this manner, and that b) the license they put on the result is effective!

Well that is already how it is done with numerous multi-decade open rewrites of closed games. They usually require the asset pack.

I don't know how this squares with law, but Oracle v Google gave a very valuable judgment to the public that an API is not copyrightable. If we take the LLM out of it, that's all we are talking about in the pure case.

Of course, we can't take the LLM out, but it is the starting point.

> Well that is already how it is done with numerous multi-decade open rewrites of closed games

Serious such rewrites don't start with the code of the closed game!

> I don't know how this squares with law, but Oracle v Google gave a very valuable judgment to the public that an API is not copyrightable. If we take the LLM out of it, that's all we are talking about in the pure case.

Not at all. The LLM used to write grit has seen the git code. That is what we're talking about here.

> Of course, we can't take the LLM out, but it is the starting point.

The LLM isn't the important thing. The important thing is that the git source code was used to make grit.

>Serious such rewrites don't start with the code of the closed game!

No, but they often involve reverse engineering the binary pretty heavily.

heh - https://github.com/n64decomp/007

game decompilation and emulation is as old as computing