> Blanchard is, of course, familiar with the source code, he's been its maintainer for years.

I would argue it's irrelevant whether they looked at the code or not, as well as whether he was or wasn't familiar with it.

What matters is that they fed the original code into a tool which they set up to make a copy of it. How that tool works doesn't really matter, and neither does it make a difference if you obfuscate that it's a copy.

If I blindfold myself while making copies of books with a book scanner + printer, I'm still engaging in copyright infringement.

If AI is a tool, that should hold.

If it isn't "just" a tool, then it did engage in copyright infringement (as it created the new output side by side with the original), in the same way an employee might do so on command of their boss. That still makes the boss/company liable for copyright infringement. And in general, just because you weren't the one who created an infringing product doesn't mean you aren't more or less as liable for distributing it as if you had created it yourself.

> that they fed the original code into a tool which they set up to make a copy of it

Well, no. They fed the spec (test cases, etc.) into a tool which made a new program matching the spec. That is not a copy of the original code.

But this also feels like arguing over the color of the iceberg while the Titanic sinks. If you have a tool that can write code to spec, what is the value of source code anymore? Even if your app is closed-source, someone can just tell Claude to write new code that does the same thing.

Blanchard fed the spec to the tool, and Anthropic fed the code to the tool, so Blanchard didn't do anything wrong, and Anthropic didn't do anything wrong. Nothing to see here.

> Blanchard fed the spec to the tool,

Yes...

> and Anthropic fed the code to the tool,

Presumably, as part of the massive amount of open-source code that must have been fed in to train their model.

> so Blanchard didn't do anything wrong, and Anthropic didn't do anything wrong. Nothing to see here.

This is meant as irony, right?

Everyone writes as if he just fed the spec and tests to Claude Code. Ignoring for now that the tests are under the LGPL as well, the commit history shows that this took two weeks of steering Claude Code toward the desired output. In every one of these interactions, the maintainer used his deep knowledge of the chardet codebase to steer Claude.

> if the actual text of the code isn't the same or obviously derivative, copyright doesn't apply at all.

What does derivative mean here? Because IMO it means that the existing work was used as input. So if you used an LLM and it was trained on the existing work, that's a derivative work. If you rot13-encode something as input, so you can't personally read it, and then a device applies rot13 to it again and outputs the result, that's a derivative work.
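To make the rot13 analogy concrete, here is a toy sketch (using Python's built-in `rot_13` text codec): obfuscating the input doesn't change what comes out the other end, since applying rot13 twice recovers the original text exactly.

```python
import codecs

# The commenter's analogy: rot13 the input so no human "reads" it,
# then the device rot13s it again and the original reappears.
original = "the quick brown fox"
encoded = codecs.encode(original, "rot13")   # unreadable to a person
decoded = codecs.encode(encoded, "rot13")    # applying rot13 again

assert decoded == original
print(decoded)  # prints "the quick brown fox"
```

The point of the analogy: the intermediate obfuscation is irrelevant to whether the output derives from the input.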

In order for it to be creatively derivative you would need to copy the structure, logic, organization, and sequence of operations not just reimplement the functionality. It is pretty clear in this case that wasn't done.

It's not clear at all.

As a cynical person, I assume all the frontier LLMs were trained on datasets that include every open source project. But as a thought experiment: if an LLM were trained on a dataset that included every open source project _except_ chardet, do you think said LLM would still be able to easily implement something very similar?

There is no doubt in my mind that it could still do it.

Of course, the problem with this interpretation is that all modern LLMs are derivatives of huge amounts of text under completely different licenses, including "All rights reserved", and therefore cannot be used for any purpose.

I'm not sure how you square the circle of "it's alright to use the LLM to write code, unless the code is a rewrite of an open source project to change its license".

> Of course, the problem with this interpretation is that all modern LLMs are derivatives from huge amounts of text under completely different licenses, including "All rights reserved", and therefore can not be used for any purpose.

> I'm not sure how you square the circle of "it's alright to use the LLM to write code

You seem like you're on the cusp of stating the obvious correct conclusion: it isn't.

> Because IMO it means that the existing work was used as input

That's your opinion (since you said "IMO"), not the actual legal definition.

LLMs do not encode or encrypt their training data. The fact that they can recite some training data is a defect, not the default. You can see this with simple arithmetic: treat the model's size as the output of a fantasy compression algorithm that is 50% better than SOTA, and you'll find you'd still be missing 80-90% of the training data, even if the model were as much of a stochastic parrot as you may be implying. The outputs of an AI are not derivative just because the training data included the original library.
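A back-of-envelope sketch of that size argument, with purely illustrative numbers (the corpus size, model size, and compression ratio below are assumptions, not measurements):

```python
# All figures are hypothetical, chosen only to illustrate the argument.
training_data_tb = 100.0          # assumed raw training text, in TB
model_size_tb = 1.5               # assumed size of the model weights, in TB
sota_ratio = 0.25                 # assume a SOTA text compressor keeps 25% of size
fantasy_ratio = sota_ratio * 0.5  # a fantasy codec 50% better than SOTA

# If the model were nothing but such a codec, how much data could it hold?
recoverable_tb = model_size_tb / fantasy_ratio
missing_fraction = 1 - recoverable_tb / training_data_tb
print(f"recoverable: {recoverable_tb:.0f} TB, missing: {missing_fraction:.0%}")
# prints "recoverable: 12 TB, missing: 88%"
```

Under these assumptions, even an impossibly good compressor could account for only a small fraction of the corpus, which is the commenter's point that wholesale memorization is not what the weights contain.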

Then onto prompting: 'He fed only the API and (his) test suite to Claude'

This is Google v Oracle all over again - are APIs copyrightable?

> LLMs do not encode or encrypt their training data. The fact that they can recite some training data is a defect, not the default.

About this specific point, it is unclear how much of a defect memorization actually is - there are also reasons to see it as necessary for effective learning. This link explains it well:

https://infinitefaculty.substack.com/p/memorization-vs-gener...

> This is Google v Oracle all over again - are APIs copyrightable?

Yes, this is the best way to ask the question. If I take a public-facing API and reimplement everything, whether by human or machine, that should be sufficient. After all, that's what Google did, and it's not like their engineers never read a single line of the Java source code. Even in "clean room" implementations, a human might still have remembered or recalled a previous implementation of some function they had encountered before.

I find the "compression" argument not very strong, both because copyright still applies to (very) lossy codecs (e.g. your 16kbps Opus file of Thriller infringes, even if the original 192kHz/32-bit WAV file was 12,000kbps), and because copyright still applies to transformed derivative works (a tiny MIDI file of Thriller might still be enough for Jackson's label to come after you).

See also: https://monolith.sourceforge.net/, which seeks to ask the question:

> But how far away from direct and explicit representations do we have to go before copyright no longer applies?

Copyright protects even very abstract aspects of human creative expression, not just the specific form in which it is originally expressed. If you translate a book into another language, or turn it into a silent movie, none of the actual text may survive, but the story itself remains covered by the original copyright.

So when you clone the behavior of a program like chardet without referencing the original source code, except by executing it to make sure your clone produces exactly the same output, you may still be infringing its copyright if that output reflects creative choices made in the design of chardet that aren't fully determined by the program's functional purpose.

If you pirate a movie and reencode it, does that apply as well? You can still watch the movie and it is “obviously” the same movie, even though the bytes are completely different. Here you can use the program and it is, to the user, also the same.

> If it isn't "just" a tool, then it did engage in copyright infringement

Copyright infringement is a thing humans do. It's not a human.

Just like how the photos taken by a monkey with a camera have no copyright. Human law binds humans.

Correct. The human who shares the copy is the one who engages in copyright infringement.

So, let's say that rather than actually touching any copyrighted material, a human merely tells an AI about how to go onto the internet and find copyrighted material, download it, and ingest it for training. The AI, fully autonomously, does so, and after training itself on the material deletes it so no human ever downloads, consumes, or shares it.

If we are saying AI is "more than a tool", which seems to be the direction courts are leaning since they've ruled AI output without direct human involvement is not copyrightable[0], then the above seems like it would be entirely legal.

[0] https://www.copyright.gov/newsnet/2025/1060.html

Someone would likely get prosecuted if they instructed an AI agent to run, say, a pump-and-dump scheme...

Even if the final output doesn't have copyright protection, it might still be a copyright violation. It seems reasonable that a work could infringe copyright when distributed even if it is not itself eligible for copyright.