> If you somehow actually randomly produce the same code without a reference, it's not a copy and doesn't violate copyright.
I don't believe this, and I doubt that the sense of copying in copyright law is so literal. For instance, if I generated the exact text of a novel by looking for hash collisions, or by producing random strings of letters, or by hammering the middle button on my phone's autosuggestion keyboard, I would still have produced a copy and I would not be safe to distribute it. There need not have been any copy anywhere near me for this to happen. Whether it is likely or not depends on the technique used - naive techniques make this very unlikely, but techniques can improve.
It is also true that similarity does not imply copying - if you and I take an identical photograph of the same skyline, I have not copied you and you have not copied me; we have just fixed the same intangible scene into a medium. The true subjective test for copying is probably quite nuanced, and I am not sure whether it is triggered in this case, but I don't think "clean room LLMs" are a panacea either.
> dirty phase produces a specification ... it is NOT defined as producing an API
This does not really sound like "the opposite of correct". APIs are usually not copyrightable, though the truth is of course more complicated. If you are happy to replace "API" with "uncopyrightable specification", then we can probably agree and move on.
> it's probably not as dramatic as you think it is
In reality I am very cynical and think nothing will come of this, even if there are verbatim snippets in the produced code. People don't really care very much, and copyright cases that aren't predicated on millions of dollars do not survive the court system very long.
> I don't believe this, and I doubt that the sense of copying in copyright law is so literal.
It is actually that literal, really.
> For instance, if I generated the exact text of a novel by looking for hash collisions,
This is a copyright violation because you're using the original to construct the copy. It's not a pure RNG.
> or by producing random strings of letters,
This wouldn't be a copyright violation, but nobody would believe you.
> or by hammering the middle button on my phone's autosuggestion keyboard, I would still have produced a copy and I would not be safe to distribute it.
This would probably be a copyright violation.
You probably think that this is hypothetical, but problems like this do actually go to court all the time, especially in the music industry, where people try to enforce copyright on melodies that have the informational uniqueness of an eight-word sentence.
> APIs are usually not copyrightable,
This was commonly believed among developers for a long time, but it turned out to not be true.
> This does not really sound like "the opposite of correct".
The important part is that information about the implementation can absolutely be in the spec without necessarily being copyrightable (and in real-world clean room RE, you end up with a LOT of implementation details). You were saying the opposite: that it was a spec of the API as opposed to a spec of the implementation.
> I don't believe this, and I doubt that the sense of copying in copyright law is so literal.
What color are your bits? That's all the law cares about.
The first sentence is the title of an essay.