Hacker News

>Something related, but different, happened with chardet. The current maintainer reimplemented it from scratch by only pointing it to the API and the test suite.

Only "pointing it". But the LLM, which can recite over 90% of a book in its training set verbatim *, would also have been trained on the original code.

Maybe "the slop of Theseus" is a better title.

* https://the-decoder.com/researchers-extract-up-to-96-of-harr...

logicprog 2 hours ago

Also from that exact same study (why not cite the actual study? It's quite readable): the LLMs couldn't recite more than a small fraction of many other books, often ones just as well known[0]. In fact, from the bar charts shown in the exact news article you cited, it's pretty clear that Sonnet 3.7 was a massive outlier, and so was Harry Potter and the Sorcerer's Stone, so it really seems to me like that's an extremely unrepresentative example. If all the other LLMs couldn't recite even a small fraction of all the other books except that one outlier pairing, despite them being widely reproduced classics, why would we expect LLMs to actually regurgitate regularly, especially a relatively unknown open source project that probably hasn't been separately reproduced that many times?

Not to mention that, as the other commenters point out, regurgitation appears to just... not have happened at all in this case, so it's a moot point.

[0]: https://arxiv.org/pdf/2601.02671

the_mitsuhiko 4 hours ago

Maybe, but the LLM did not recite the chardet source code so that argument does not appear to apply here.

coldtea 2 hours ago

To remind those not familiar with an old standard practice: for a remake of a product X to be "clean room" and avoid copyright issues, the developers working on it were traditionally required to never have seen X's source code.

My argument is that while you write that it was "only pointed" to the API, an LLM, which can recite over 90% of a book in its training set verbatim, would also have been trained on the original code (and can have it in "mind").

So "pointing it to the API" doesn't mean it ONLY used the API in its implementation. It could very well have drawn on whatever internal knowledge it has of the behavior, architecture, and choices of the original code, regardless of whether it recited or translated the original verbatim.

So when considering this, "AI was just pointed at the API" is a weaker claim than it appears to be.

4star3star 3 hours ago

I agree. If we look to music, how can a musician unhear what they've heard? We celebrate musicians when they cite their influences. In the case of a software library, it is a tool, not a work of art. Its beauty is in accomplishing a specific, useful task. If we can accept musicians drawing inspiration from all the music they've ever listened to, we should be able to do the same for software, especially when its internal code bears no recognizable resemblance to that of a similar tool.

coldtea 2 hours ago

>I agree. If we look to music, how can a musician unhear what they've heard?

Unlike with music, in software a (human) programmer could traditionally be chosen who hasn't "heard" (i.e. read) the original code. That has traditionally been called a "clean room" implementation (not to be confused with the software development process called "clean room").

irishcoffee 3 hours ago

This whole "today" fascination with chardet is a classic example of manipulation. I suggest you disregard this term instead of defending it.