Most people still don't realize that general public world knowledge is not really a test for a model that was trained on general public world knowledge. I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data, despite what publishers and authors may think of that. As a matter of fact, with all the special deals these companies make with publishers, it is getting harder and harder for normal users to come up with validation data that only they have seen. At least for human written text, this kind of data is more or less reserved for specialist industries and higher academia by now. If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
> I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data
No need for surprises! It is publicly known that the corpora of 'shadow libraries' such as Library Genesis and Anna's Archive were specifically and manually requested by at least NVIDIA for their training data [1], used by Google in their training [2], downloaded by Meta employees [3], etc.
[1] https://news.ycombinator.com/item?id=46572846
[2] https://www.theguardian.com/technology/2023/apr/20/fresh-con...
[3] https://www.theverge.com/2023/7/9/23788741/sarah-silverman-o...
also:
"Researchers Extract Nearly Entire Harry Potter Book From Commercial LLMs"
https://www.aitechsuite.com/ai-news/ai-shock-researchers-ext...
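The kind of extraction described in that headline can be approximated with a simple verbatim-memorization probe: feed the model a short prefix of a known passage and count how many tokens of the true continuation it reproduces exactly. A minimal sketch, assuming you have some text-generation function `generate(prompt) -> str` for the model under test (the function name and token-level matching are illustrative choices, not any specific researcher's method):

```python
def verbatim_overlap(generated: str, reference: str) -> int:
    """Count leading whitespace-delimited tokens on which the generated
    text matches the reference continuation exactly."""
    count = 0
    for g, r in zip(generated.split(), reference.split()):
        if g != r:
            break
        count += 1
    return count

def memorization_probe(generate, passage: str, prefix_tokens: int = 50) -> int:
    """Split a known passage into prefix/continuation, prompt the model
    with the prefix, and count verbatim-matching continuation tokens."""
    tokens = passage.split()
    prefix = " ".join(tokens[:prefix_tokens])
    continuation = " ".join(tokens[prefix_tokens:])
    return verbatim_overlap(generate(prefix), continuation)
```

A model that reproduces long runs of the exact continuation from a 50-token prefix has almost certainly seen that text during training rather than inferred it.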
The big AI houses are all involved in varying degrees of litigation (all the way to class action lawsuits) with the big publishing houses. I think they at least have some level of filtering for their training data to keep them legally somewhat compliant. But considering how much copyrighted material is blissfully spread around online, it is probably not enough to filter out the actual ebooks of certain publishers.
> I think they at least have some level of filtering for their training data to keep them legally somewhat compliant.
So far, courts are siding with the "fair use" argument. No need to exclude any data.
https://natlawreview.com/article/anthropic-and-meta-fair-use...
"Even if LLM training is fair use, AI companies face potential liability for unauthorized copying and distribution. The extent of that liability and any damages remain unresolved."
https://www.whitecase.com/insight-alert/two-california-distr...
> even proprietary content like the books themselves
This definitely raises an interesting question. It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files. What immediately came to mind were House of Leaves, Infinite Jest, Harry Potter, and basically any Stephen King book - they've all been posted at some point.
Do LLMs have a good way of inferring where knowledge from the context begins and knowledge from the training data ends?
> It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files
Anna's Archive alone currently claims to publicly host 61,654,285 books, more than 1 PB in total.
Maybe y’all missed this?
https://www.washingtonpost.com/technology/2026/01/27/anthrop...
Anthropic, specifically, ingested entire libraries of books by scanning physical copies and then disposing of them.
> If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
The plot of Good Will Hunting would like a word.