Is it really the case that companies like OpenAI and Anthropic will repeatedly visit this archive and slurp it all up each time they train something? Wouldn’t that just be a one-time thing (to get their own copy), with maybe the odd visit to get updates? My take is that the article is about monetizing unique training info, and I see them being paid maybe 10-20 times a year by folks building LLMs, which is maybe nothing and maybe $$$$. I don’t know.
Not a doctor, but in Anthropic's case they bought actual books and scanned them rather than using pirated versions. For digital versions from a vendor that were found to be in violation of the ToS, they paid to settle the issue. https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settl...