I think western companies will be just fine -- Anthropic is settling because it illegally pirated books from LibGen back in 2021 and subsequently trained models on them. They realized internally that this was a problem and pivoted to buying books en masse, scanning them into digital form, and destroying the original copies in the process (they even hired a former lead of the Google Books project to help with the effort!). And a federal judge ruled a couple of months ago that training on these legally acquired scanned copies does constitute fair use -- that the LLM training process is sufficiently transformative.

So the data/copyright issue that you might be worried about is already resolved! Anthropic is just paying a settlement here for the earlier piracy; it remains free to train on books it legally acquires.

And sure, Chinese AI companies could probably scrape LibGen just like Anthropic did without getting in hot water, and get a bit more data cheaply that way, but the buying-and-scanning process doesn't seem to cost all that much in the grand scheme of things. Anthropic has also likely already legally acquired and scanned most of the useful texts on LibGen into its internal library anyway.

(Furthermore, the scanning setup might actually give Anthropic an advantage, since it lets them digitize niche texts that are hard to find outside of print form.)

It's easier for one company to digitize and sell/share than for many companies to do it individually.

Western companies will be fine, but sharing data in ways that would be illegal in the US does help other companies outside the US.
