I can't help but feel like this is a huge win for Chinese AI. Western companies are going to be limited in the amount of data they can collect and train on, and Chinese (or any foreign AI) is going to have access to much more and much better data.
The West can end the endless pain and legal hurdles to innovation by limiting copyright. It can do so if there is the will to open up the gates of information to everyone. The duration of 70 years after the death of the author or 90 years for companies is excessively long. It should be ~25 years. For software it should be 10 years.
And if AI companies want recent stuff, they need to pay the owners.
However, the West wants to infinitely enrich the lucky old people and companies who benefited from the lax regulations at the start of the 20th century. Their people chose not to let the current generations acquire equivalent wealth, at least not without the old hags getting their cut too.
The vast majority of books don't generate any profits past the first few years, so I prefer Lawrence Lessig's proposal of copyright renewal at five-year intervals with a fee. Under this scheme, most books would enter the public domain after five years.
https://www.econlib.org/library/Columns/y2003/Lessigcopyrigh...
Lessig: Not for this length of time, no. Copyright shouldn’t be anywhere close to what it is right now. In my book I proposed a system where you’d have to renew after every five years and you get a maximum term of 75 years. I thought that was pretty radical at the time. The Economist, after the Eldred decision, came out with a proposal—let’s go back to 14 years, renewable to 28 years. Nobody needs more than 14 years to earn the return back from whatever they produced.
Lessig’s proposal is excellent. A long time ago I wrote 10 books for publishers like McGraw-Hill, J Wiley, Springer-Verlag, etc.
For many reasons I switched to writing using a Creative Commons license using Lulu, LeanPub, and my own web site for distribution. This has been a win for me economically, it feels good to add to the commons, and it is fun.
Won't people just wait 5 years to buy the book?
I think western companies will be just fine -- Anthropic is settling because they illegally pirated books from LibGen back in 2021 and subsequently trained models on them. They realized this was an issue internally and pivoted to buying books en masse and scanning them into digital formats, destroying the original copies in the process (they actually hired a former lead in the Google Books project to help them in this endeavor!). And a federal judge ruled a couple months ago that training on these legally-acquired scanned copies does not constitute fair use -- that the LLM training process is sufficiently transformative.
So the data/copyright issue that you might be worried about is actually completely solved already! Anthropic is just paying a settlement here for the illegal pirating that they did way in the past. Anthropic is allowed to train on books that they legally acquire.
And sure, Chinese AI companies could probably scrape from LibGen just like Anthropic did without getting in hot water, and potentially access a bit more data that way for cheap, but it doesn't really seem like the buying/scanning process really costs that much in the grand scheme of things. And Anthropic likely already has legally acquired most of the useful texts on LibGen and scanned them into its internal library anyways.
(Furthermore, the scanning setup might actually give Anthropic an advantage, as they're able to digitize more niche texts that might be hard to find outside of print form)
It's easier for one company to digitize and sell/share than for many companies to do it individually.
Western companies will be fine, but sharing data in ways that would be illegal in the US does help other companies outside the US.
>And a federal judge ruled a couple months ago that training on these legally-acquired scanned copies does not constitute fair use -- that the LLM training process is sufficiently transformative.
You mean does constitute fair use?
yep, silly me
This isn’t a race to the bottom. They could have bought these books instead of pirating them.
It's naive to think Chinese models have a free pass. Local censorship, language/data biases, and export restrictions cut both ways.
No, it’s the other way around: it’s naive to assume they don’t have a free pass.
But most marginal training of Anthropic, OpenAI and Google models is done on LLM-paraphrased user data from those platforms. That user data is proprietary and obviously way more valuable than random books.
Good. If AI is such a great thing, why wouldn't we want the 1.4 billion Chinese to have it also?
True enough, but training on synthetic data now seems to be pushing SOTA.