> unimaginable levels of copyright infringement
This isn't how copyright works. The models don't encode the original works wholesale; they are substantive transformations of them. That said, you as a user can still use the models and weights to infringe a copyright.
There have been some US cases about this, but it isn't generally settled internationally. "Fair use" is a US-specific doctrine, and even in the US there are ongoing cases.
Paper about how weights are a derivative work of the training data: https://arxiv.org/abs/2407.13493
Lawsuits currently in progress over AI copyright: https://informationisbeautiful.net/visualizations/the-rise-o...
Yeah, I'm familiar with that argument re derivative work, but weights aren't really what's being shipped or sold, and I think it's reasonable to argue that the generated tokens aren't derivative but substantively transformed.
That said, I would prefer a situation where hyperscalers make an effort to compensate sources of good data, e.g. newspapers.
Like it or not, Bartz v. Anthropic established that as fair use. So it isn't legally copyright infringement as currently understood under the law. This may change but it isn't obviously wrong.
I think parent poster was referring to the open secret that the early models were trained on massive collections of pirated novels and textbooks.