Well - maybe so. But the common belief is that training itself is a violation of copyright, no matter how it's done. That's the argument I'm countering here.
The issue is that the trainers have not sought licenses for the data; instead, they have outright pirated it.
I don't think anyone believes that all training is a copyright violation when all the training data is licensed. For example, an LLM trained on CC0 content would be fine with basically everyone.
The problem is that training happens on data that is not licensed for that use. Some of that data is also pirated, which makes the illegality even clearer.
But why should separate licensing be required at all? A search engine reads and indexes every word of every page it crawls. No one argues that requires licensing, only that the outputs must respect copyright. Why should training be different?
When Google started outputting summaries, people asked the same questions.
If you supplant the value of the original, using the original itself as input, then you probably have some legal questions to answer.
But that's about the output, not the training. We agree: outputs that supplant the original are the problem. A model constrained to produce only fair use outputs causes no such harm — regardless of what it was trained on.