> Are you going to examine a few petabytes of data for each model (...) How?
I can think of a few ways. Perhaps I'd use an LLM to find objectionable content. But in any case, it's the same argument you could make against, e.g., the Linux kernel. Are you going to read every line of code to see if it is secure? Maybe, maybe not, but that is not the point.
The point is that, as things stand, a model is a black box. It might as well be a Trojan horse.
Let's pretend for a moment that the entire training corpus for Deepseek-R1 were released.
How would you download it?
Where would you store it?
I mean, many people I know have 100 TB+ of storage at home now. A large enough team of dedicated community members, cooperating and sharing compute resources online, should be able to reproduce any model.
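A rough back-of-envelope supports the storage side of this claim. The figures below are illustrative assumptions (a ~3 PB corpus, 100 TB per contributor, 3x replication), not numbers from any actual model's training set:

```python
import math

def nodes_needed(corpus_tb: float, node_capacity_tb: float, replicas: int) -> int:
    """Contributors needed to hold `replicas` full copies of the corpus."""
    return math.ceil(corpus_tb * replicas / node_capacity_tb)

# Assumed: 3 PB corpus, 100 TB per home node, 3x replication for durability.
print(nodes_needed(3000, 100, 3))  # 90 contributors
```

Under those assumptions, fewer than a hundred dedicated hobbyists could mirror the whole corpus with redundancy; coordinating the compute for a training run is a much harder problem than the storage.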
You would use an LLM to process a few petabytes of data to find a needle in a haystack?
Cheaper to train your own.
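A quick cost sketch makes the point concrete. Every number here is a loose assumption (corpus size, bytes per token, per-token price), not a measured figure:

```python
# Assumed: ~3 PB of raw text, ~4 bytes/token on average,
# $0.10 per million input tokens for a budget screening model.
CORPUS_BYTES = 3e15
BYTES_PER_TOKEN = 4
PRICE_PER_MTOK = 0.10

tokens = CORPUS_BYTES / BYTES_PER_TOKEN
scan_cost = tokens / 1e6 * PRICE_PER_MTOK
print(f"LLM scan: ~${scan_cost / 1e6:.0f}M")  # LLM scan: ~$75M
```

Tens of millions of dollars just to read the data once, which is the same order as (or more than) the headline training-cost figures reported for recent open models, so "cheaper to train your own" is at least plausible.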