I don't agree with that definition. For a given model I want to know what I can/cannot expect from it. To have a better understanding of that, I need to know what it was trained on.
For a (somewhat extreme) example: what if I use the model to write children's stories, and suddenly it regurgitates Mein Kampf? That would certainly ruin my day.
Are you going to examine a few petabytes of data for each model you want to run, to check if a random paragraph from Mein Kampf is in there? How?
We need better tools to examine the weights (for example, what gets activated, and to what extent, for which topics). Getting the full training corpus, while nice, cannot be our only option.
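As a sketch of what such a tool could look like (everything here is an assumption: a small Hugging Face causal LM, an arbitrarily chosen middle layer, toy topic prompts), you can hook a transformer block and compare activation magnitudes across topics:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Toy example model; substitute whatever you actually run.
    name = "gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    model.eval()

    acts = {}

    def hook(module, inputs, output):
        # Record the mean absolute activation of this block's output.
        acts["probe"] = output[0].abs().mean().item()

    # Probe one transformer block; which layer is most informative
    # is exactly the kind of open question better tooling should answer.
    model.transformer.h[6].register_forward_hook(hook)

    for topic, prompt in {
        "children": "Once upon a time, a little rabbit",
        "extremist": "The ideology described in Mein Kampf holds that",
    }.items():
        with torch.no_grad():
            model(**tok(prompt, return_tensors="pt"))
        print(topic, acts["probe"])

This only scratches the surface (interpretability research goes much deeper), but it needs no access to the training corpus at all.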
> Are you going to examine a few petabytes of data for each model (...) How?
I can think of a few ways. Perhaps I'd use an LLM to flag objectionable content. But in any case, it's the same argument you could make against e.g. the Linux kernel: are you going to read every line of code to see if it is secure? Maybe, maybe not, but that is not the point.
The point is that, right now, a model is a black box. It might as well be a Trojan horse.
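To sketch that first idea (the classify() function below is a placeholder for whatever moderation model or API you'd actually call; the path and chunk size are made up):

    from pathlib import Path

    CHUNK_CHARS = 4000  # roughly one classifier context window

    def classify(chunk: str) -> bool:
        # Stand-in: a real scan would send the chunk to a classifier LLM.
        # A trivial keyword check here just keeps the sketch runnable.
        return "mein kampf" in chunk.lower()

    def scan_corpus(root: str):
        # Stream the corpus shard by shard, chunk by chunk, yield hits.
        for shard in Path(root).rglob("*.txt"):
            text = shard.read_text(errors="ignore")
            for i in range(0, len(text), CHUNK_CHARS):
                chunk = text[i : i + CHUNK_CHARS]
                if classify(chunk):
                    yield shard, i, chunk[:120]

    for shard, offset, preview in scan_corpus("/data/corpus"):
        print(f"flagged {shard} @ {offset}: {preview!r}")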
Let's pretend for a moment that the entire training corpus for DeepSeek-R1 were released.
How would you download it?
Where would you store it?
I mean, many people I know have 100 TB+ of storage at home now. A large enough team of dedicated community members, cooperating and sharing compute resources online, should be able to reproduce any model.
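Back-of-envelope, with hedged numbers: DeepSeek-V3 (the base model behind R1) reportedly pre-trained on about 14.8T tokens, and raw UTF-8 text runs roughly 4 bytes per token, so:

    tokens = 14.8e12        # reported DeepSeek-V3 pre-training token count
    bytes_per_token = 4     # rough average for raw UTF-8 text
    corpus_tb = tokens * bytes_per_token / 1e12
    print(f"~{corpus_tb:.0f} TB of raw text")  # ~59 TB

That fits on a single 100 TB home array before you even start sharding it across a community. The compute to reproduce the training run is the harder part.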
You would use an LLM to process a few petabytes of data to find a needle in the haystack?
Cheaper to train your own.
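For scale, a hedged back-of-envelope (the per-token price is an assumed bargain rate, not a quote from any provider):

    petabytes = 2
    tokens = petabytes * 1e15 / 4   # ~4 bytes of raw text per token
    usd_per_mtok = 0.10             # assumed cheap input price, $/1M tokens
    cost = tokens / 1e6 * usd_per_mtok
    print(f"~${cost:,.0f}")         # ~$50,000,000 just to read 2 PB once

Compare that with the roughly $5.6M in GPU time DeepSeek reported for the V3 training run, and "cheaper to train your own" isn't even a joke.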
Too bad. The OSI owns "open source".
Big tech has been abusing open source to cheaply capture most of the internet and e-commerce anyway, so perhaps it's time we walked away from the term altogether.
The OSI has abdicated the future of open machine learning. And that's fine. We don't need them.
"Free software" is still a thing and it means a very specific and narrow set of criteria. [1, 2]
There's also "Fair software" [3], which walks the line between CC BY-NC-SA and shareware, but also sticks it to big tech by preventing Redis/Elasticsearch-style capture by the hyperscalers. There's an open game engine [4] with a pretty nice "Apache + NC" type license.
---
Back on the main topic of "open machine learning": since the OSI fucked up, I came up with a ten-point scale here [5] defining open AI models. It's just a draft, but if other people agree with the idea, I'll publish a website about it (so I'd appreciate your feedback!).
There are ten measures by which a model can/should be open:
1. The model code (PyTorch, whatever)
2. The pre-training code
3. The fine-tuning code (which might be very different from the pre-training code)
4. The inference code
5. The raw training data (pre-training + fine-tuning)
6. The processed training data (which might vary across various stages of pre-training and fine-tuning: different sizes, features, batches, etc.)
7. The resultant weights blob(s)
8. The inference inputs and outputs (which also need a license; see also usage restrictions like OpenRAIL)
9. The research paper(s) (hopefully the model is also described and characterized in the literature!)
10. The patents (or lack thereof)
A good open model will have nearly all of these made available; a fake "open" model might only give you two of ten (see the scoring sketch below).
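To make the scale concrete, a minimal sketch of how scoring could work (the axis names just mirror the list above; the example release is made up):

    # Hypothetical scorecard over the ten openness axes above.
    AXES = {
        "model_code", "pretraining_code", "finetuning_code", "inference_code",
        "raw_data", "processed_data", "weights", "io_license",
        "papers", "patents",
    }

    def openness_score(released: set) -> str:
        return f"{len(released & AXES)}/10 axes open"

    # A typical "open weights" release marketed as "open source":
    print(openness_score({"weights", "inference_code"}))  # 2/10 axes open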
---
[1] https://www.fsf.org/
[2] https://en.wikipedia.org/wiki/Free_software
[3] https://fair.io/
[4] https://defold.com/license/
[5] https://news.ycombinator.com/item?id=44438329