Actual open source is hard without a big war chest that allows you to flagrantly steal the training data.

That may very well be the case. In fact, I'm nearly certain that you're right. But it doesn't change the fact that open-weight models fall short on a number of important dimensions of freedom and transparency. And so often (as in the comment I replied to, I think), even technical people seem to just ignore the difference. Open weights are just weights. No amount of open-washing changes that.

The raw training data is so large that very few parties could host it for free even if there weren't copyright barriers.

But I think you could have a fully open source training pipeline that's set up to work with Wikipedia, Common Crawl, Books3, Library Genesis, Anna's Archive, and whatever other useful datasets people can name. There would just be a step where you have to provide your own copy of Library Genesis (or whatever subset of it you have managed to obtain).
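
To make that concrete, here's a rough sketch in Python of what the "provide your own copy" step could look like. Everything in it is hypothetical: `Source`, `MANIFEST`, `resolve_sources`, the `data` cache directory, and the choice of which sources are flagged bring-your-own are names and assumptions I made up for illustration, not any existing project's API.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Source:
    name: str
    bring_your_own: bool  # True if the pipeline cannot ship or fetch this data itself


# Hypothetical manifest. Which sources need a user-supplied copy is an assumption.
MANIFEST = [
    Source("wikipedia", bring_your_own=False),
    Source("common_crawl", bring_your_own=False),
    Source("library_genesis", bring_your_own=True),
]


def resolve_sources(manifest, local_copies, cache_dir=Path("data")):
    """Map each usable source to a directory and list the sources that are skipped.

    local_copies maps a source name to a directory the user already has on disk.
    """
    resolved, skipped = {}, []
    for src in manifest:
        if not src.bring_your_own:
            # A real pipeline would download into this directory; not done here.
            resolved[src.name] = cache_dir / src.name
        elif src.name in local_copies and Path(local_copies[src.name]).is_dir():
            # The user supplied their own copy (or whatever subset they have).
            resolved[src.name] = Path(local_copies[src.name])
        else:
            # No copy provided: leave it out rather than fail the whole run.
            skipped.append(src.name)
    return resolved, skipped
```

Calling `resolve_sources(MANIFEST, {})` would leave you with just the freely fetchable subset, while passing something like `{"library_genesis": "/path/to/your/copy"}` adds your own data to the mix. The point is that the code and the manifest can be fully open even when some of the data can't be redistributed.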