I don’t see how that helps, unless you actually mean open source rather than open weights, which is what most people mean when they say it. Without everything that goes into the model, including the training data, these things are opaque.

Actual open source is hard without a big war chest that allows you to flagrantly steal the training data.

Honest question: why is that? Surely there are smart humans who did not read and learn "all the books". Can AI not be trained by re-reading the same material multiple times to reinforce it?

That may very well be the case. In fact, I'm nearly certain that you're right. But it doesn't change the fact that open weight models are altogether insufficient on a number of important dimensions regarding freedom and transparency. And so often (as in the comment I replied to, I think), even technical people seem to just ignore the difference. Open weights are just weights. No amount of open-washing changes that.

The raw training data is so large that very few parties could host it for free even if there weren't copyright barriers.

But I think you could have a full open source training software pipeline that's set up to work with Wikipedia, Common Crawl, Books3, Library Genesis, Anna's Archive, and whatever other useful data sets people can name. There would just be a step where you have to provide your own copy of Library Genesis (or whatever subset of it you have managed to obtain).
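To make the "bring your own copy" step concrete, here is a rough sketch of how it could look. Everything in it is made up for illustration: the `Source` manifest, the `resolve_sources` and `iter_documents` helpers, and the assumption that each data set has already been unpacked into a directory of plain `.txt` files. It is not taken from any existing training pipeline.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Source:
    """One entry in the (hypothetical) data set manifest."""
    name: str
    required: bool = False  # if True, the pipeline refuses to run without it


# Hypothetical manifest: the pipeline ships the names, never the data.
SOURCES: List[Source] = [
    Source("wikipedia"),
    Source("common_crawl"),
    Source("books3"),
    Source("library_genesis"),
    Source("annas_archive"),
]


def resolve_sources(
    local_paths: Dict[str, str], sources: List[Source] = SOURCES
) -> Dict[str, Path]:
    """Map each known source to the directory the user supplied for it.

    Sources with no usable local directory are skipped, unless they are
    marked as required, in which case we fail loudly.
    """
    resolved: Dict[str, Path] = {}
    for src in sources:
        raw = local_paths.get(src.name)
        if raw is not None and Path(raw).is_dir():
            resolved[src.name] = Path(raw)
        elif src.required:
            raise FileNotFoundError(f"required data set missing: {src.name}")
    return resolved


def iter_documents(resolved: Dict[str, Path]) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, text) pairs from every .txt file that was found.

    Assumes plain-text files; a real pipeline would need per-source parsers.
    """
    for name, root in resolved.items():
        for path in sorted(root.rglob("*.txt")):
            yield name, path.read_text(encoding="utf-8", errors="replace")
```

You would call it with whatever you actually have on disk, e.g. `resolve_sources({"wikipedia": "/data/wiki"})`, and the rest of the pipeline just trains on what `iter_documents` yields. The code stays fully open while the data stays the user's problem.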