Isn't training material the biggest problem for truly open source LLMs (ones that could compete with top-tier models)? The compute part can be solved with money, but compiling a comprehensive training set that can be freely shared and is free of copyright issues is pretty much impossible.
I wonder if we could gamify and democratise it somehow, like Folding@home and Wikipedia...
I've been training a teeny specialised model, meant to run in a browser on a phone, to detect harmonium notes played in a song (the harmonium turns out to be a PITA, but that's a story for another day). Getting good labelled data is _all_ of the hard work.
That being said, using a big model to train something ultra-suited to the task at hand might be how we handle cheap local inference; I'm thinking of language-specific models.
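For what it's worth, one common way to do the "big model trains a small one" thing is knowledge distillation: the small model is trained to match the big model's softened output distribution rather than hand-made labels. Below is a minimal, dependency-free sketch of that loss, not anyone's actual pipeline. The three note classes, the logit values, and the temperature of 2.0 are all made-up assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over three note classes for a single audio frame.
teacher = [4.0, 1.0, 0.5]       # big model's output (assumed)
good_student = [3.8, 1.1, 0.4]  # small model that roughly agrees
bad_student = [0.5, 1.0, 4.0]   # small model that disagrees

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, bad_student))
```

The student that agrees with the teacher should get the lower loss. In a real setup this number would be minimised by gradient descent over the small model's weights, usually alongside a normal loss on whatever labelled data you do have.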
Didn't the courts decide that if it's just for learning, everything is fair game?
You don't need fully copyright-unencumbered datasets to build Open Source AI, since requiring that would (as you say) be impossible. https://opensource.org/ai