You know what, I'm surprised to find this is far more feasible than I assumed. DiLoCo and the INTELLECT models show that decentralized training already works, and it really surprises me that you can get that far with so much less communication bandwidth. Not only that, distributed training gets _more_ feasible as you scale: the compute needed grows as the square of the parameter count while communication grows only linearly, so the relative overhead penalty goes down.
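To put rough numbers on that scaling argument, here is a back-of-envelope sketch in Python. Everything in it is my own assumption rather than anything taken from DiLoCo or INTELLECT: training compute is approximated as 6 FLOPs per parameter per token, the token count is assumed to grow in proportion to the parameter count (20 tokens per parameter), and the number of synchronization rounds is held fixed at a made-up 1000 with 2 bytes exchanged per parameter. The function names are hypothetical.

```python
# Back-of-envelope sketch, not a model of any real training run.
# Assumptions (all mine): FLOPs ~= 6 * params * tokens, tokens = 20 * params,
# a fixed number of sync rounds, and 2 bytes sent per parameter per round.

def training_flops(n_params, tokens_per_param=20.0):
    """Approximate total training compute; grows as n_params squared."""
    tokens = tokens_per_param * n_params
    return 6.0 * n_params * tokens

def sync_bytes(n_params, n_syncs=1000, bytes_per_param=2):
    """Approximate total bytes one worker exchanges; grows linearly in n_params."""
    return float(n_params) * bytes_per_param * n_syncs

def bytes_per_flop(n_params):
    """Communication per unit of compute; shrinks as the model grows."""
    return sync_bytes(n_params) / training_flops(n_params)

if __name__ == "__main__":
    for n in (1e9, 1e10, 1e11):
        print(f"{n:.0e} params: {bytes_per_flop(n):.3e} bytes per FLOP")
```

Under these assumptions the bytes-per-FLOP figure falls in proportion to 1/parameters, which is the "overhead penalty goes down" effect. If the number of sync rounds also grows with model size, the advantage shrinks accordingly.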
I think the most important problem is that you have to marshal enough compute to be meaningful, and that is going to get harder and harder as frontier compute requirements grow.
It is a genuinely interesting problem (above my mental abilities, but there are people smarter than me who could make it work). I agree that compute could end up being an issue as things progress. Still, it seems that parts of what would be necessary kinda exist already.
But, and it is not a small but, there is no money in it. In fact, big orgs are bound to lose money should something like that succeed.