What I mean is training a single model across several separate clusters instead of one centralized cluster. It's already been shown that 10B-parameter models can be trained this way (links below). It would be great if more research effort were put into this.
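
To make "training across clusters" a bit more concrete, here's a toy sketch of the general low-communication idea: each cluster takes many local gradient steps on its own data shard, and the clusters only sync occasionally by averaging their parameters. This is not the actual INTELLECT-1 code. The linear regression problem, the function names (`make_shards`, `run_local_steps`, `train`) and the hyperparameters are all made up for illustration.

```python
import random

def make_shards(num_clusters, points_per_cluster, seed=0):
    """Toy data for y = 3x + 1 (hypothetical task), one shard per cluster."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_clusters):
        shard = []
        for _ in range(points_per_cluster):
            x = rng.uniform(-1.0, 1.0)
            y = 3.0 * x + 1.0 + rng.gauss(0.0, 0.01)
            shard.append((x, y))
        shards.append(shard)
    return shards

def run_local_steps(w, b, shard, steps, lr):
    """Full-batch gradient descent on one cluster's shard, with no communication."""
    n = len(shard)
    for _ in range(steps):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in shard) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in shard) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train(shards, rounds=50, steps_per_round=20, lr=0.1):
    """Each round: every cluster trains locally, then parameters are averaged once."""
    w, b = 0.0, 0.0
    for _ in range(rounds):
        # In a real system these would run in parallel on different clusters.
        results = [run_local_steps(w, b, s, steps_per_round, lr) for s in shards]
        # The only cross-cluster communication: one parameter average per round.
        w = sum(r[0] for r in results) / len(results)
        b = sum(r[1] for r in results) / len(results)
    return w, b

w, b = train(make_shards(num_clusters=4, points_per_cluster=50))
print(f"w={w:.3f}, b={b:.3f}")
```

With these settings the clusters communicate once every 20 local steps instead of after every step, which is the whole point when the links between clusters are slow.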

I don't think your approach would work, because you can't get a strong model by distilling several weak ones.

https://www.primeintellect.ai/blog/intellect-1

https://www.primeintellect.ai/blog/intellect-2-release