Excuse my ignorance if by "distributed training" you mean a specific process, but couldn't this be considered a step toward distributed training? If nations train models independently and then later distill them into a single model, all the work (both the compute and the research processes) are distributed for the initial training phase.

I mean it as in, train a model across different clusters instead of a centralized cluster. It's been shown that it's possible to train 10B models this way. If more research effort was put into this, that would be great

I don't think your approach would work because you can't create a strong model from distilling several weak models.

https://www.primeintellect.ai/blog/intellect-1

https://www.primeintellect.ai/blog/intellect-2-release