Excuse my ignorance if by "distributed training" you mean a specific process, but couldn't this be considered a step toward distributed training? If nations train models independently and then later distill them into a single model, all the work (both the compute and the research processes) are distributed for the initial training phase.
I mean it as in, train a model across different clusters instead of a centralized cluster. It's been shown that it's possible to train 10B models this way. If more research effort was put into this, that would be great
I don't think your approach would work because you can't create a strong model from distilling several weak models.
https://www.primeintellect.ai/blog/intellect-1
https://www.primeintellect.ai/blog/intellect-2-release