<< You will always have horrifically slow latency compared to if you pack the servers together in the same place with specialized networking.

Agree about the physics; disagree about the larger point.

I am not questioning that servers packed together may be the optimal setup for how we currently do things, but, and this is my point, what if we did things differently?

<< you cannot get that with distributed training

This is entirely the wrong question to ask. The question to ask is: how could it be adapted to distributed training?

You know what, I'm surprised to find out this is far more feasible than I assumed. DiLoCo and the INTELLECT models demonstrate how feasible decentralized training already is, and it is very surprising to me that you can get that far with so much less communication bandwidth. Not only that, but distributed training gets _more_ feasible as you scale: the compute needed scales as the square of parameter count while communication scales linearly, so the overhead penalty goes down.
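
To put rough numbers on that scaling argument, here is a back-of-envelope sketch in Python. Everything in it is an assumption for illustration, not a measurement: compute is approximated as 6 * N * D FLOPs with D = 20 tokens per parameter (a common rule of thumb, which is where the N-squared comes from), each sync round ships one full copy of the parameters at 2 bytes each, and the number of sync rounds is held fixed as the model grows. The function names are made up for this example.

```python
# Back-of-envelope sketch. All constants are illustrative assumptions.

def training_flops(params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate training compute as 6 * N * D, with D = tokens_per_param * N."""
    tokens = tokens_per_param * params
    return 6.0 * params * tokens  # grows as N^2 when D is proportional to N


def sync_bytes(params: float, sync_rounds: int = 1000,
               bytes_per_param: float = 2.0) -> float:
    """Bytes one worker sends over the whole run, assuming it ships a full
    copy of the parameters in each of a fixed number of sync rounds."""
    return params * bytes_per_param * sync_rounds  # grows as N


def flops_per_byte(params: float) -> float:
    """Compute done per byte communicated; higher means less relative overhead."""
    return training_flops(params) / sync_bytes(params)


if __name__ == "__main__":
    for n in (1e9, 1e10, 1e11):
        print(f"{n:.0e} params: {flops_per_byte(n):.2e} FLOPs per byte sent")
```

Under these assumptions, 10x the parameters means 10x more compute per byte communicated. The fixed number of sync rounds is doing a lot of work here: if the number of rounds grows with the token count, the advantage shrinks accordingly.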

I think the most important problem is that you have to marshal enough compute to be meaningful, and that is going to get more and more difficult as frontier compute requirements grow.

It is a genuinely interesting problem (above my mental abilities, but there are people smarter than me who could make it work). I agree that compute could end up being an issue as things progress. Still, it seems that portions of what would be necessary kinda exist already.

But, and it is not a small but, there is no money in it. In fact, big orgs are bound to lose money should something like that succeed.