oh we use cloud gpus, infiniband h100s absolutely aren't something we want to self-host. not aws tho, they're crazy overpriced; we use mithril and sfcompute!

we also use cloudflare extensively for everything that isn't the core heap dataset; the convenience of buckets is totally worth it for most day-to-day usage.
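(for context: cloudflare's r2 buckets speak the s3 api, so day-to-day usage is just a stock s3 client pointed at an r2 endpoint. a minimal sketch with boto3; the account id, keys, bucket, and paths below are placeholders, not our actual setup:)

```python
import boto3

# r2 exposes an s3-compatible endpoint; account id, keys, and
# bucket/object names here are placeholder values
r2 = boto3.client(
    "s3",
    endpoint_url="https://<account_id>.r2.cloudflarestorage.com",
    aws_access_key_id="<access_key_id>",
    aws_secret_access_key="<secret_access_key>",
)

# after that it behaves like any other s3 bucket
r2.upload_file("checkpoint.pt", "my-bucket", "runs/exp1/checkpoint.pt")
for obj in r2.list_objects_v2(Bucket="my-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])
```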

the heap is really just the main pretraining corpus and nothing else.

How is this going to work when the GPUs are in the cloud and the storage is in a local colo down the street in SF? I was under the impression that the GPUs have to go over the training dataset multiple times, which means transferring 30 PB in and out of the cloud multiple times. Is the data link even fast enough? And how much do you get charged in data transfer fees?
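Back of the envelope (30 PB taken from above; the link speeds are illustrative assumptions, not anyone's actual setup):

```python
# rough transfer-time math for pulling the corpus from a colo
# to cloud gpus; 30 PB is from the question, link speeds are
# illustrative assumptions
DATASET_BYTES = 30e15  # 30 PB

for gbps in (10, 100, 400):
    bytes_per_sec = gbps * 1e9 / 8           # line rate -> bytes/sec
    days = DATASET_BYTES / bytes_per_sec / 86_400
    print(f"{gbps:>4} Gbps link: ~{days:,.0f} days per full pass")
```

That works out to roughly 278 days per pass at 10 Gbps, 28 days at 100 Gbps, and 7 days at 400 Gbps, which is why I'm wondering about the link.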