Hacker News

ttfvjktesd 3 days ago

You are under the assumption that only Ceph (and similar complex software) requires staff, whereas a plain 30 PB of disks can be operated basically just by rebooting from time to time.

I think that anyone with actual experience of operating thousands of physical disks in datacenters would challenge this assumption.

devanshp 3 days ago

we have 6 months of experience operating thousands of physical disks in datacenters now! it's about a couple hours a month of employee time in steady-state.

ttfvjktesd 3 days ago

What about all the other infrastructure? Since you are obviously not using the cloud, you must have massive amounts of GPUs and operating systems. All of that has to work together; it's not just a matter of watching the physical disks and you're all set.

Don't get me wrong, I buy the actual numbers regarding hardware costs, but presenting the rest as basically a one-man show in terms of maintenance hours is where I'm very sceptical.

g413n 3 days ago

oh we use cloud gpus, infiniband h100s absolutely aren't something we want to self-host. not aws tho, they're crazy overpriced; mithril and sfcompute!

we also use cloudflare extensively for everything that isn't the core heap dataset, the convenience of buckets is totally worth it for most day-to-day usage.

the heap is really just the main pretraining corpus and nothing else.

ttfvjktesd 3 days ago

How is it going to work when the GPUs are in the cloud and the storage is miles away in a local colo in SF down the street? I was under the impression that the GPUs have to go over the training dataset multiple times, which means transferring 30 PB in and out of the cloud multiple times. Is the data link even fast enough? How much are you charged in data transfer fees?
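
A rough back-of-the-envelope sketch for the bandwidth question: how long one full pass over 30 PB would take at a given sustained link speed. The link speeds below are hypothetical examples, not figures from this thread, and the calculation assumes decimal petabytes and a fully saturated link with no protocol overhead.

```python
PB = 1e15  # decimal petabyte, in bytes


def transfer_days(dataset_bytes: float, link_gbps: float) -> float:
    """Days needed to move dataset_bytes over a link sustaining link_gbps gigabits/s."""
    bits = dataset_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Hypothetical link speeds, for illustration only.
    for gbps in (10, 100, 400):
        days = transfer_days(30 * PB, gbps)
        print(f"{gbps:>4} Gbit/s: {days:7.1f} days per full pass over 30 PB")
```

At an assumed 100 Gbit/s this works out to roughly four weeks per pass, so the answer depends heavily on the actual link and on how many passes over the data are really needed.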

rtp4me 3 days ago

Not really. Have spare drives on the shelf and use the "remote-hands" feature from the CoLo provider. Just open a ticket to have the drive swapped. Pretty easy. For remote server connections just use IPMI/iKVM and iPXE. Again, not too difficult.

The biggest hurdle is getting a management system in place to alert you when something goes wrong, especially at this size. Grafana, Loki, monit, etc. are all good tools that provide quick fault identification.
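
To put a rough number on "have spare drives on the shelf", here is a minimal sketch that estimates expected drive failures over a period and how many spares would cover that period with a chosen confidence. The drive count (2,400) and annual failure rate (1.5%) are made-up illustration values, not numbers from this thread, and failures are assumed independent (Poisson), which ignores bad batches and correlated failures.

```python
import math


def expected_failures(drive_count: int, annual_failure_rate: float, days: float) -> float:
    """Mean number of drive failures over `days`, assuming a constant failure rate."""
    return drive_count * annual_failure_rate * days / 365


def spares_needed(mean_failures: float, confidence: float = 0.99) -> int:
    """Smallest k with P(failures <= k) >= confidence under a Poisson model."""
    if mean_failures > 700:
        raise ValueError("mean too large for this simple sketch (exp underflow)")
    k = 0
    term = math.exp(-mean_failures)  # P(X = 0)
    cdf = term
    while cdf < confidence:
        k += 1
        term *= mean_failures / k  # P(X = k) from P(X = k - 1)
        cdf += term
    return k


if __name__ == "__main__":
    # Hypothetical fleet: 2,400 drives at a 1.5% annual failure rate.
    monthly = expected_failures(2400, 0.015, 365 / 12)
    print(f"expected failures per month: {monthly:.1f}")
    print(f"spares to cover a month at 99%: {spares_needed(monthly)}")
```

Under those assumed inputs the expected load is a few swap tickets a month, which is the kind of figure to check against real failure data from the monitoring stack.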