Hacker News

fwiw our first test rack has been up for about a year now and the full cluster has been operational for training for the past ~6 months. Having it right down the block from our office has been incredibly helpful; I am a bit worried about what e.g. Fremont would look like if we expand there.

I think another big crux here is that there isn't really any notion of cluster-wide downtime, aside from e.g. a full datacenter power outage (which we've had, I guess, and we now have UPSes in each rack, kindly provided and installed by our datacenter). On the software/network level the storage isn't really coordinated in any manner, so the failure of one machine only shows up as a reduction in the total theoretical bandwidth available for training. This means that there's generally no scrambling and we can just schedule maintenance at our leisure. Last time I drew straws for maintenance, I clocked a 30min round trip to walk over, plug a crash cart into each of the 3 problematic machines, reboot and re-initialize them, and that was it.
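
To make the degradation point concrete, here's a rough sketch of the arithmetic. The node count, node names, and per-node bandwidth are made-up illustrative numbers, not the real cluster's, and `aggregate_bandwidth_gbps` is a hypothetical helper. The only assumption carried over from above is that nodes serve data independently, so a dead node just drops out of the sum.

```python
def aggregate_bandwidth_gbps(node_bandwidths_gbps, down_nodes):
    """Sum bandwidth over the nodes that are still up.

    Assumes nodes serve data independently (no coordination), so a
    failed node removes only its own share of the total.
    """
    return sum(
        bw for node, bw in node_bandwidths_gbps.items() if node not in down_nodes
    )


# Hypothetical cluster: 30 nodes at 10 Gbps each (illustrative numbers only).
nodes = {f"node{i:02d}": 10.0 for i in range(30)}

full = aggregate_bandwidth_gbps(nodes, down_nodes=set())
degraded = aggregate_bandwidth_gbps(nodes, down_nodes={"node03", "node11", "node27"})

print(f"full: {full:.0f} Gbps, degraded: {degraded:.0f} Gbps ({degraded / full:.0%} of full)")
```

With 3 of 30 equal nodes down, the job still runs with 90% of the theoretical bandwidth, which is why a failure like that can wait for a scheduled walk over instead of an emergency.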

Again, having it right by the office is super nice; we'll need to really trust our KVM setup before considering anything offsite.