At first, small startup teams can sometimes get away with treating datacenter management as a side task handled on an as-needed basis. But it will come with downtime, and no matter how well you plan, your stability won't be anywhere near as good as Cloudflare's or AWS's.

Every real-world colocation or self-hosting project I've ever been around has underestimated its downtime and rate of problems by at least an order of magnitude. The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.

There is a false sense of security that comes in the early days of the project when you think you've gotten past the big issues and developed a system that's reliable enough. The real test is always 1-2 years later when teams have churned, systems have grown, and the initial enthusiasm for playing with hardware has given way to deep groans whenever the team has to draw straws to see who gets to debug the self-hosted server setup this time or, worse, drive to the datacenter again.

> The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.

I don't have this experience at all. Our colo handled almost all of the work; the only time I ever went to the server farm was to build out whole new racks. Even replacing servers was something the colo handled for us at a reasonable cost.

Our reliability came from software, not hardware, though of course we had hundreds of spares on hand, plus defense in depth: multiple datacenters, each with two 'brains' that could hot-swap, and each client backed up on 3-4 machines...
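
To make the "backed up on 3-4 machines" idea concrete, here is a minimal illustrative sketch (not the commenter's actual system) that uses rendezvous hashing to pin each client to replicas spread across datacenters; the machine inventory, datacenter names, and replica count are made up:

```python
import hashlib

# Hypothetical machine inventory: (datacenter, machine_id) pairs.
MACHINES = (
    [("ny", f"ny-{i:03d}") for i in range(40)]
    + [("westcoast", f"wc-{i:03d}") for i in range(40)]
    + [("nl", f"nl-{i:03d}") for i in range(40)]
)

def _score(client_id: str, machine_id: str) -> int:
    """Deterministic per-(client, machine) score for rendezvous hashing."""
    digest = hashlib.sha256(f"{client_id}:{machine_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place_client(client_id: str, replicas: int = 3) -> list[tuple[str, str]]:
    """Pick `replicas` machines for a client, spreading across datacenters
    first so a single-DC outage never takes out every copy."""
    ranked = sorted(MACHINES, key=lambda m: _score(client_id, m[1]), reverse=True)
    chosen, used_dcs = [], set()
    # First pass: at most one replica per datacenter.
    for dc, mid in ranked:
        if dc not in used_dcs:
            chosen.append((dc, mid))
            used_dcs.add(dc)
        if len(chosen) == replicas:
            return chosen
    # Second pass: fill remaining slots if replicas exceed the number of DCs.
    for m in ranked:
        if m not in chosen:
            chosen.append(m)
        if len(chosen) == replicas:
            break
    return chosen

if __name__ == "__main__":
    print(place_client("client-42", replicas=3))
```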

Servers going down was fairly commonplace, and servers dying outright was commonplace too. I think we once had a whole-rack outage when the switch died, and we flipped over to the backup.

Yes, these things can be done, and for a lot less than paying AWS.

> Our reliability came from software, not hardware, though of course we had hundreds of spares on hand, plus defense in depth: multiple datacenters, each with two 'brains' that could hot-swap, and each client backed up on 3-4 machines...

Of course, but building and managing that software stack, keeping hundreds of spares across locations, spanning multiple datacenters, and running a hot-swappable backup system is not a simple engineering endeavor.

The only way to reach this point is to invest a very large amount of time into it. It requires additional headcount or putting other work on pause.

I was trying to address the type of buildout in this article: Small team, single datacenter, gets the job done but comes with tradeoffs.

The other type of self-buildout that you describe is ideal when you have a larger team and extra funds to allocate to putting it all together, managing it, and staffing it. However, once you do that, it's not fair to exclude the cost of R&D and the ongoing headcount needs.

It's tempting to sweep it under the rug and call it part of the overall engineering R&D budget, but there is no question that a large cost is associated with what you described, as opposed to spinning up an AWS or Cloudflare account and having access to a battle-tested storage system a few minutes later.

To be fair, what's described here is much more robust than what you get with a simple AWS setup. At a minimum that's a multi-region setup, but if the DCs have different owners I'd even compare it to a multi-cloud setup.

Not multi-cloud, but multi-infrastructure. Yes, there were naturally different owners, since there were colos in NY, on the West Coast, in the Netherlands, etc.

Not caring about redundancy/reliability is really nice: each healthy HDD is just another +20 TB of pretraining data, and every drive lost is the same marginal cost.

When you lose 20 TB of video, where do you get 20 TB of new video to replace it?

FWIW, our first test rack has been up for about a year now, and the full cluster has been operational for training for the past ~6 months. Having it right down the block from our office has been incredibly helpful; I am a bit worried about what e.g. Fremont would look like if we expand there.

I think another big crux here is that there isn't really any notion of cluster-wide downtime, aside from e.g. a full datacenter power outage (which we have had, admittedly, and we now have UPSes in each rack, kindly provided and installed by our datacenter). At the software/network level the storage isn't coordinated in any manner, so failures of one machine only show up as a degradation of the total theoretical bandwidth for training. This means there's generally no scrambling and we can just schedule maintenance at our leisure. The last time I drew straws for maintenance, I clocked a 30-minute round trip to walk over and plug a crash cart into each of the 3 problematic machines to reboot and re-initialize them, and that was it.

Again, having it right by the office is super nice; we'll need to really trust our KVM setup before considering anything offsite.

For drive issues, this is easy. Have a stack of replacements on hand and just open a "remote hands" ticket with the colo provider to swap out the drive. This can usually be done within 1-2 hours of opening the ticket.
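
As an illustration of deciding which drive actually needs swapping before opening that ticket, a hedged sketch that shells out to smartctl (requires smartmontools with JSON output support; the device list is a placeholder):

```python
import json
import subprocess

# Illustrative device list; real inventories usually come from lsblk or an asset DB.
DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]

def smart_health(device: str) -> dict:
    """Run smartctl and return its parsed JSON report (health + attributes)."""
    out = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True, text=True, check=False,  # smartctl uses nonzero exits as status bits
    )
    return json.loads(out.stdout)

for dev in DEVICES:
    report = smart_health(dev)
    passed = report.get("smart_status", {}).get("passed")
    if passed is False:
        print(f"{dev}: FAILED self-assessment -> file remote-hands ticket")
    else:
        print(f"{dev}: ok")
```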

For server issues: again, pretty easy. Just use iKVM/IPMI and iPXE to diagnose a faulty server. Again, "remote hands" from the colo provider can help fix problems if your staff doesn't have the skills.
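
For example, a rough sketch of remote diagnosis over IPMI by shelling out to ipmitool; the BMC address and credentials are placeholders, and which checks matter will depend on the vendor's BMC:

```python
import subprocess

# Placeholder BMC connection details.
BMC_HOST = "10.0.0.50"
BMC_USER = "admin"
BMC_PASS = "changeme"

def ipmi(*args: str) -> str:
    """Run an ipmitool command against the server's BMC over the network."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
           "-U", BMC_USER, "-P", BMC_PASS, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Is the chassis even powered on?
print(ipmi("chassis", "status"))

# Hardware event log: failed fans, DIMM errors, PSU faults, etc.
print(ipmi("sel", "list"))

# If the box is wedged, PXE-boot a diagnostic image on the next power cycle.
# print(ipmi("chassis", "bootdev", "pxe"))
# print(ipmi("chassis", "power", "cycle"))
```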

In my experience, the issues that take 80% of your time are the unexpected edge cases, not the easy fixes.

Swapping drives is basically the easiest fix. The issues that cause the most problems are the hard-to-diagnose ones, like the faulty RAM that flips a bit every once in a while or the hard drive controller that triggers a driver bug with weird behavior that doesn't show up in the logs as anything meaningful.
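
One way to catch the flaky-DIMM case early is to watch the kernel's EDAC counters for correctable ECC errors. A minimal sketch, assuming the platform exposes EDAC under the standard Linux sysfs path (the alert threshold here is arbitrary):

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; not all platforms/BIOSes expose it.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

# Arbitrary illustrative threshold: a trickle of correctable errors on one
# memory controller is often the early warning for a DIMM on its way out.
CE_ALERT_THRESHOLD = 10

def check_ecc_counters() -> None:
    if not EDAC_ROOT.exists():
        print("EDAC not available on this machine")
        return
    for mc in sorted(EDAC_ROOT.glob("mc*")):
        ce = int((mc / "ce_count").read_text())  # correctable errors
        ue = int((mc / "ue_count").read_text())  # uncorrectable errors
        if ue > 0:
            print(f"{mc.name}: {ue} UNCORRECTABLE errors -> replace DIMM now")
        elif ce > CE_ALERT_THRESHOLD:
            print(f"{mc.name}: {ce} correctable errors -> schedule DIMM swap")
        else:
            print(f"{mc.name}: ok (ce={ce}, ue={ue})")

if __name__ == "__main__":
    check_ecc_counters()
```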

Sure, but realistically, how often does this really happen? I have probably replaced 3 or 4 DIMMs over the past few years. Hardware is very reliable these days.