> The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.
I don't have this experience at all. Our colo handled almost all of the work; the only time I ever went to the server farm was to build out whole new racks. The colo even handled server replacements for us at a reasonable cost.
Our reliability came from software, not hardware, though of course we had hundreds of spares sitting by. The defense in depth: multiple datacenters, each datacenter having two 'brains' that could hot-swap, and each client multiply backed up on 3-4 machines...
Servers going down was fairly commonplace; servers dying was commonplace too. I think once we had a whole-rack outage when the switch died, and we flipped over to the backup.
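The placement and failover logic behind that kind of scheme doesn't have to be enormous, either. A toy sketch of the idea (every name and number here is made up for illustration, not our actual stack):

```python
import hashlib

# Hypothetical fleet layout, purely illustrative.
DATACENTERS = {
    "ny":   ["ny-01", "ny-02", "ny-03", "ny-04"],
    "west": ["we-01", "we-02", "we-03", "we-04"],
    "nl":   ["nl-01", "nl-02", "nl-03", "nl-04"],
}

def pick_replicas(client_id: str, copies: int = 3) -> list[str]:
    """Deterministically spread a client's copies across datacenters,
    so no single machine (or single DC) loss takes the client down."""
    digest = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
    dcs = sorted(DATACENTERS)
    picks = []
    for i in range(copies):
        dc = dcs[(digest + i) % len(dcs)]        # rotate across DCs first
        machines = DATACENTERS[dc]
        picks.append(machines[(digest + i) % len(machines)])
    return picks

class BrainPair:
    """The two per-DC 'brains': an active coordinator plus a hot standby.
    A missed heartbeat simply swaps the roles."""
    def __init__(self, active: str, standby: str):
        self.active, self.standby = active, standby

    def on_missed_heartbeat(self) -> None:
        self.active, self.standby = self.standby, self.active
```

The hard part isn't this logic, of course; it's everything around it: detecting failures reliably, re-replicating after a loss, and actually testing the failover paths.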
Yes, these things can be done, and a lot cheaper than paying AWS.
> Our reliability came from software, not hardware, though of course we had hundreds of spares sitting by. The defense in depth: multiple datacenters, each datacenter having two 'brains' that could hot-swap, and each client multiply backed up on 3-4 machines...
Of course, but building and managing the software stack, managing hundreds of spares across locations, spanning datacenters, and running a hot-swap backup system is not a simple engineering endeavor.
The only way to reach that point is to invest a very large amount of time in it. It requires either additional headcount or putting other work on pause.
I was trying to address the type of buildout described in this article: small team, single datacenter, gets the job done but comes with tradeoffs.
The other type of self-buildout, the one you describe, is ideal when you have a larger team and extra funds to allocate to putting it all together, managing it, and staffing it. However, once you do that, it's not fair to exclude the cost of the R&D and the ongoing headcount needs.
It's tempting to sweep it under the rug and call it part of the overall engineering R&D budget, but there is no question that there's a large cost associated with what you described, as opposed to spinning up an AWS or Cloudflare account and having access to a battle-tested storage system a few minutes later.
To be fair, what's described here is much more robust than what you get with a simple AWS setup. At a minimum that's a multi-region setup, but if the DCs have different owners I'd even compare it to a multi-cloud setup.
Not multi-cloud, but multi-infrastructure. Yes, there were naturally different owners, since there were colos in NY, the West Coast, the Netherlands, etc.
Not caring about redundancy/reliability is really nice: each healthy HDD is just another +20 TB of pretraining data, and every drive lost is the same marginal cost.
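In back-of-the-envelope terms (all numbers invented for illustration):

```python
# Toy math: with fungible data and no redundancy, a dead drive removes
# exactly its own capacity; nothing has to be rebuilt or re-replicated.
DRIVE_TB = 20
DRIVE_COST_USD = 300         # hypothetical price per drive
ANNUAL_FAILURE_RATE = 0.015  # hypothetical ~1.5% AFR

def replacement_cost_per_tb_year(n_drives: int) -> float:
    usable_tb = n_drives * DRIVE_TB                # every drive counts fully
    failures = n_drives * ANNUAL_FAILURE_RATE      # losses scale linearly
    return failures * DRIVE_COST_USD / usable_tb

print(replacement_cost_per_tb_year(100))     # 0.225 $/TB-year
print(replacement_cost_per_tb_year(10_000))  # same: per-TB cost never grows
```

No RAID overhead, no rebuild traffic: the loss curve stays a straight line at any fleet size.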
When you lose 20 TB of video, where do you get 20 TB of new video to replace it?