The biggest part that is always missing in such comparisons is the employee salaries. In the calculation they give $354k/year of total cost. But now add the cost of staff in SF to operate that thing.
The biggest part missing from the opposing side is that their view is very much rooted in the pre-cloud hardware infrastructure world, where you'd pay sysadmins a full salary to sit in a dark room and monitor these servers.
The reality nowadays is: the on-prem staff is covered in the colo fees, which is split between everyone coloing in the location and reasonably affordable. The software-level work above that has massively simplified over the past 15 years, and effectively rivals the volume of work it would take to run workloads in the cloud (do you think managing IAM and Terraform is free?)
> do you think managing IAM and Terraform is free?
No, but I would argue that a SaaS offering, where the storage system is maintained for you, actually requires fewer maintenance hours than hosting 30 PB in a colo.
In Terraform you define the S3 bucket and run terraform apply. After that, the company's credit card is the limit. Setting up and operating 30 PB yourself is an entirely different story.
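To make the contrast concrete, here is roughly what "the credit card is the limit" looks like; a minimal sketch in Python with boto3 rather than Terraform, with a hypothetical bucket name and region:

```python
# Minimal sketch: provisioning "unlimited" object storage is one API call.
# Assumes AWS credentials are already configured; bucket name is hypothetical.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
s3.create_bucket(
    Bucket="my-training-data",  # hypothetical name
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
# From here on, capacity planning is effectively someone else's problem;
# the self-hosted equivalent for 30 PB is racks, erasure coding or RAID,
# monitoring, and a shelf of spares.
```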
Yeah, colo help has been great. We had a power blip and, without any hassle, they covered the cost and installation of UPSes for every rack, without us needing to think about it beyond some email coordination.
Small startup teams can sometimes get away with datacenter management being a side task that gets done on an as-needed basis at first. It will come with downtime and your stability won't be anywhere near as good as Cloudflare or AWS no matter how well you plan, though.
Every real-world colocation or self-hosting project I've ever been around has underestimated its downtime and rate of problems by at least an order of magnitude. The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.
There is a false sense of security that comes in the early days of the project when you think you've gotten past the big issues and developed a system that's reliable enough. The real test is always 1-2 years later when teams have churned, systems have grown, and the initial enthusiasm for playing with hardware has given way to deep groans whenever the team has to draw straws to see who gets to debug the self-hosted server setup this time or, worse, drive to the datacenter again.
> The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.
I don't have this experience at all. Our colo handled almost all of the work. The only time I ever went to the server farm was to build out whole new racks. Even replacing servers the colo handled for us at a good price.
Our reliability came from software, not hardware, though of course we had hundreds of spares on hand, plus defense in depth (multiple datacenters, each datacenter having two 'brains' which could hot-swap, each client backed up on 3-4 machines)...
Servers going down was fairly commonplace, and servers dying was commonplace too. I think once we had a whole-rack outage when the switch died, and we flipped over to the backup.
Yes these things can be done and a lot cheaper than paying AWS.
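The "each client backed up on 3-4 machines across datacenters" idea is simple to express in code; a toy sketch of replica placement (the node names and data structure are invented for illustration, not the commenter's actual system):

```python
# Toy replica placement: spread copies across distinct datacenters first,
# then fill remaining slots from any machine. Purely illustrative.
from collections import namedtuple

Node = namedtuple("Node", ["name", "datacenter"])

def place_replicas(nodes, replicas=3):
    """Pick `replicas` nodes, preferring as many distinct datacenters as possible."""
    chosen, used_dcs = [], set()
    # First pass: at most one node per datacenter.
    for node in nodes:
        if node.datacenter not in used_dcs:
            chosen.append(node)
            used_dcs.add(node.datacenter)
        if len(chosen) == replicas:
            return chosen
    # Second pass: top up from any remaining nodes.
    for node in nodes:
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == replicas:
            break
    return chosen

nodes = [Node("ny-01", "NY"), Node("ny-02", "NY"),
         Node("west-01", "US-West"), Node("nl-01", "NL")]
print(place_replicas(nodes, replicas=3))  # one copy each in NY, US-West, NL
```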
> Our reliability came from software, not hardware, though of course we had hundreds of spares on hand, plus defense in depth (multiple datacenters, each datacenter having two 'brains' which could hot-swap, each client backed up on 3-4 machines)...
Of course, but building and managing the software stack, managing hundreds of spares across locations, spanning across datacenters, having a hotswap backup system is not a simple engineering endeavor.
The only way to reach this point is to invest a very large amount of time into it. It requires additional headcount or to put other work on pause.
I was trying to address the type of buildout in this article: Small team, single datacenter, gets the job done but comes with tradeoffs.
The other type of self-buildout that you describe is ideal when you have a larger team and extra funds to allocate to putting it all together, managing it, and staffing it. However, once you do that, it's not fair to exclude the cost of R&D and the ongoing headcount needs.
It's tempting to sweep it under the rug and call it part of the overall engineering R&D budget, but there is no question that what you described carries a large cost compared to spinning up an AWS or Cloudflare account and having access to a battle-tested storage system a few minutes later.
To be fair, what's described here is much more robust than what you get with a simple AWS setup. At a minimum that's a multi-region setup, but if the DCs have different owners I'd even compare it to a multi-cloud setup.
Not multi-cloud, but multi-infrastructure. Yes, there were naturally different owners, since there were colos in NY, the West Coast, the Netherlands, etc.
not caring about redundancy/reliability is really nice, each healthy HDD is just the same +20TB of pretraining data and every drive lost is the same marginal cost.
When you lose 20 TB of video, where do you get 20 TB of new video to replace it?
FWIW our first test rack has been up for about a year now and the full cluster has been operational for training for the past ~6 months. Having it right down the block from our office has been incredibly helpful; I am a bit worried about what e.g. Fremont would look like if we expand there.
I think another big crux here is that there isn't really any notion of cluster-wide downtime, aside from e.g. a full datacenter power outage (which we have had, admittedly, and now have UPSes in each rack, kindly provided and installed by our datacenter). On the software/network level the storage isn't coordinated in any manner, so the failure of one machine only shows up as a degradation of the total theoretical bandwidth available for training. This means there's generally no scrambling and we can just schedule maintenance at our leisure. Last time I drew straws for maintenance, I clocked a 30-minute round trip to walk over and plug a crash cart into each of the 3 problematic machines to reboot and re-initialize them, and that was it.
Again, having it right by the office is super nice; we'll need to really trust our KVM setup before considering anything offsite.
For drive issues, this is easy. Have a stack of replacements on hand and just open a "remote-hands" ticket with the CoLo provider to swap out the drive. This can usually be done within 1-2 hours of opening the ticket.
For server issues: again, pretty easy. Just use iKVM/IPMI and iPXE to diagnose a faulty server. Again, using "remote-hands" from the CoLo provider can help fix problems if your staff does not have the skills.
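For the IPMI side, the day-to-day work often amounts to a handful of ipmitool invocations; a rough sketch of the kind of wrapper you might script (hostnames, user, and password are placeholders, and it assumes ipmitool is installed and the BMCs are reachable on the management network):

```python
# Sketch: remote health/power check over IPMI using ipmitool.
import subprocess

BMC_HOSTS = ["bmc-rack1-node01.example", "bmc-rack1-node02.example"]  # placeholders

def ipmi(host, *args):
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", "admin", "-P", "changeme", *args]  # placeholder credentials
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

for host in BMC_HOSTS:
    print(host, "->", ipmi(host, "chassis", "power", "status"))
    # For a wedged box, the fix is usually just:
    #   ipmi(host, "chassis", "power", "cycle")
    # and the System Event Log often explains what happened:
    #   ipmi(host, "sel", "elist")
```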
In my experience, the issues that take 80% of your time are the unexpected edge cases, not the easy fixes.
Swapping drives is basically the easiest fix. The issues that cause the most problems are the hard-to-diagnose ones, like the faulty RAM that flips a bit every once in a while, or the hard drive controller that triggers a driver bug with weird behavior that doesn't show up in the logs as anything meaningful.
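One place the "flips a bit every once in a while" failures do eventually surface is the kernel's EDAC counters; a rough sketch of checking them on Linux (this assumes the host's memory controller is covered by an EDAC driver, which is an assumption about the hardware):

```python
# Rough sketch: read correctable/uncorrectable ECC error counts from EDAC sysfs.
from pathlib import Path

for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc*")):
    ce = (mc / "ce_count").read_text().strip()   # correctable errors
    ue = (mc / "ue_count").read_text().strip()   # uncorrectable errors
    print(f"{mc.name}: correctable={ce} uncorrectable={ue}")
# A steadily climbing correctable count on one controller is a good hint
# about which DIMM to swap before it turns into the weird, unlogged failure
# described above.
```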
Sure, but realistically, how often does this really happen? I have probably replaced 3 or 4 DIMMs over the past few years. Hardware is very reliable these days.
I've built and maintained similar setups (10PB range). Honestly, you just shove disks into it, and when they fail you replace them. You need folks around to handle things like controller / infrastructure failure, but hopefully you're paying them to do other stuff, too.
Someone has to go and power-cycle the machines every couple of months. It's chill; that's the point of not using Ceph.
You are under the assumption that only Ceph (and similar complex software) requires staff, whereas plain 30 PB can be operated basically just by rebooting from time to time.
I think that anyone with actual experience of operating thousands of physical disks in datacenters would challenge this assumption.
We have 6 months of experience operating thousands of physical disks in datacenters now! It's about a couple of hours a month of employee time in steady state.
How about all the other infrastructure? Since you are obviously not using the cloud, you must have massive amounts of GPUs and operating systems to manage. All of that has to work together; it's not as if you just keep watching the physical disks and all is set.
Don't get me wrong, I buy the actual numbers regarding hardware costs, but presenting the rest as basically a one-man show in terms of maintenance hours is the point where I'm very sceptical.
Oh, we use cloud GPUs; InfiniBand H100s absolutely aren't something we want to self-host. Not AWS though, they're crazy overpriced; mithril and sfcompute!
We also use Cloudflare extensively for everything that isn't the core heap dataset; the convenience of buckets is totally worth it for most day-to-day usage.
The heap is really just the main pretraining corpus and nothing else.
How is that going to work when the GPUs are in the cloud and the storage is in a local colo in SF, down the street from the office? I was under the impression that the GPUs have to go over the training dataset multiple times, which means transferring 30 PB in and out of the cloud multiple times. Is the data link even fast enough? How much are you charged in data transfer fees?
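For scale, a rough back-of-envelope on moving the full corpus over a network link (the link speeds are hypothetical; the 30 PB figure is from the article under discussion):

```python
# Back-of-envelope: time to move 30 PB once, at various sustained link speeds.
PETABYTE_BITS = 8 * 10**15
corpus_bits = 30 * PETABYTE_BITS

for gbps in (10, 100, 400):
    seconds = corpus_bits / (gbps * 10**9)
    print(f"{gbps:>3} Gbps: {seconds / 86_400:6.1f} days per full pass")
# 10 Gbps ~ 278 days, 100 Gbps ~ 28 days, 400 Gbps ~ 7 days, which is why
# repeatedly shipping the whole corpus to remote GPUs (plus any transfer fees)
# is exactly the crux of the question above.
```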
Not really. Have spare drives on the shelf and use the "remote-hands" feature from the CoLo provider. Just open a ticket to have the drive swapped. Pretty easy. For remote server connections just use IPMI/iKVM and iPXE. Again, not too difficult.
The biggest hurdle is getting a management system in place to alert you when something goes wrong, especially at this size. Grafana, Loki, monit, etc. are all good tools to leverage that provide quick fault identification.
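As a concrete (if simplified) example of that fault-identification layer, a sketch that polls SMART health and could feed whatever alerting stack you run; the device list is a placeholder, and it assumes smartmontools 7+ (for JSON output) and root privileges:

```python
# Sketch: poll SMART health for a set of drives and report failures.
import json
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # placeholder device list

failed = []
for dev in DEVICES:
    out = subprocess.run(["smartctl", "-H", "-j", dev],
                         capture_output=True, text=True).stdout
    report = json.loads(out)
    if not report.get("smart_status", {}).get("passed", False):
        failed.append(dev)

if failed:
    # Hook this into your alerting of choice (Alertmanager, PagerDuty, email, ...).
    print("SMART failures:", ", ".join(failed))
```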
Assuming that they end up hiring a full-time ops person at $500k total annual cost ($250k base for a datacenter wizard), that's an extra ~$42k a month, or roughly $70k/month all-in. Still about $200k per month lower than their next best offering.
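A quick sanity check on those numbers, using the $354k/year total quoted at the top of the thread plus the assumed $500k/year fully loaded hire:

```python
# Back-of-envelope on the comment above. The $354k/year hardware total is the
# figure quoted earlier in the thread; the $500k/year ops hire is the
# commenter's assumption.
hardware_monthly = 354_000 / 12   # ~$29.5k/month
ops_monthly      = 500_000 / 12   # ~$41.7k/month
total_monthly    = hardware_monthly + ops_monthly
print(f"~${total_monthly:,.0f}/month")   # ~$71,167/month, i.e. roughly $70k
```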
It's really not necessary.
I have four racks rather than ten, and less storage but more compute. All purchased new from HP with warranties.
Ordering each year takes a couple of days' work. Racking it takes another one or two.
Initial setup (working out the differences with a new generation of server, etc., and customizing the Ubuntu autoinstallation) is done in a day.
So that's a week per year for setup.
If we are really unlucky, add another week for a strange failure. (This has happened once in the 10 years I've been doing this: a CPU needed replacement by the HP engineer.)
I replaced a couple of drives in July, and a network fibre transceiver in May.
So the drives are never going to fail? PSUs are never going to burn out? You are never going to need to procure new parts? Negotiate with vendors?
This concern trolling that everyone trots out when anyone brings up running their own gear is just exhausting. The hyperscalers have melted people's brains to the point where they can't even fathom running shit for themselves.
Yes, drives are going to fail. Yes, power supplies are going to burn out. Yes, god, you’re going to get new parts. Yes, you will have to actually talk to vendors.
Big. Deal. This shit is -not- hard.
For the amount of money you save, you should be clamoring to do it yourself. The concern trolling doesn't make any sort of argument against it; it just makes you look lazy.
Very good point. There was something on the HN front page like this about self-hosted email, too.
I point out to people that AWS is between ten and one hundred times more expensive than a normal server. The response is "but what if I only need it to handle peak load three hours a day?" Even then you still come out ahead with your own server.
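A rough illustration of why the "only three hours a day at peak" objection can still favor owning the hardware; all prices here are hypothetical placeholders, not quotes from any provider:

```python
# Hypothetical numbers only: a dedicated server billed flat per month versus
# an on-demand cloud instance of similar capability run 3 hours/day.
dedicated_monthly = 150.00          # placeholder flat monthly price
cloud_hourly      = 2.00            # placeholder on-demand hourly price
hours_per_day     = 3
days_per_month    = 30

cloud_monthly = cloud_hourly * hours_per_day * days_per_month   # $180/month
print(f"dedicated: ${dedicated_monthly:.0f}/mo  cloud (3h/day): ${cloud_monthly:.0f}/mo")
# Even used only a fraction of the day, the on-demand instance can cost more,
# and the dedicated box is still available the other 21 hours at no extra cost.
```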
We have multiple colo cages. We handle enough traffic - terabytes per second - that we'll never move those to cloud. Yet management always wants more cloud. While simultaneously complaining about how we're not making enough money.
I don't think the answer is so black-and-white. IMO this only realistically applies to larger companies, or to ones that either push lots of traffic or need large amounts of compute/storage/etc.
But for smaller groups that don't have large/sustained workloads, I think they can absolutely save money compared to colo/dedicated servers using one of multiple different kinds of AWS services.
I have several customers that coast along just fine with a $50/mo EC2 instance or less, compared to hundreds per month for a dedicated server... I wouldn't call that "ten times" by any stretch.
Small companies should go for the likes of Hetzner/OVH, which is still 10+ times cheaper than AWS.
AWS is for anyone with a fear of committing to a particular amount of resource use, but once you've tried both and realized the price and performance differential, you realize you can easily overcommit by a wide margin and still come out ahead, so it's not actually that scary. Plus, nobody's stopping you from continuing to spin up EC2 instances when your real servers are fully utilized.
Hard disagree... I think these black-and-white opinions are disingenuous, lack important nuance and are often just incorrect.
I even have customers on $3/mo EC2 instances... the cheapest dedicated server on OVH is still twenty times more expensive than that. I don't think there's any way to "come out on top" with OVH in that scenario, short of maybe claiming that the customer is somehow "doing it wrong" by only paying for what they need.
And yes, Hetzner/OVH have $3-4 cloud instances too, but now you're just directly competing with AWS, and I don't see any benefit in calling one better than the other.
Thanks for this. I agree, there seems to be some sort of resistance to building and maintaining a CoLo infrastructure. In reality, it is not too difficult. As I mentioned above, spare parts on the shelf, the CoLo "remote hands" support, and a good monitoring system can lessen the impact of almost any catastrophic issue.
For the record, I have built (and currently maintain) a number of CoLo deployments. Our systems have been running for 10+ years with very few failures of either drives or PSUs. In fact, our PSU failure rate is probably 1 every 3-4 years, and we probably lose a couple of drives per year. All in all, the systems are very reliable.
They mention data loss is acceptable, so I'm guessing they're only fixing big outages.
Ignoring failed HDDs will likely mean very little maintenance.