The core of this success is this, IMO:
> Our workload is 24/7 steady. We were already at >90% reservation coverage; there was no idle burst capacity to “right size” away. If we had the kind of bursty compute profile many commenters referenced, the choice would be different.
Which TBH applies to many, many places, even if they are not aware of it.
I'd say the core of their success is running everything in a single rack in a single datacenter at first (for months? a year?) and getting lucky. Life is simple when you don't need the costs and effort of reliability upfront.
They mention having a second half-rack in a different DC.
In any case, not everyone needs five nines, and a platform is usually far more likely to go down because of some bug in your own software than because the core infrastructure fails at the rack level.
The point is valid: they mention adding that, so at one point they didn't have it. They're also only storing monitoring & observability data, which is never going to be mission critical for their customers.
It's probably the main reason why they were able to get away with this and why their application does not need scalability. I see they themselves are only offering two 9s of uptime.
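For a sense of scale between two nines and five nines, here's a back-of-the-envelope sketch. It assumes "N nines" means an availability of 1 - 10^-N measured over a 365-day year; the function name is made up for illustration and isn't from the article.

```python
def allowed_downtime_hours(nines: int, hours_per_year: float = 365 * 24) -> float:
    """Downtime budget per year for an availability of 1 - 10**-nines.

    Assumption: a 365-day year (8760 hours) and no maintenance-window exclusions.
    """
    unavailability = 10 ** (-nines)  # e.g. 2 nines -> 0.01, 5 nines -> 0.00001
    return hours_per_year * unavailability


if __name__ == "__main__":
    for n in (2, 3, 5):
        hours = allowed_downtime_hours(n)
        print(f"{n} nines: {hours:.4f} hours/year ({hours * 60:.1f} minutes)")
```

Under those assumptions, two nines leaves a budget of a few days per year, while five nines leaves only a few minutes, which is roughly why the two targets call for very different amounts of redundancy.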
They mentioned having a backup AWS cluster that would spin up when something happens.
YES, but let's also not ignore that the market is cold right now and the tech industry isn't really growing like crazy like it did during ZIRP.
So yeah, in a hot economy anything you launch grows like crazy. Then you do the thought leader talk circuit.
And in a cold economy, you can stop growing, optimize for your 24/7 steady workload, and also do the thought leader talk circuit.
Reminds me of https://www.specbranch.com/posts/one-big-server/
Nah. They could have just overprovisioned to hell for much cheaper. Boxes at Hetzner cost up to 10 times less than the equivalent AWS compute. Just overprovision for cheaper. You have to overprovision on the cloud anyway: you can't risk your users waiting 1-2 minutes until your new nodes/pods come up. So the 'cloud is good for spiky load' argument is just a lie we tell ourselves.
Well, in the cloud you do overprovision a bit, by setting autoscaling rules so that you still have spare capacity while the new resources are bootstrapping.
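Roughly what that headroom looks like, as a sketch: keep enough nodes running to absorb whatever the load grows to while a new node is still booting. All names and numbers here are hypothetical, and it assumes load grows linearly during the bootstrap window, which real traffic often won't.

```python
import math


def nodes_needed(load_rps: float, node_capacity_rps: float,
                 growth_rps_per_s: float, bootstrap_s: float) -> int:
    """Nodes to keep running so spare capacity covers the bootstrap window.

    Assumption: load grows linearly at growth_rps_per_s while a new node
    takes bootstrap_s seconds to come up.
    """
    # Load we must be able to serve before any new node is ready.
    peak_during_bootstrap = load_rps + growth_rps_per_s * bootstrap_s
    return math.ceil(peak_during_bootstrap / node_capacity_rps)


if __name__ == "__main__":
    # 1000 req/s now, 100 req/s per node, +2 req/s every second, 2-minute boot.
    print(nodes_needed(1000, 100, 2, 120))  # with headroom
    print(nodes_needed(1000, 100, 0, 0))    # bare minimum, no headroom
```

The gap between the two numbers is the capacity you pay for but don't use in steady state, which is the overprovisioning both comments are talking about.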
Even if you have that, you'll find AWS is "out of stock" and wants you to create reservations that essentially cost the same as just having the machine 24/7.