FD: I work at Amazon, and I started my career at a time when I had to submit paper requests for servers, with turnaround times measured in months.

I just don't see it. Given the nature of the services they offer, it's just too risky not to use as much managed stuff with SLAs as possible. k8s alone is a very complicated control plane plus a freaking database that is hard to keep happy if it's not completely static. In a prior life I went very deep on k8s, including self-managing clusters, and it's just too fragile; I literally had to contribute patches to etcd, and I'm not a db engineer. I kept reading the post and seeing future failure point after future failure point.

The other aspect is that there doesn't seem to be an honest assessment of the tradeoffs. It's all peaches and cream: no downsides, no tradeoffs, no risk assessment, etc.

At another big-4 hyperscaler, we ended up with substantial downtime and a lossy migration because they didn't know how to manage Kubernetes.

MicroK8s doesn't use etcd (it defaults to Dqlite, a Raft-replicated SQLite), which seems like a good tradeoff at single-rack scale: https://benbrougher.tech/posts/microk8s-6-months-later/

The article's deployment has a spare rack in a second DC, and they do a monthly cutover to AWS in case the colo provider has a two-site issue.

Spending time on that would make me sleep much better than hardening a deployment of etcd running inside a single point of failure.

What other problems do you see with the article? (Their monthly time estimates seem too low to me - they’re all 10x better than I’ve seen for well-run public cloud infrastructure that is comparable to their setup).

Managing a complex environment is hard, whether it's deployed on AWS or on prem. You always need skilled workers. On one platform you need k8s experts; on the other you need AWS experts. Let's not pretend AWS is a simple one-click, fire-and-forget solution.

And let’s be very real here: if your cloud service goes down for a few hours because you screwed something up, or because AWS deployed some bad DNS rules again, the world moves on. At the end of the day, nobody gives a shit.

Maybe I've drunk the koolaid, but I've done a lot of both systems-level work and AWS work (interestingly, I don't actually use any AWS stuff in my role here), and for a business that needs a handful of hosts in 2 AZs I can't imagine the ROI and risk profile being better if you self-host.

AWS truly does let you focus on your business logic and abstracts away a TON of undifferentiated work, well beyond the low-hanging fruit of system updates and load balancing.

Put another way: if you provide a SaaS you need to have an SLA, and those SLAs flow from SLOs and SLIs, and ultimately from a risk profile of your hardware and software. The risk of a bad HBA alone probably means a day of downtime if you don't do things perfectly. AWS has bad HBAs, CPUs, memory, disks, etc. all day long, every day, and it's not even a blip for customers, never mind downtime. And if you don't model bad HBAs in your SLAs, your board is going to be pissed when that outage inevitably happens.
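To make that concrete, here's a rough back-of-the-envelope sketch in Python. The incident rate and recovery time are made-up assumptions purely for illustration, not measurements from anyone's environment:

    # Back-of-the-envelope: what one bad-HBA incident per year does to an SLA.
    # All numbers below are illustrative assumptions.

    HOURS_PER_YEAR = 365 * 24

    incident_downtime_hours = 24   # assume one HBA failure costs a day of recovery
    incidents_per_year = 1         # assume it happens once a year

    availability = 1 - (incident_downtime_hours * incidents_per_year) / HOURS_PER_YEAR
    print(f"achieved availability: {availability:.4%}")   # roughly 99.73%

    # Error budget (allowed downtime) under some common SLO targets
    for slo in (0.999, 0.9995, 0.9999):
        budget_hours = (1 - slo) * HOURS_PER_YEAR
        status = "blown" if incident_downtime_hours > budget_hours else "ok"
        print(f"SLO {slo:.2%}: budget {budget_hours:.1f} h/yr -> {status} after one incident")

Even a 99.9% SLO only gives you about 8.8 hours of error budget per year, so a single badly handled hardware failure eats the whole thing.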

Now if you don't have SLAs and you like sysops, networkops, clusterops, dbops work then sure, YOLO.

I'm wondering if changes to tax and accounting rules for CapEx are what really sent companies to the cloud. I do know that a lot of VC-backed companies don't want to own anything physical, because that's a problem to be solved by the company that acquires them a few years down the road.

Indeed, Kubernetes is problematic for the complexity and fragility reasons. It's a scale mismatch: it's designed for big deployments with dedicated kube staff, in situations where the cost savings from bin packing etc. outweigh the costs of the resulting complexity.
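As a toy illustration of that break-even, here's a hedged sketch with entirely hypothetical numbers for node counts, utilization, and ops overhead; the point is the shape of the comparison, not the specific figures:

    # Toy break-even check: are bin-packing savings worth the k8s overhead?
    # Every number here is a made-up assumption for illustration.

    nodes_before = 20           # nodes with naive one-service-per-VM placement
    utilization_before = 0.30   # average utilization without bin packing
    utilization_after = 0.60    # hoped-for utilization with k8s bin packing
    node_cost_per_month = 400   # hypothetical per-node cost, USD

    nodes_after = round(nodes_before * utilization_before / utilization_after)
    infra_savings = (nodes_before - nodes_after) * node_cost_per_month

    ops_overhead_per_month = 0.5 * 15000  # assume half an engineer, fully loaded

    print(f"nodes: {nodes_before} -> {nodes_after}, infra savings ${infra_savings}/mo")
    verdict = "worth it" if infra_savings > ops_overhead_per_month else "not worth it"
    print(f"ops overhead ${ops_overhead_per_month:.0f}/mo -> {verdict} at this scale")

At a handful of nodes the packing savings tend to be dwarfed by the cost of the people who keep the control plane healthy; the math flips once you're packing hundreds of nodes.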

But SLAs on their own are no good (who cares about getting refunded?).

I agree that a business should use Kubernetes only if there is a clear need for that level of infrastructure automation. It's a time and money mistake to use K8s by default.

Variants like k3s are not as complicated and problematic as k8s.