Hacker News

Running EKS on AWS was their problem. If they didn't run EKS on AWS, they would've had a considerably simpler setup: running Amazon Linux, not having to upgrade Kubernetes every 3 quarters, and managing network security using security groups instead of having open internal networking. Running in a single AZ would've also eliminated inter-AZ data transfer costs. In large regions like us-east-1, an individual AZ is actually internally striped for extra redundancy, and you are much more likely to experience regional downtime than single-AZ downtime, especially if you have a stable workload and do not rely on tech beyond the rock-solid basics (EC2, VPC, ELB, S3, EBS). If you're willing to operate a single bare metal rack in a DC, you should be willing to run in a single AWS AZ.

I don't know how much time they spend configuring/dealing with Kubernetes, but I bet it's a large chunk of the 24 engineer-hours per quarter. But this is not a required expense: "EKS had an extra $1,260/month control-plane fee". Running EKS also adds a massive IAM policy maintenance overhead, whereas a non-EKS setup (EC2 with golden AMIs) results in drastically simpler IAM policies.

NAT gateways are ~$50 a month, plus per-GB data processing charges. Setting up a gateway VPC endpoint for S3 routes S3 traffic around the NAT gateway, so you avoid paying those charges on it, and the gateway endpoint itself is free.
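To put rough numbers on that, here is a back-of-the-envelope sketch. The function names are made up for illustration, and the default rates are assumptions (approximately the us-east-1 list price, so check current pricing for your region); only the arithmetic is the point.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def nat_monthly_cost(gb_processed, hourly_rate=0.045, per_gb_rate=0.045,
                     hours=HOURS_PER_MONTH):
    """Estimate the monthly USD cost of one NAT gateway.

    hourly_rate and per_gb_rate are assumed, illustrative prices.
    """
    return hours * hourly_rate + gb_processed * per_gb_rate


def s3_endpoint_savings(s3_gb, per_gb_rate=0.045):
    """NAT data processing charges avoided by sending s3_gb of S3 traffic
    through a gateway VPC endpoint instead of the NAT gateway."""
    return s3_gb * per_gb_rate


if __name__ == "__main__":
    # Hypothetical workload: 5 TB/month through the NAT, 4 TB of it S3 traffic.
    print(f"NAT gateway: ${nat_monthly_cost(5000):.2f}/month")
    print(f"Avoided with S3 gateway endpoint: ${s3_endpoint_savings(4000):.2f}/month")
```

With several NAT gateways (one per AZ is common), multiply the fixed hourly part accordingly.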

They were at 90% reservation capacity, so they should be using reservations for greater savings; in fact, running stable workloads with reservations is something that AWS excels at. A capacity reservation (a zonal Reserved Instance or an On-Demand Capacity Reservation, as opposed to a regional RI, which is only a billing discount) means that you will be able to terminate and re-launch instances even when there's a spike in demand from other users: your instance capacity is guaranteed.

Running the basics on VMs also effectively avoids vendor lock-in. Every cloud provider supports VMs with a Red Hat clone, VPCs, load balancing, networked storage, access controls, object storage and a fixed-size fleet with auto-relaunch on instance failure.

With a consistent workload, they would have very likely escaped the downtime from AWS a week ago as well, because, as per AWS, "existing EC2 instances that had been launched prior to the start of the event remained healthy and did not experience any impact for the duration of the event".

With Terraform and automation for building launchable images, you can stand up a cluster quickly in any region with secure networking, including in a separate AWS account, in the same region, for the sake of testing.

With AWS, you can set up automatic EBS backups of all your data to snapshots trivially, and even send them to a 3rd locked-down account, so they can't be accidentally wiped.
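As a minimal sketch of the snapshot-and-share step, assuming boto3: the function name `snapshot_and_share`, the volume IDs and the account ID are hypothetical, while `create_snapshot`, the `snapshot_completed` waiter and `modify_snapshot_attribute` are regular EC2 client calls. In practice you would more likely let Data Lifecycle Manager or AWS Backup do this on a schedule rather than run your own script.

```python
def snapshot_and_share(ec2, volume_ids, backup_account_id):
    """Snapshot each EBS volume and share the snapshot with another account.

    ec2 is a boto3 EC2 client (or anything with the same methods).
    Returns the list of snapshot IDs that were created.
    """
    snapshot_ids = []
    for volume_id in volume_ids:
        snap = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"backup of {volume_id}",
        )
        snapshot_id = snap["SnapshotId"]

        # Block until the snapshot has finished before sharing it.
        ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

        # Grant the backup account permission to create volumes from
        # (and therefore copy) this snapshot.
        ec2.modify_snapshot_attribute(
            SnapshotId=snapshot_id,
            Attribute="createVolumePermission",
            OperationType="add",
            UserIds=[backup_account_id],
        )
        snapshot_ids.append(snapshot_id)
    return snapshot_ids


if __name__ == "__main__":
    import boto3  # third-party; needs AWS credentials configured

    client = boto3.client("ec2")
    # Hypothetical volume and account IDs, for illustration only.
    print(snapshot_and_share(client, ["vol-0123456789abcdef0"], "111122223333"))
```

Sharing alone is not enough, since the source account can still delete the snapshot; the locked-down account should copy each shared snapshot into its own ownership (e.g. with `copy_snapshot`). Snapshots encrypted with the default AWS-managed KMS key can't be shared, so this assumes unencrypted volumes or a customer-managed key that the backup account is allowed to use.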