> I'm so surprised there is so much pushback against this.. AWS is extremely expensive.

I see more comments in favor than pushing back.

The problem I have with these stories is the confirmation bias that comes with them. Going self-hosted or on-premises does make sense in some carefully selected use cases, but I have dozens of stories of startup teams spinning their wheels with self-hosting strategies that turn into a big waste of time and headcount that they should have been using to grow their businesses instead.

The shared theme of all of the failure stories is missing the true cost of self-hosting: The hours spent getting the servers just right, managing the hosting, debating the best way to run things, and dealing with little issues add up but are easily lost in the noise if you’re not looking closely. Everyone goes through a honeymoon phase where the servers arrive and your software is up and running and you’re busy patting yourselves on the back about how you’re saving money. The real test comes 12 months later when the person who last set up the servers has left for a new job and the team is trying to do forensics to understand why the documentation they wrote doesn’t actually match what’s happening on the servers, or your project managers look back at the sprints and realize that the average time spent on self-hosting related tasks and ideas has added up to a lot more than anyone would have guessed.

Those stories aren’t shared as often. When they are, they’re not upvoted. A lot of people in my local startup scene have sheepish stories about how they finally threw in the towel on self-hosting and went to AWS and got back to focusing on their core product. Few people are writing blog posts about that because it’s not a story people want to hear. We like the heroic stories where someone sets up some servers and everything just works perfectly and there are no downsides.

You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.

> I have dozens of stories of startup teams spinning their wheels with self-hosting strategies that turn into a big waste of time and headcount that they should have been using to grow their businesses instead.

Funnily enough, the article even affirms this, though most people seem to have skimmed over it (or not read it at all).

> Cloud-first was the right call for our first five years. Bare metal became the right call once our compute footprint, data gravity, and independence requirements stabilised.

Unless you’ve got uncommon data egress requirements, if you’re worried about optimizing cloud spend instead of growing your business in the first 5 years, you’re almost certainly focusing on the wrong problem.

> You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.

This too. Most of the massive AWS-savings articles in the past few days have been from companies that do a massive amount of data egress, e.g. video transfer or, in this case, log data. If your product is sending out multiple terabytes of data monthly, hosting everything on AWS is certainly not the right choice. If your product is a typical n-tier webapp with a database, web servers, a load balancer, and some static assets, you're going to waste tons of time reinventing the wheel when you can spin up everything with redundancy & backups on AWS (or GCP, or Azure) in 30 minutes.
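
To put rough numbers on the egress point, here's a back-of-the-envelope sketch in Python. The per-GB rate and the flat server price are illustrative assumptions, not quoted prices (roughly the ballpark of list internet-egress pricing and a mid-range dedicated server with bandwidth included in the rent):

```python
# Back-of-the-envelope egress cost, using assumed illustrative rates only --
# check your provider's current price sheet before drawing conclusions.

AWS_EGRESS_USD_PER_GB = 0.09   # assumed ballpark metered internet-egress rate
FLAT_RATE_SERVER_USD = 200.0   # assumed monthly cost of a dedicated box
                               # with bandwidth included in the price

def monthly_egress_cost(tb_out_per_month: float) -> float:
    """Metered cost of pushing tb_out_per_month terabytes to the internet."""
    return tb_out_per_month * 1000 * AWS_EGRESS_USD_PER_GB

for tb in (0.1, 1, 10, 50):
    print(f"{tb:>5} TB/month: ~${monthly_egress_cost(tb):>7,.0f} metered egress "
          f"vs ~${FLAT_RATE_SERVER_USD:,.0f} flat on a dedicated box")
```

At a fraction of a terabyte a month the egress line is noise; at tens of terabytes it starts to dominate the bill, which is why the video and log-shipping stories skew so hard toward bare metal.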

All valid and important points, but missing a painful one, also rarely represented in threads like this: flaky hardware.

Almost every bare metal success story paints a rosy picture of perfect hardware (which thankfully is often the case), or basic hard failures which are easily dealt with. Disk replacement or swapping 1U compute nodes is expected and you probably have spares on hand. But it's a special feeling to debug the more critical parts that likely don't have idle spares just sitting around. The RAID controller that corrupts its memory, reboots, and rolls back to its previous known-good state. The network equipment that locks up with no explanation. Critical components that worked flawlessly for months or years, then shit the bed, but reboot cleanly.

Of course everyone built a secure management VLAN and has remote serial consoles hooked up to all such devices, right? Right? Oh good, they captured some garbled symbols. The vendor's first tier of support will surely not be outsourced offshore or read from a script, and will have a quick answer that explains and fixes everything. Right?

The cloud isn't always the right choice, but if you can make it work, it sure is nice to not deal with entire categories of problems when using it.

Not saying those things don’t happen, but having worked with on-prem for 2 years, and having run ancient (currently 13-year-old) servers in my homelab for 5 years, I’ve never seen them. Bad CPU, bad RAM, yes - and modern servers are extremely good at detecting these and alerting you.

In my homelab, in 5 years of running the aforementioned servers (3x Dell R620, and some various Supermicros) 24/7/365, the only thing I had fail was a power supply. Turns out they’re redundant, so I ordered another one, and the spare kept the server up in the meantime. If I was running these for a business, I’d keep hot spares around.

I'm glad it's working for you! It's worked for me in the past as well, but I've also felt the pain. As I mentioned before, it's often the case that things will work, but in some ways, you need to have an increased appetite for risk.

I suppose it depends on scale and requirements. A homelab isn't very relevant IMHO, because the sample size is small and the load is negligible. Push the hardware 24/7 and the cracks are more likely to appear.

A nice-to-have service can suffer some downtime, but if you're running a non-trivial/sizable business or have regulation requirements, downtime can be rough. Keeping spare compute servers is normal, but you'll be hard pressed to convince finance to spend big money on core services (db, storage, networking) that are sitting idle as backups.

Say you convinced finance to spend

Agreed that homelab load is generally small compared to a company’s (though an initial Plex cataloging run will happily max out as many cores as you give it for days).

In the professional environment I mentioned, I think we had somewhere close to 500 physical servers across 3 DCs. They were all Dell Blades, and nothing was virtualized. I initially thought that latter bit was silly, but then I saw that no, they’d pretty well matched compute to load. If needs grew, we’d get another Blade racked.

We could not tolerate unplanned downtime (or rather, our customers couldn’t), but we did have a weekly 3-hour maintenance window, which was SO NICE. It was only a partial outage for customers, and even then, usually only a subset of them at a time. Man, that makes things easier, though.

They were also hybrid AWS, and while I was there, we spun up an entirely new “DC” in a region where we didn’t have a physical one. More or less lift-and-shift, except for managed Kafka, and then later EKS.

> The shared theme of all of the failure stories is missing the true cost of self-hosting: The hours spent getting the servers just right, managing the hosting, debating the best way to run things, and dealing with little issues add up but are easily lost in the noise if you’re not looking closely.

What the modern software business seems to have lost is the understanding that ops and dev are two different universes. DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems and the role is absolutely no substitute for a systems administrator. Having someone who helps derive the requirements for your infrastructure, then designs it, builds it, backs it up, maintains it, troubleshoots it, monitors performance, determines appropriate redundancy, etc. etc. etc. and then tells the developers how to work with it is the missing link. Hit-by-a-bus documentation, support and update procedures, security incident response… these are all problems we solved a long time ago, but sort of forgot about when moving everything to cloud architecture.

> DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems and the role is absolutely no substitute for a systems administrator.

This is revisionist history. DevOps was a reaction to the fact that many/most software development organizations had a clear separation between "developers" and "sysadmins". Developers' responsibility ended when they compiled an EXE/JAR file/whatever, then they tossed it over the fence to the sysadmins who were responsible for running it. DevOps was the realization that, huh, software works better when the people responsible for building the software ("Dev") are also the people responsible for keeping it running ("Ops").

It was very much this for me. I knew the hosting side of things because my second job as a programmer was at a small ISP that hosted custom websites. I got used to maintaining Linux web and email servers by hand over SSH. There were some common scripts, but for the most part the pattern was SSH into the server and make the changes you need to make. Most of my early startup career was like this. Closely working with hardware, the server installs, hosting configs as well as the code that actually powered things.

Jump to my first "enterprise" job and suddenly I can't fix things anymore. I have to submit tickets to other teams to look at why the thing I built isn't running as expected. That, to me, was pure insanity. The sysadmins knew fuck all about my app and as far as I was concerned barely knew how to admin systems. I knew a lot more in my 20's after all. But the friction of not running what I wrote was absolutely real and one of the main killers of productivity versus my startup days.

I also have seen this from most of the "enterprise" companies that do "DevOps" when really they just mean they have a sysadmin team who uses modern tools and IaC. The same exact friction and issues exist between dev and ops as before the DevOps days. Those companies are explicitly doing DevOps wrong. When you look at the troubleshooting steps during an incident, it's identical: bring in the devs and the ops team so we can figure out what's going on. I do think startups are more likely to get DevOps right because they aren't trying to force it onto the only mental model they seem to be able to understand.

I've also found that dev teams who run and maintain their own stacks are better about automatic failure recovery and overall more reliable solutions. Whether that's due to better alignment between the app code and the app stack during development or because the dev team is now the first call when things aren't working I'm not entirely sure. Likely a mix of both.

> DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems

DevOps, conceptually, goes back to the 90s. I was using the term in 2001. If memory serves, AWS didn't really start to take off until the mid/late aughts, or at least not until they launched S3.

DevOps was a reaction to the software lifecycle problem and didn't have anything to do with AWS. If anything it's the other way around: AWS and cloud hosting gained popularity in part due to DevOps culture.

> What the modern software business seems to have lost is the understanding that ops and dev are two different universes.

This is a fascinating take, if you ask me: treating them as separate is the whole problem!

The point of being an engineer is to solve real world problems, not to live inside your own little specialist world.

Obviously there's a lot to be said for being really good at a specialized set of skills, but that's only relevant to the part where you're actually solving problems.

Read some old O’Reilly books on systems administration. Solving problems in a business domain and operating the infrastructure to support that are, indeed, very different. Urban planners needn’t understand how to orchestrate a bunch of heavy construction equipment, and civil engineers needn’t care about how the sight lines affect the revenue of the stores in the buildings. They’re all building the same exact thing.

The issue is that precious few devs are actually good at ops. There are a ton of abstractions that have sprung up that attempt to paper over this, and they all suck for various reasons.

If you think you need to actually know your programming language of choice to be a good dev, I have news for you about actually knowing how computers work to be good at ops.

Rarely do startups fail because of a decision like whether or not to self-host. In many cases it isn't even a few bad decisions, but a long series of them plus outside factors which are uncontrollable.

In my experience the AWS problem isn't so much that AWS is that costly relative to bare metal, but that most people do not execute well on AWS, over-provisioning like mad to solve design issues.

There is also a perverse incentive for AWS sales to push nonsense products. But people forget that AWS isn't even as bad as a ton of other company spend. I have seen a Fortune 100 add a few million to their annual Salesforce contract "for funzies" because at their scale it wasn't that much money.

To me it feels like nuance has been lost.

Personally, I would never self-host some B2C or B2B application if you have fewer than 50-100 techies in a healthy org. You can get so much out of a few VMs and/or a few dedicated servers at the likes of Hetzner, OVH, or AWS managed services. At least for the average web REST thingy with a DB and some file storage. I'm sure it's possible to find counter-examples.

On the other hand, we are about 120 devs at work now, a couple thousand B2B customers, 10 Platform Ops, 7 HW & DC Ops. I guess we have more ops people than a startup may have people in total. Once we get rid of VMware licensing, our colos are ridiculously cheap when amortized across 5 years compared to AWS or cloud hosting. Once EOL, they’ll also reduce cloud costs on cheaper providers for test systems and provide spontaneous failover and disaster recovery tests.
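
To give a flavor of what that amortization math looks like, here’s a minimal Python sketch. Every figure in it is a placeholder made up for illustration, not our actual numbers; the point is only that capex spread over the depreciation window, plus honest opex and people costs, is the number to compare against the cloud bill:

```python
# Minimal amortization sketch -- every figure below is a made-up placeholder,
# not real numbers; substitute your own before believing the delta.

HARDWARE_CAPEX_USD = 400_000     # assumed: servers, network gear, spares
DEPRECIATION_MONTHS = 60         # the 5-year window mentioned above
COLO_OPEX_USD_MONTH = 8_000      # assumed: rack space, power, cooling, remote hands
OPS_PEOPLE_USD_MONTH = 30_000    # assumed: share of the ops team carrying the colos
CLOUD_BILL_USD_MONTH = 90_000    # assumed: cloud bill for equivalent capacity

# Effective monthly cost once capex is spread over the depreciation window.
colo_monthly = (HARDWARE_CAPEX_USD / DEPRECIATION_MONTHS
                + COLO_OPEX_USD_MONTH
                + OPS_PEOPLE_USD_MONTH)

print(f"colo:  ~${colo_monthly:,.0f}/month")
print(f"cloud: ~${CLOUD_BILL_USD_MONTH:,.0f}/month")
print(f"delta: ~${CLOUD_BILL_USD_MONTH - colo_monthly:,.0f}/month")
```

The ops-headcount line is the one the failure stories elsewhere in this thread say people forget, so don’t be tempted to leave it at zero.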

We're now also getting good cross-team scaling processes going and at this point the big barriers are actually getting enough power and cooling, not buying/racking/maintaining systems. That will be a big price tag next year, but we've not paid that money to AWS the last two years, so it's fine.

As I keep saying internally, self-hosting is like buying a 40-ton excavator, like Large Marge, or a 40-ton truck. If you have enough stuff to utilize a 40-ton truck, it’s good. If you need to move food around in an urban environment, or need to move an organ transplant between hospitals, a 40-ton truck tends to be rather inefficient and very expensive to maintain and run.