I was about to rage at you over the first sentence, because this is so often how people start trying to argue bare metal setups are expensive. But after reading the rest: 100% this. I see so many people push AWS setups not because it's the best thing - it can be if you're not cost sensitive - but because it is what they know and they push what they know instead of evaluating the actual requirements.

Well, they aren't wrong about the bare metal either: Every organization ends up tied to its staff, and said staff was hired to work on the stack you are using. People end up in quite the fight because their supposed experts are fonder of uniformity than of learning anything new.

Many a company was stuck with a datacenter unit that was unresponsive to the company's needs, and people migrated to AWS to avoid dealing with them. This straight out happened in front of my eyes multiple times. At the same time, you also end up in AWS, or within AWS end up using tools that are extremely expensive, because the cost-benefit analysis of the individuals making the decision, who often don't know very much other than what they use right now, is just wrong for the company. The executive on top is often either not much of a technologist or 20 years out of date, so they have no way to discern the quality of their staff. Technical disagreements? They might only know which side they like to hang out with, but that's where it ends.

So for path-dependent reasons, companies end up making a lot of decisions that in retrospect seem very poor. In startups it often just kills the company. Just don't assume the error is always in one direction.

Sure, but I have seen the exact same thing happen with AWS.

At a large company I worked for, the Ops team that had the keys to AWS was taking literal months to push things to the cloud, causing problems with bonuses and promotions. Security measures were not in place, so there were cyberattacks. Passwords of critical services lapsed because they were not paying attention.

At some point it got so bad that the entire team was demoted, lost privileges, and contractors had to jump in. The CTO was almost fired.

It took months just to recover to an acceptable state, because nothing was really documented.

I can’t believe the CTO wasn’t fired for that.

The CTO was the one holding the bonuses and promotions for tech, so he just shifted the blame down when it was "investigated".

On the other hand it's not hard to believe that the CEO and the board are as sleepy as the CTO here. And the whole management team.

The worst one was when a password for an integration with the judicial system expired. They asked the DevOps engineer to open their email, and there had been daily alerts for six months. The only reason they found out it happened was because a few low-level operators made a big thing out of it.

I don't like talking about "regulatory capture" but this is the only reason this company still exists. Easy market when there's almost no competition.

The entire value proposition of AWS vs running one's own servers is basically this: is it easier to ask for permission, or forgiveness? Either you're asking for permission to get a million dollars' worth of servers / hardware / power upgrades now, or you're asking for forgiveness for spending five million dollars in AWS after 10 months. Which will be easier: permission or forgiveness?

I had not thought of it this way, but interesting point. I have seen this as well.

> Many a company was stuck with a datacenter unit that was unresponsive to the company's needs

I'd like to +1 here - it's an understated risk if you've got datacenter-scale workloads. But! You can host a lot of compute on a couple of racks nowadays, so IMHO it's a problem only if you're too successful and get complacent. In the datacenter, creative destruction is a must, and crucially, finance must be made to understand this, or they'll give you budget targets which can only mean ossification.

In orgs where I have seen this, it is usually a symptom of the data center unit being starved of resources. It's like they have only been given the choice between on-prem with ridiculous paperwork and long lead times, or paying 20x for cloud.

Like, can't we just give the data center org more money so they can over-provision hardware? Or can we not have them use that extra money to rent servers from OVH/Hetzner during the discovery phase to keep things going while we are waiting on things to get sized or arrive?

I feel like companies are unreasonably afraid of up-front cost. Never mind that they're going to pay more for cloud over the next 6 months; spending 6x the monthly cloud cost on a single server makes them hesitate.

It's the same way they always refuse to spend half my monthly salary on the computer I work on, and instead insist I use an underpowered Windows machine.

Blame finance and accounting... Compute rented in the cloud can be immediately expensed against revenues, while purchased equipment has to be depreciated over a few years. It's also why companies spend $$$$$ on labor (salaries) to solve an ops issue rather than $$$$ on some software to do it: if the business relies on the software, it looks like an ongoing cost of operating the business, whereas spending more on labor to juggle the craziness can "hide" that and make the business look more attractive to investors... And cutting labor costs is an easier way to improve the bottom line (in the short term).
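To make the book-keeping difference concrete, here's a rough sketch - every number is made up, and straight-line depreciation over three years is just an assumed schedule:

```python
# Rough sketch of opex vs. capex treatment; every number here is hypothetical.
cloud_monthly = 10_000        # cloud bill: fully expensed in the month it's incurred
server_capex = 120_000        # one-off hardware purchase (same cash total over a year)
depreciation_years = 3        # assumed straight-line depreciation schedule

# What actually hits the income statement in year one:
cloud_year_one = cloud_monthly * 12                  # 120,000 expensed
server_year_one = server_capex / depreciation_years  # only 40,000 expensed

print(f"Cloud expense booked in year one:  {cloud_year_one:,}")
print(f"Server expense booked in year one: {server_year_one:,.0f}")
# Same cash out the door, but the purchased hardware "costs" a third as much
# on this year's books - which is the incentive described above.
```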

You also don't need to commit to upfront costs. You can easily rent, rent-to-own, or lease these resources.

[deleted]

The problem is that if you over-provision and buy 2x as many resources as you need, it looks bad from a utilization standpoint. If you instead buy cloud solutions that are 2x as expensive and "auto scale", you will show much higher utilization for the same cost.

> Or can we not have them use that extra money to rent servers from OVH/Hetzner

Or just use Hetzner for major performance at low cost... Their APIs and tooling make it look like it's your own datacenter.
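As a rough illustration (a sketch against Hetzner's Cloud API - the token env var is hypothetical, and their dedicated/bare-metal servers are actually driven through a separate Robot API, so take this as indicative rather than exact):

```python
# Minimal sketch: listing servers via the Hetzner Cloud API.
# Assumes an API token in HETZNER_API_TOKEN; dedicated/bare-metal boxes are
# managed through a separate Robot API, so treat this as illustrative only.
import os
import requests

token = os.environ["HETZNER_API_TOKEN"]
resp = requests.get(
    "https://api.hetzner.cloud/v1/servers",
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
resp.raise_for_status()

for server in resp.json()["servers"]:
    print(server["name"], server["server_type"]["name"], server["status"])
```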

Your comment also jogged my memory of how terrible the bare metal days used to be. I think now, with containers, it can be better, but the other reason so many switched to cloud is that we don't need to think about buying the bare metal ahead of time. We don't need to justify it to a DevOps gatekeeper.

That so many people remember bare metal as it was 20+ years ago is a large part of the problem.

A modern server can be power cycled remotely, can be reinstalled remotely over networked media, can have its console streamed remotely, can have fans etc. checked remotely without access to the OS it's running etc. It's not very different from managing a cloud - any reasonable server hardware has management boards. Even if you rent space in a colo, most of the time you don't need to set foot there other than for an initial setup (and you can rent people to do that too).
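As a sketch of what that looks like day to day (hypothetical BMC address and credentials; assumes the management board speaks IPMI and ipmitool is installed):

```python
# Sketch: poking a server's BMC over the network with ipmitool (hypothetical
# host and credentials). All of this works even with the OS on the box down.
import subprocess

BMC = ["ipmitool", "-I", "lanplus", "-H", "10.0.0.42", "-U", "admin", "-P", "secret"]

def bmc(*args: str) -> str:
    """Run an ipmitool command against the management board and return its output."""
    return subprocess.run(BMC + list(args), capture_output=True, text=True, check=True).stdout

print(bmc("chassis", "power", "status"))   # is the box powered on?
print(bmc("sdr", "type", "Fan"))           # fan sensor readings
# bmc("chassis", "power", "cycle")         # hard power cycle, no OS involvement
# bmc("chassis", "bootdev", "pxe")         # boot from the network next time (e.g. to reinstall)
```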

But for most people, bare metal will tend to mean renting bare metal servers already configured anyway.

When the first thing you then tend to do is deploy a container runtime and an orchestrator, you're usually left with something more or less like a private cloud (depending on your needs).

As for "buying ahead of time", most managed server providers and some colo operators also offer cloud services, so that even if you don't want to deal with a multi-provider setup, you can still generally scale into cloud instances as needed if your provider can't bring new hardware up fast enough (but many managed server providers can do that in less than a day too).

I never think about buying ahead of time. It hasn't been a thing I've had to worry about for a decade or more.

> A modern server can be power cycled remotely, can be reinstalled remotely over networked media, can have its console streamed remotely, can have fans etc. checked remotely without access to the OS it's running etc. It's not very different from managing a cloud - any reasonable server hardware has management boards. Even if you rent space in a colo, most of the time you don't need to set foot there other than for an initial setup (and you can rent people to do that too).

All of this was already possible 20 years ago, with iLO and DRAC cards.

Yes, that's true, but 20 years ago a large proportion of the lower-end servers people were familiar with didn't have anything like it, and so even a whole lot of developers who remember "pre-cloud" servers have never experienced machines with them.

You are right, but I just think people miss the history when we talk about moving to the cloud. It was not that long ago that, at a reasonably sized Bay Area company, I would need to justify having new metal provisioned to stand up a service I was tasked with.

The catch is that bare metal is SO cheap and performant that you can buy legions of it and have it lying around. And datacenters, their APIs and whatnot have advanced so much that you can even have automations that automatically provision and set up your bare metal servers. With containers, it gets even better.

And, let's face it - aren't you already over-provisioning in the cloud because you can't risk your users waiting 1-2 minutes until your new nodes and pods come up? So basically the 'autoscaling' of the cloud has always been a myth.

That memory is part of the problem: it doesn't reflect today's reality. You can have an IT ops team that buys and sets up servers, and then sets up (perhaps) Kubernetes and a nice CI/CD pipeline on top of it. They can fairly easily bill individual teams for usage, and teams have to justify their costs, just like they (hopefully!) do in any sane org that's running in the cloud.

The bad old days of begging an IT ops person for a server, and then throwing a binary over the fence at them so they can grumble while they try to get it running safely in production... yeah, no, that doesn't have to be a thing anymore.

The "we" you speak of is the problem: if your org hires actual real sysadmins and operations people (not people who just want to run everything on AWS), then "you" don't have to worry about it.

It's simple enough to hire people with experience with both, or pay someone else to do it for you. These skills aren't that hard to find.

If you hire people that are not responsive to your needs, then, sure, that is a problem that will be a problem irrespective of what their pet stack is.

Considering the rapid shift from on-prem to the cloud, I think it's clearly false that the people who knew on-prem were fighting for their little area of expertise.

In my experience, the ops folks were absolutely thrilled with the arrival of the cloud, because with a trivial amount of training and a couple of certifications they had a pathway to get paid as much as, if not more than, devs, especially if they rebranded as "devops engineers" instead of "ops guys".

The only pushback against the cloud - other than from some of us engineers who were actually among the first to jump on the cloud, still really loved it, but also recognized that it wasn't the best fit for all uses and carried significant risks - came from people worried about data safety.

The latter concern has largely turned out not to be a real one yet, but a decade and a half later people are finally realizing that there are indeed many areas where the cloud may not be the best fit.

> said staff was hired to work on the stack you are using

Looking back at making hiring decisions at various levels of various organizations, this is probably the single biggest mistake I've made, multiple times: hiring people for a specific technology because that was specifically what we were using.

You'll end up with a team unwilling to change, because "you hired me for this; even if something else is best for the business, this is what I do".

Once I and the organizations shifted our mindset to hiring people who are more flexible - people who, even if they have expertise in one or two specific technologies, won't put their heads in the sand whenever changes come up - everything became a lot easier.

Exactly. If someone has "Cloud Engineer" in the headline of their resume instead of "Devops Engineer", it's already a warning sign and worth probing. If someone has "AWS|VMWare Engineer" in their bio, it's a giant red flag to me. Sometimes it's just people being aware of where they'll find demand, but often it's indicative of someone who will push their pet stack - and it doesn't matter if it's VMWare on-prem or AWS (both purely as examples; it doesn't matter which specific tech it is), it's equally bad if they identify with a specific stack irrespective of what that stack is.

I'll also tend to look closely at whether people have "gotten stuck" specialising in a single stack. It won't make me turn them down, but it will make me ask extra questions to determine how open they are to alternatives when suitable.

[dead]

The weird thing is that I'm old enough to have grown up in the pre-cloud world, and most of the stuff - file servers, proxies, DBs, etc. - isn't any more difficult to set up than AWS stuff; it's just that the skills are different.

Also there's a mindset difference - if I gave you a server with 32 cores you wouldn't design a microservice system on it, would you? After all there's nowhere to scale to.

But with AWS, you're sold the story of infinite compute you can just expect to be there, but you'll quickly find out just how stingy they can get with giving you more hardware automatically to scale to.

I don't dislike AWS, but I feel this promise of false abundance has driven the growth in complexity and resource use of the backend.

Reality tends to be that you hit a bottleneck you have a hard time optimizing away - and the more complex your architecture, the harder that is. Then you can stew.

> But with AWS, you're sold the story of infinite compute you can just expect to be there, but you'll quickly find out just how stingy they can get with giving you more hardware automatically to scale to.

This is key.

Most people never scale to a size where they hit that limit, and in most organisations where that happens, someone else has to deal with it, and so most developers are totally unaware of just how fictional the "infinite scalability" actually is.

Yet it gets touted as a critical advantage.

At the same time, most developers have never ever tried to manage modern server hardware, and seem to think it is somewhat like managing the hardware they're using at home.

But that limit is well below what you could get even in a gaming machine (AWS cpus are SMT threads, so a 32-core machine is actually 64 cpus by AWS's counting) - you can get that in a high-end workstation, and I'd guess that's way more power than most people end up using even in their largish-scale AWS projects.

> AWS cpus are SMT threads

Not on the AMD machines from m7 (and the others which share the same architecture)

>I see so many people push AWS setups not because it's the best thing - it can be if you're not cost sensitive - but because it is what they know and they push what they know instead of evaluating the actual requirements.

I kinda feel like this argument could be used against programming in essentially any language. Your company, or you yourself, likely chose to develop using (whatever language it is) because that's what you knew and what your developers knew. Maybe it would have been some percentage more efficient to use another language, but then you and everyone else would have to learn it.

It's the same with the cloud vs bare metal, though at least in the cloud, if you're using the right services, if someone asked you tomorrow to scale 100x you likely could during the workday.

And generally speaking, if your problem is at a scale where bare metal is trivial to implement, it's likely we're only talking about a few hundred dollars a month being 'wasted' in AWS. Which is nothing to most companies, especially when they'd have to consider developer/devops time.

> if someone asked you tomorrow to scale 100x you likely could during the workday.

I've never seen a cloud setup where that was true.

For starters: most cloud providers will impose limits on you that often mean going 100x would involve pleading with account managers to have limits lifted and/or scrounging together a new, previously untested combination of instance sizes.

But secondly, you'll tend to run into unknown bottlenecks long before that.

And so, in fact, if that is a thing you actually want to be able to do, you need to actually test it.

But it's also generally not a real problem. I more often come across the opposite: Customers who've gotten hit with a crazy bill because of a problem rather than real use.

But it's also easy enough to set up a hybrid setup that will spin up cloud instances if/when you have a genuine need to be able to scale up faster than you can provision new bare metal instances. You'll typically run an orchestrator and run everything in containers on a bare metal setup too, so typically it only requires having an auto-scaling group scaled down to 0, warming it up if load nears critical levels in your bare metal environment, and then flipping a switch in your load balancer to start directing traffic there. It's not a complicated thing to do.
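A minimal sketch of the "warm it up" step, assuming AWS and boto3 - the ASG name, the threshold, and where the load figure comes from are all made up, and the load-balancer switch is left out:

```python
# Sketch: scale a normally-empty "overflow" auto-scaling group up when the
# bare metal environment runs hot. ASG name, threshold, and the source of the
# load figure are all hypothetical; traffic shifting at the LB is not shown.
import boto3

OVERFLOW_ASG = "overflow-cloud-pool"   # hypothetical ASG, normally at 0 instances
LOAD_THRESHOLD = 0.8                   # warm up when bare metal load exceeds 80%

autoscaling = boto3.client("autoscaling")

def warm_up_overflow(bare_metal_load: float, instances: int = 4) -> None:
    """If the bare metal cluster is near capacity, spin up the cloud overflow pool."""
    if bare_metal_load < LOAD_THRESHOLD:
        return
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=OVERFLOW_ASG,
        DesiredCapacity=instances,
        HonorCooldown=False,
    )
    # Once the instances are healthy, flip the load balancer (not shown) to
    # start sending a share of the traffic to the cloud pool.
```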

Now, incidentally, your bare metal setup is even cheaper because you can get away with a higher load factor when you can scale into cloud to take spikes.

> And generally speaking, if your problem is at a scale where bare metal is trivial to implement, it's likely we're only talking about a few hundred dollars a month being 'wasted' in AWS. Which is nothing to most companies, especially when they'd have to consider developer/devops time.

Generally speaking, I relatively rarely work on systems that cost less than tens of thousands per month, and what I consistently see with my customers is that the higher the cost, the bigger the bare-metal advantage tends to be, as it allows you to readily amortise the initial setup costs of more streamlined/advanced setups. The few places where cloud wins on cost are the very smallest systems, typically <$5k/month.

> if you're using the right services, if someone asked you tomorrow to scale 100x you likely could during the workday.

"The right services" is I think doing a lot of work here. Which services specifically are you thinking of?

- S3? sure, 100x, 1000x, whatever, it doesn't care about your scale at all (your bill is another matter).

- Lambdas? On their own sure you can scale arbitrarily, but they don't really do anything unless they're connected to other stuff both upstream and downstream. Can those services manage 100x the load?

- Managed K8s? Managed DBs? EC2 instances? Really anything where you need to think about networking? Nope, you are not scaling this 100x without a LOT of planning and prep work.

> Nope, you are not scaling this 100x without a LOT of planning and prep work.

You're not getting a 100x increase in instances without justifying it to your account manager anyway, long before you figure out how to get it to work.

EC2 has limits on the number of instances you can request, and it certainly won't let you go 100x unless you've done it before and have already gone through the hassle of getting them to raise your limits.

On top of that, it is not unusual to hit availability issues with less common instance types. Been there, done that, had to provision several different instance types to get enough.
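For what it's worth, you can at least check the ceiling before you need it. A minimal sketch via the Service Quotas API - I believe the quota code below is the vCPU limit for the standard On-Demand families, but treat that as an assumption and verify it for your own account and region:

```python
# Sketch: checking the account's ceiling for running On-Demand Standard EC2
# vCPUs via the Service Quotas API. The quota code is the one I believe covers
# the standard instance families; verify it for your account/region.
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

resp = quotas.get_service_quota(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",  # "Running On-Demand Standard ... instances" (vCPU limit)
)
quota = resp["Quota"]
print(f"{quota['QuotaName']}: {quota['Value']:.0f} vCPUs")
# Raising it means filing a quota increase request (and often a chat with your
# account manager) - not something that happens automatically mid-scale-up.
```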

I hit it quite frequently with a particularly popular EKS node instance type in us-east-1 (of course). I'm talking about requesting like 5-6 instances, nothing crazy. Honestly, I wonder if ECS or Fargate have the same issue.

So, I was around back then and am around now as a principal, and this comment doesn't really pass the reality sniff test.

It's a lot worse than this in terms of AWS cost for apps that often barely any people use. They're often incorrectly provisioned, and the AWS bill ends up in the hundreds of thousands or millions when it could have been a few thousand on bare metal at Hetzner with a competent sysadmin team. No, it's not harder to administer bare metal. No, it's not less reliable. No, it's not substantially harder for most companies to scale on bare metal (large Fortune 50s excluded).

I've been selling a cost-reduction service for a while, and the hardest aspect of selling it is that so many people on the tech side don't care, because they don't seem to be held to account for the drain they cause.

I can go in and guarantee that my fees are capped at a few months worth of their savings, and still it's a hard sell with a lot of teams who are perfectly happy to keep burning cash.

And I'll note, as much as I love to get people off AWS, most of the time people can massively reduce their bill just by using AWS properly as well, so even if bare metal were bad for their specific circumstances, they're still figuratively setting fire to piles of cash.

> people push AWS setups not because it's the best thing - it can be if you're not cost sensitive

This is so weird to me, because if you're running a company, you should be cost-sensitive. Sure, you might be willing to spend extra money on AWS in the very beginning if it helps you get to market faster. But after that, there's really no excuse: profit margin should be a very important consideration in how you run your infrastructure.

Of course, if you're VC backed, maybe that doesn't matter... that kind of company seems to mainly care about user growth, regardless of how much money is being sent to the incinerator to get it.

I was checking the appetite for some cost reduction service a while back and one of the responses I got was from a CTO telling me he didn't need to care about cost because they'd just gotten funded and had lots of cash in the bank.

It's perfectly valid to not want to put engineering effort into it at the "wrong time" when delivering features will give you a higher return, but it came across as a lack of interest in paying attention to cost at all.

I saw a lot of that attitude from the tech side when I was looking at this. A lot of the time the CFO or CEO would be appalled, because they were actually paying attention to burn rates, but they were often getting stonewalled by a tech side who'd just insist all the costs were necessary - even while they often didn't know what they were spending, or on what.

I only work at companies that are using cloud because I hate administering systems and I hate dealing with system administrators when I need resources.