The assignment of blame for misconfigured cloud infra or DoS attacks is so interesting to me. There don't seem to be many principles at play; it's all fluid and contingent.

Customers demand frictionless tools for automatically spinning up a bunch of real-world hardware. If you put this in the hands of inexperienced people, they will mess up and end up with huge bills, and you take a reputational hit for demanding thousands of dollars from the little guy. If you decide to vet potential customers ahead of time to make sure they're not so incompetent, then you get a reputation as a gatekeeper with no respect for the little guy who's just trying to hustle and build.

I always enjoy playing at the boundaries in these thought experiments. If I run up a surprise $10k bill, how do we determine what I "really should owe" in some cosmic sense? Does it matter if I misconfigured something? What if my code was really bad, and I could have accomplished the same things with 10% of the spend?

Does it matter who the provider is, or should that not matter to the customer in terms of making things right? For example, do you get to demand payment on my $10k surprise bill because you are a small team selling me a PDF generation API, even if you would ask AWS to waive your own $10k mistake?

How about spending caps / circuit breakers? Doesn't seem like an unsolvable problem to me.

Then you’re the person who took down their small business when they were doing well.

At AWS I’d consistently have customers who’d architected horrendously who wanted us to cover their 7/8 figure “losses” when something worked entirely as advertised.

Small businesses often don’t know what they want, other than not being responsible for their mistakes.

Everyone who makes this argument assumes that every website on the internet is a for-profit business, when in reality the vast majority of websites aren't trying to make any profit at all; they aren't businesses. In those cases, yes, absolutely, the owners would rather have the site brought down than get the bill.

Or instead of an outage, simply have a bandwidth cap or request rate cap, same as in the good old days when we had a wire coming out of the back of the server with a fixed maximum bandwidth and predictable pricing.

There are plenty of options on the market with fixed bandwidth and predictable pricing. But for various reasons, these businesses prefer the highly scalable cloud services. They signed up for this.

Every business has a bill it is unprepared to pay without evaluating and approving a budget, even under successful conditions and even if that approval step takes ten seconds. It's obvious that Amazon doesn't add this step because the extra profit outweighs any other concern.

The solution is simple: budget caps.

Yes and no. 100% accurate billing is not available in real time, so it's entirely possible that you have reached and exceeded your cap by the time the overage is detected.

Having said that, within AWS there are the concepts of a "budget" and a "budget action", whereby you can modify an IAM role to deny costly actions. When I was doing AWS consulting, I had a customer who was concerned about Bedrock costs, and it was trivial to set this up with Terraform. The biggest PITA is that it takes 48-72 hours for all the prerequisites to be available (cost data, cost allocation tags, and the budget itself can each take 24 hours).
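For anyone who wants the gist without digging through docs, here's roughly the same idea expressed with boto3 rather than Terraform. It's a sketch with made-up names and numbers (the account ID, role names, deny-policy ARN, and the $500 cap are all placeholders), not a drop-in config:

```python
# Hypothetical sketch: an AWS budget plus a budget action that attaches a deny
# policy to a role once actual spend crosses a threshold. All names/ARNs below
# are placeholders.
import boto3

ACCOUNT_ID = "111111111111"  # placeholder account id
budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        "BudgetName": "bedrock-monthly-cap",            # hypothetical name
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
)

budgets.create_budget_action(
    AccountId=ACCOUNT_ID,
    BudgetName="bedrock-monthly-cap",
    NotificationType="ACTUAL",
    ActionType="APPLY_IAM_POLICY",
    ActionThreshold={
        "ActionThresholdValue": 100.0,                  # act at 100% of the budget
        "ActionThresholdType": "PERCENTAGE",
    },
    Definition={
        "IamActionDefinition": {
            # Hypothetical deny policy (e.g. one denying bedrock:InvokeModel)
            # attached to the role the app runs under.
            "PolicyArn": f"arn:aws:iam::{ACCOUNT_ID}:policy/DenyBedrock",
            "Roles": ["app-runtime-role"],
        }
    },
    ExecutionRoleArn=f"arn:aws:iam::{ACCOUNT_ID}:role/budget-action-role",
    ApprovalModel="AUTOMATIC",                          # apply without manual approval
    Subscribers=[{"SubscriptionType": "EMAIL", "Address": "ops@example.com"}],
)
```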

The circuit breaker doesn’t need to be 100% accurate. The detection just needs to be quick enough that the excess operating cost incurred by the delay is negligible for Amazon. That shouldn’t really be rocket science.

We're talking about a $2.5T company. Literally every example in this thread is already negligible to Amazon, even without circuit breakers.

Implementing that functionality across AWS would cost orders of magnitude more than simply refunding the occasional $100k charge.

The point is that by not implementing such configurable caps, they are not being customer friendly, and the argument that it couldn’t be made 100% accurate is just a very poor excuse.

Sure, not providing that customer-friendly feature earns them higher profits, but that's exactly the criticism.

They also refuse refunds, because it's profitable, even if the customer is unhappy about paying.

If it were highly profitable for them to implement some form of budget cap cutoffs, they would! It's obvious it's not a game they are interested in.

What about 90% accurate?

Is it simple? So what happens when you hit the cap, does AWS delete the resources that are incurring the cost and destroy your app?

Imagine the horror stories on Hacker News that would generate.

Stop accepting requests, the way it's been done since the beginning of time?

Or simply return 503? Why would you go directly to destroying things??
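At the application layer, "stop serving past the cap" is about this much code. A toy sketch (over_cap() is a stand-in for whatever billing or metering signal actually exists):

```python
# Minimal sketch: once a spend cap is tripped, answer every request with 503
# instead of tearing anything down. over_cap() is a placeholder for the real
# metering/billing check.
def over_cap():
    return False  # e.g. read a flag set by a billing alarm


class SpendCapMiddleware:
    """WSGI middleware that pauses service instead of deleting anything."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if over_cap():
            start_response(
                "503 Service Unavailable",
                [("Content-Type", "text/plain"), ("Retry-After", "3600")],
            )
            return [b"Spending cap reached; service paused, nothing deleted.\n"]
        return self.app(environ, start_response)
```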

Suppose you're going over the billing cap because of your storage consumption: how would AWS stop the continued consumption without deleting storage?

Why would they need to delete storage? They could just stop accepting new data past the cap.

Storage billing is partly time-based.

EBS is billed by the second (with a one minute minimum, I think).

Once a customer hits their billing cap, either AWS has to give away that storage, have the bill continue to increase, or destroy user data.
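To put rough numbers on that (a back-of-the-envelope sketch; the $0.08/GB-month rate is an assumed example price, not a quote):

```python
# Back-of-the-envelope sketch of time-based storage billing. The rate is an
# assumed example price; check the current price list for real numbers.
GB = 1000                   # a 1 TB volume sitting at the cap
RATE_PER_GB_MONTH = 0.08    # assumed example rate, USD
HOURS_PER_MONTH = 730       # typical billing month

per_hour = GB * RATE_PER_GB_MONTH / HOURS_PER_MONTH
per_day = per_hour * 24
print(f"${per_hour:.4f}/hour, ${per_day:.2f}/day just for the volume existing")
# ~$0.11/hour, ~$2.63/day with zero new writes: the bill keeps climbing,
# which is exactly the give-it-away / keep-billing / delete-data trilemma.
```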

I think most of the "horror stories" aren't related to cases like this, so we can at least agree that most of them could be easily avoided before we look at solutions to these more nuanced problems. One of those solutions would be clearly communicating how the limit works and what the daily cost of keeping the maxed-out storage would be; for a free account, the settings could be adjusted so those "costs" stay within the free quota.

Not everything on AWS is a web app.

Close the TCP session? Don't send back the UDP response? Stop scheduling time on the satellite transceiver for that account?

Interesting that you mention UDP, because I'm in the process of adding hard limits to my service that handles UDP. It's not trivial, but it is possible, and while I'm unsympathetic to folks throwing shade at AWS for not having it, I decided a while back that it was worth adding to my service. My market is experimenters and early-stage projects though, which is different from AWS (most of its revenue comes from huge users), so I can see why they lean more toward the "buyer beware" side.
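The shape of such a limit is roughly this. A toy sketch, not anyone's production code; the account lookup and quota numbers are stand-ins:

```python
# Toy sketch of hard-limiting a UDP service: datagrams from an account that has
# exhausted its quota are silently dropped (no reply), so no further cost is
# incurred on its behalf. Account lookup and quota values are made up.
import socket
from collections import defaultdict

QUOTA_BYTES = 10 * 1024 * 1024       # hypothetical per-account allowance
usage = defaultdict(int)             # account -> bytes handled so far


def account_for(addr):
    return addr[0]                   # stand-in: key accounts by source IP


sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9000))

while True:
    data, addr = sock.recvfrom(2048)
    acct = account_for(addr)
    if usage[acct] >= QUOTA_BYTES:
        continue                     # over the hard limit: drop, don't reply
    usage[acct] += len(data)
    sock.sendto(b"OK\n", addr)       # placeholder for the real response
```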

Everything on AWS can deny a request, no matter what the API happens to be.

While I can imagine a budget overrun from storage, most (all?) of the "horrors" on the page are from compute or access.

Set it up so that machines are deleted but EBS volumes remain. The S3 bucket is locked out, but the data is safe.
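For the S3 half, one way the lock-out could look is an explicit deny on reads and writes while the objects stay put. A hypothetical sketch (the bucket name and action list are placeholders, and an admin can lift the policy later):

```python
# Hypothetical sketch: deny GetObject/PutObject on the bucket so it stops
# serving and accepting traffic, while the data itself is untouched. Bucket
# name and action list are placeholders.
import json
import boto3

s3 = boto3.client("s3")
lockout_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "SpendCapLockout",
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::example-capped-bucket/*",
    }],
}
s3.put_bucket_policy(
    Bucket="example-capped-bucket",
    Policy=json.dumps(lockout_policy),
)
```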

I mean, would you rather have a $10k bill or have your server forcefully shut down after you hit $1k in three days?

Each of those things matters more to different types of business. In some situations, any downtime at all costs thousands per hour. In others, the service staying online is only worth hundreds of dollars a week.

So yes, the solution is as simple as giving the user hard spend caps that they can configure. I'd also set the default limits low for new accounts with a giant, obnoxious, flashing red popover that you cannot dismiss until you configure your limits.

However, this would generate less profit for Amazon et al. They have certainly run this calculation and decided they'd earn more money from careless businesses than they'd gain in goodwill. And we all know that goodwill has zero value to companies at FAANG scale. There's absolutely no chance that they haven't considered this. It's partially implemented and an incredibly obvious solution that everyone has been begging for since cloud computing became a thing. The only reason they haven't implemented this is purely greed and malice.

If you want hard caps, you can already do it. It’s not a checkbox in the UX, but the capability is there.

> Is it simple? So what happens when you hit the cap, does AWS delete the resources that are incurring the cost and destroy your app?

Sounds like you're saying "there aren't caps because it's hard".

> If you want hard caps, you can already do it. ... the capability is there.

What technique are you thinking of?

There are several satisfactory solutions available. Every other feature they offer was built with tradeoffs and ambiguous requirements they had to make a call on. This is obviously a misaligned incentive rather than an impossibility: if they could make more money from it, they would already be offering something. Gaps in the product offering are not merely technical impossibilities.

Yes, that's exactly the expected behavior. It can alert when it's close to the threshold. Very straightforward from my point of view.

Surely that's the fault of the purchaser setting the cap too low.

Maybe rather than completely stopping the service, it'd be better to rate limit the service when approaching/reaching the cap.
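For example, a token bucket whose refill rate shrinks as spend approaches the cap. A sketch with made-up numbers; the spend and cap inputs would come from whatever billing feed is available:

```python
# Sketch of "throttle instead of kill": a token bucket whose refill rate scales
# down as spend approaches the cap. Spend/cap values are made-up inputs.
import time


class ThrottleBucket:
    def __init__(self, base_rate, capacity):
        self.base_rate = base_rate    # tokens per second at zero spend
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, spend, cap):
        # Refill more slowly the closer spend gets to the cap.
        headroom = max(0.0, 1.0 - spend / cap)
        now = time.monotonic()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.base_rate * headroom,
        )
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # reject (or queue) instead of deleting anything


bucket = ThrottleBucket(base_rate=100, capacity=200)
# bucket.allow(spend=950.0, cap=1000.0) -> requests trickle through at ~5% of base rate
```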

Using that logic, isn’t it the fault of the user to set up an app without rate limiting?

It's misleading to promote a free tier that can then incur huge charges without being able to specify a charge cap.

If it can incur any charges at all then it isn't free.

Maybe, but it's a huge reason to use real servers instead of serverless.

I mean, real servers get hit with things like bandwidth fees, so it's not a 100% solution.

Not even remotely the same scale of problem. Like at all.

If your business suddenly starts generating TBs of traffic (that is not a DDoS), you'd be thrilled to pay overage fees, because your business just took off.

You don't usually get $10k bandwidth fees because your misconfigured service consumes too much CPU.

And besides that, for most of these cases, a small business can host on-prem with zero bandwidth fees of any type, ever. If you can get by with a gigabit uplink, you have nothing to worry about. And if you're at the scale where AWS overages are a real problem, you almost certainly don't need more than you can get with a surplus server and a regular business grade fiber link.

This is very much not an all-or-nothing situation. There is a vast segment of industry that absolutely does not need anything more than a server in a closet wired to the internet connection your office already has. My last job paid $100/mo for an AWS instance to host a GitLab server for a team of 20. We could have gotten by with a junk laptop shoved in a corner and got the exact same performance and experience. It once borked itself after an update and railed the CPU for a week, which cost us a bunch of money. Would never have been an issue on-prem. Even if we got DDoSed or somehow stuck saturating the uplink, our added cost would be zero. Hell, the building was even solar powered, so we wouldn't have even paid for the extra 40W of power or the air conditioning.

Depends where you order your server. If you order from the same scammers that sell you "serverless" then sure. If you order from a more legitimate operator (such as literally any hosting company out there) you get unmetered bandwidth with at worst a nasty email and a request to lower your usage after hitting hundreds of TBs transferred.