Quite close to the recent AWS outage. Let me take a look and see if it's a major one similar to the AWS incident.
Any guess on what's causing it?
In hindsight, I guess the foresight of some organizations to go multi-cloud was correct after all.
We're multi-cloud and it really saved a few workloads last week with the AWS issue.
It's not easy though.
This is the eternal tension for early-stage builders, isn't it? Multi-cloud gives you resilience, but adds so much complexity that it can actually slow down shipping features and iterating.
I'm curious—at what point did you decide the overhead was worth it? Was it after experiencing an outage, or did you architect for it from day one?
As someone launching a product soon (more on the builder/product side than infra-engineer), I keep wrestling with this. The pragmatist in me says "start simple, prove the concept, then layer in resilience." But then you see events like this week and think "what if this happens during launch?"
How did you handle the operational complexity? Did you need dedicated DevOps folks, or are there patterns/tools that made it manageable for a smaller team?
I don't think I would recommend multi-cloud right out of the gate unless you already have a lot of experience in the space or there is strong demand from your customers. There's a tremendous amount of overhead with security/compliance, incident management, billing, tooling, entitlements, etc. A number of external factors drove our decision to do it; resiliency is just one of them. But we are a pretty big shop, spending ~$10M/mo on cloud infra, with ~100 people in the platform management department.
I would recommend focusing on multi-region within a single CSP instead (both for your workloads AND your tooling), which covers the vast majority of incidents and lays some of the architectural foundation for multi-cloud down the road. Develop failover plans for each service in your architecture (e.g., planned and tested runbooks to migrate to Traffic Manager in the event Azure Front Door (AFD) goes down).
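To make that concrete, here's a minimal sketch of the kind of health probe that would sit in front of such a runbook. Everything in it is a hypothetical placeholder (the hostnames, the /healthz path, the thresholds); the actual DNS cutover step would be whatever your tested runbook says. This just decides when to page someone:

    # health_probe.py - minimal sketch of a probe backing a failover runbook.
    # All endpoints/thresholds below are hypothetical placeholders.
    import time
    import requests

    AFD_ENDPOINT = "https://app.example-afd.net/healthz"             # hypothetical AFD host
    TM_ENDPOINT = "https://app.trafficmanager.example.net/healthz"   # hypothetical fallback
    FAILURE_THRESHOLD = 3          # consecutive failures before paging
    PROBE_INTERVAL_SECONDS = 30

    def is_healthy(url: str, timeout: float = 5.0) -> bool:
        """Return True if the endpoint answers with a 2xx within the timeout."""
        try:
            return requests.get(url, timeout=timeout).ok
        except requests.RequestException:
            return False

    def main() -> None:
        consecutive_failures = 0
        while True:
            if is_healthy(AFD_ENDPOINT):
                consecutive_failures = 0
            else:
                consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                # In a real setup this would page on-call and kick off the
                # documented runbook (e.g. repoint DNS at Traffic Manager).
                fallback_ok = is_healthy(TM_ENDPOINT)
                print(f"AFD down {consecutive_failures}x in a row; "
                      f"Traffic Manager fallback healthy: {fallback_ok}")
            time.sleep(PROBE_INTERVAL_SECONDS)

    if __name__ == "__main__":
        main()

The point isn't the script itself, it's that the probe and the cutover steps are written down and rehearsed before the incident, not improvised during it.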
Also, choose your provider wisely. We experience 3-5x as many service-impacting incidents on Azure as we do on AWS. I'm sure others have different experiences, but I personally would never start a company on Azure. AWS has its own issues, of course, but reliability has not been a major one (relatively speaking) over the past 10 years. Last week's DynamoDB incident in us-east-1 had zero impact on our AWS workloads in other regions.
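For anyone wondering why the blast radius stayed regional: clients are pinned to regional endpoints, so a us-east-1 outage doesn't sit on the request path for other regions. A minimal sketch, assuming boto3 (the table and key names are made up, and it presumes credentials and an existing table):

    # Each client resolves to its own regional DynamoDB endpoint, so there is
    # no shared dependency on us-east-1 at request time.
    import boto3

    use1 = boto3.client("dynamodb", region_name="us-east-1")
    usw2 = boto3.client("dynamodb", region_name="us-west-2")

    # A request through the us-west-2 client never transits us-east-1,
    # so it keeps working during a us-east-1 incident.
    resp = usw2.get_item(
        TableName="orders",                    # hypothetical table
        Key={"order_id": {"S": "12345"}},      # hypothetical key
    )
    print(resp.get("Item"))

Where this falls apart is when your own tooling (CI/CD, dashboards, auth) is single-homed in the affected region, which is why I said multi-region for workloads AND tooling above.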
It's always freakin DNS...
Trusting AI without sufficient review and oversight of changes to production.
Yeah, these things never happened when humans were trusted without sufficient review and oversight of changes to production.
Do you have any insight here, or do you just dislike AI? Incidents like this happened long before AI-generated code.
I don't think it's meant to be serious. It's a comment on Microsoft laying off their staff and stuffing their Azure and .NET teams with AI product managers.
Cost-cutting attempts.