Preliminary post-incident review: https://azure.status.microsoft/en-gb/status/history/
Timeline
15:45 UTC on 29 October 2025 – Customer impact began.
16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
16:15 UTC on 29 October 2025 – We began the investigation and started to examine configuration changes within AFD.
16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.
16:20 UTC on 29 October 2025 – Targeted communications to impacted customers sent to Azure Service Health.
17:26 UTC on 29 October 2025 – Azure portal failed away from Azure Front Door.
17:30 UTC on 29 October 2025 – We blocked all new customer configuration changes to prevent further impact.
17:40 UTC on 29 October 2025 – We initiated the deployment of our ‘last known good’ configuration.
18:30 UTC on 29 October 2025 – We started to push the fixed configuration globally.
18:45 UTC on 29 October 2025 – Manual recovery of nodes commenced while gradual routing of traffic to healthy nodes began after the fixed configuration was pushed globally.
23:15 UTC on 29 October 2025 – PowerApps mitigated its dependency, and customers confirmed mitigation.
00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers.
33 minutes from impact to status page for a complete outage is a joke.
In Microsoft's defense, Azure has always been a complete joke. It's extremely developer unfriendly, buggy and overpriced.
If you call that defending Microsoft, I'd hate to see what attacking them looks like :)
Not to put too fine a point on it, but if I have a dark passenger in my tech life it is almost entirely caused by what Microsoft wants to inflict on humanity and, more importantly, how successful they are at doing it.
In the commenter's defense, their comment makes no sense.
Save it for when they stick Copilot into Azure portal.
Ha, you haven't used it recently, have you? Copilot is already there, and it can't do a single useful thing.
Me: "How do I connect [X] to [Y] using [Z]?"
Copilot: "Please select the AKS cluster you'd like to delete"
Perfect answer /s
Actually, one of the inventors of k8s was the project lead for Copilot in the Azure portal, and they deployed it over a year ago.
They're already doing that.
> In Microsoft's defense, Azure has always been a complete joke. It's extremely developer unfriendly, buggy and overpriced.
Don't forget extremely insecure. There's a critical, trivially exploitable cross-tenant CVE roughly every quarter, and it has been like that for years.
Given how much time I spent on my first real multi-tenant project, dealing with the consequences of architecture decisions meant to prevent these sorts of issues, I can see clearly the temptation to avoid dealing with them.
But what we do when things are easy is not who we are. That's a fiction. It's how we show up when we are in the shit that matters. It's discipline that tells you to voluntarily go into all of the multi-tenant mitigations instead of waiting for your boss to notice and move the goalposts you should have moved on your own.
My favourite was the Azure CTO complaining that Git was unintuitive, clunky and difficult to use.
Sounds like he’s describing Windows Phone.
Feel like I have to defend Windows Phone here, I liked it! Although I swore off the platform after the hardware I bought wasn’t eligible for the Windows Phone 8 upgrade even though the hardware was less than two years old. They punished early adopters.
Yeah, Windows Phone's first releases were decent. I actually developed apps for Windows using the UWP framework, but there weren't enough users on the platform, sadly.
Isn’t it?
Ironically, the GitHub Desktop Windows app is quite nice.
Yes. But the point is that, compared to Azure in places, the statement was very much the pot commenting on the kettle's sooty arse. And Git makes no particular pretence of being friendly, just of doing a particular job efficiently.
I've only used Azure; to me it seems fine-ish. Some things are rather overcomplicated and it's far from perfect, but I assumed the other providers were similarly complicated and imperfect.
Can't say I've experienced many bugs in there either. It definitely is overpriced, but I assume they all are?
They are all broken, weird, and expensive in their own ways. It's nothing unique to Azure.
Some are much worse than others…
For something fairly small, they are about the same.
At a large scale, Azure is dramatically worse than AWS.
Worse at what?
Pretty much anything
Hmm, isn't that the same argument we use in defense of Windows and MS Teams?
As a technologist, you should always avoid MS. Even if they have a best-in-class solution for some domain, they will use that to leverage you into their absolute worst-in-class ecosystem.
I see Amazon using a subset of the same sorts of obfuscations that Microsoft was infamous for. They just chopped off the crusts so it's less obvious that it's the same shit sandwich.
That's about how long it took to bubble up three levels of management and then go past the PR and legal teams for approvals.
More importantly:
> 15:45 UTC on 29 October 2025 – Customer impact began.
> 16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
A 19-minute delay in alerting is a joke.
10 minutes to alert, to avoid flapping false positives. A 10-minute response window for first responders. Or a 5-minute window before failing over to backup alerts, and 4 minutes to wake up, have coffee, and open the appropriate windows.
I'd like to think that a company the size of Microsoft can afford to have incident response teams in enough time zones to cover basic operations without relying on night shifts.
That’s some very carefully chosen phrasing.
I think if you really wanted to do on-call right, to avoid gaps, you’d want no more than 6 hours on primary per day per shift, and you’d want six, not four, shifts per day. So you’re only alone for four hours in the middle of your shift and have plenty of time to hand off.
That does not say it took 19 minutes for alerts to appear. "Following" could mean any amount of time.
It's 19 minutes until active engagement by staff. And planned rolling restarts can trigger alerts if you don't set thresholds of time instead of just thresholds of count.
It would be nice though if alert systems made it easy to wire up CD to turn down sensitivity during observed actions. Sort of like how the immune system turns down a bit while you're eating.
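Something like this rough sketch is what I have in mind; the names and thresholds are invented for illustration, not any real alerting product's API:

    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone

    @dataclass
    class MaintenanceWindow:
        # Registered by the CD pipeline right before it starts rolling.
        start: datetime
        end: datetime

        def active(self, now: datetime) -> bool:
            return self.start <= now <= self.end

    def should_page(error_rate: float, request_failure_rate: float,
                    windows: list[MaintenanceWindow]) -> bool:
        # Hard request failures page at the same threshold around the clock.
        if request_failure_rate > 0.05:
            return True
        # The noisier overall error rate is relaxed while a deploy is in flight.
        now = datetime.now(timezone.utc)
        in_deploy = any(w.active(now) for w in windows)
        threshold = 0.20 if in_deploy else 0.05
        return error_rate > threshold

    deploy_window = MaintenanceWindow(
        start=datetime.now(timezone.utc),
        end=datetime.now(timezone.utc) + timedelta(minutes=30),
    )
    print(should_page(error_rate=0.12, request_failure_rate=0.01,
                      windows=[deploy_window]))   # False while the deploy window is open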
Unfortunately, that is also typical. I've seen it take longer than that for AWS to update their status page.
The reason is probably because changes to the status page require executive approval, because false positives could lead to bad publicity, and potentially having to reimburse customers for failing to meet SLAs.
Perhaps they could set the time to when it really started after executive approval.
AWS is either “on it” or they will say something somewhere between 60 and 90 minutes after impact.
We should be lucky MSFT is so consistent!
Hug ops to the Azure team, since management is shredding talent over there.
And for a while the status was "there might be issues on the Azure portal".
There might have been, but they didn't know because they couldn't access it. Could have been something totally unrelated.
I've been on bridges where people _forgot_ to send comms for dozens of minutes. Too many inexperienced people around these days.
At 16:04 “Investigation commenced”. Then at 16:15 “We began the investigation”. Which is it?
Quick coffee run before we get stuck in mate
Load some carbs with chocolate chip cookies as well, that’s what I would do.
You don’t want to debug stuff with low sugar.
One crash after another
burn a smoko and take a leak
I read it as the second investigation being specific to AFD, the first more general.
I think you’re right. I missed that subtlety on first reading.
>> We began the investigation and started to examine configuration changes within AFD.
Troubleshooting has completed
Troubleshooting was unable to automatically fix all of the issues found. You can find more details below.
>> We initiated the deployment of our ‘last known good’ configuration.
System Restore can help fix problems that might be making your computer run slowly or stop responding.
System Restore does not affect any of your documents, pictures, or other personal data. Recently installed programs and drivers might be uninstalled.
Confirm your restore point
Your computer will be restored to the state it was in before the event in the Description field below.
“Our protection mechanisms, to validate and block any erroneous deployments, failed due to a software defect which allowed the deployment to bypass safety validations.”
Very circular way of saying “the validator didn’t do its job”. This is AFAICT a pretty fundamental root cause of the issue.
It’s never good enough to have a validator check the content and hope that it finds all the issues. Validators are great and can speed a lot of things up, but because they are independent code paths they will always miss something. For critical services you have to assume the validator will be wrong, and be prepared to contain the damage WHEN it is wrong.
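For illustration, containment can be as simple as never letting a freshly validated config reach everything at once. This is only a sketch with invented names, not how AFD actually deploys configuration:

    import time

    RINGS = ["canary", "region-1", "region-2", "global"]

    def push_config(ring: str, config: bytes) -> None:
        # Placeholder: apply the new config to a single ring of nodes.
        raise NotImplementedError

    def ring_healthy(ring: str) -> bool:
        # Placeholder: check real traffic signals (5xx rate, latency, ...).
        raise NotImplementedError

    def restore_last_known_good(ring: str) -> None:
        # Placeholder: roll the ring back to the previous config.
        raise NotImplementedError

    def staged_rollout(config: bytes, soak_seconds: int = 600) -> bool:
        # Even a config that passed validation only reaches the next ring after
        # the previous one has soaked and still looks healthy, so a bad config
        # that slipped past the validator is contained instead of going global.
        for ring in RINGS:
            push_config(ring, config)
            time.sleep(soak_seconds)
            if not ring_healthy(ring):
                restore_last_known_good(ring)
                return False
        return True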
What puzzles me too is the time it took to recognize an outage.
Looks like there was no monitoring and no alerts.
Which is kinda weird.
I've seen sensitivity get tuned down to avoid false positives during deployments or rolling restarts for host updates. And to a lesser extent for autoscaling noise. It can be hard to get right.
I think it's perhaps a gap in the tools. We apply the same alert criteria at 2 am that we do while someone is actively running deployment or admin tasks. There's a subset that should stay the same, like request failure rate, and others that should be tuned down, like overall error rate and median response times.
And it means one thing if the failure rate for one machine is 90% and something else if the cluster failure rate is 5%, but if you've only got 18 boxes it's hard to discern the difference. And which is the higher-priority error may change from one project to another.
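As a toy illustration of that distinction (node names and thresholds invented for the example):

    def classify(per_node_failure_rates: dict[str, float], cluster_rate: float) -> str:
        # One node failing hard and the whole fleet degrading slightly are
        # different incidents, even when the cluster-wide number looks similar.
        worst_node, worst_rate = max(per_node_failure_rates.items(),
                                     key=lambda kv: kv[1])
        if worst_rate >= 0.9 and cluster_rate < 0.1:
            return f"single bad node ({worst_node}): pull it from rotation"
        if cluster_rate > 0.05:
            return "cluster-wide degradation: page someone"
        return "within normal noise"

    # With 18 boxes, one node at 90% already moves the cluster average by
    # roughly 5%, which is why the two signals blur together at small scale.
    rates = {f"node-{i:02d}": 0.01 for i in range(1, 19)}
    rates["node-07"] = 0.90
    cluster = sum(rates.values()) / len(rates)   # about 0.06
    print(classify(rates, cluster))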
Just what you want in a cloud provider, right?