May 2024 UniSuper incident: https://cloud.google.com/blog/products/infrastructure/detail...
https://www.unisuper.com.au/about-us/media-centre/2024/a-joi...
A joint statement from UniSuper CEO Peter Chun and Google Cloud CEO Thomas Kurian
8 May 2024
UniSuper and Google Cloud understand the disruption to services experienced by members has been extremely frustrating and disappointing. We extend our sincere apologies to all members.
While supporting UniSuper to bring its systems back online, Google Cloud has been conducting a root cause analysis.
Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events, where an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription.
This is described as an isolated, “one-of-a-kind occurrence” that has never before occurred with any Google Cloud client globally. This should not have happened. Google Cloud has identified the sequence of events and taken measures to ensure it does not happen again.
Why did the outage last so long?
UniSuper had duplication across two geographies as protection against outages and data loss. However, the deletion of the Private Cloud subscription triggered deletion across both geographies.
Restoring the Private Cloud required significant coordination and effort between UniSuper and Google Cloud, including recovery of hundreds of virtual machines, databases, and applications.
I wrote about the UniSuper issue at the time: https://danielcompton.net/google-cloud-unisuper. It was a pretty nasty bug where their VMware environment was created with a one-year expiry date, and the whole environment was a single "resource" from the perspective of Google Cloud.
"UniSuper’s production Google Cloud VMware Engine (GCVE) private cloud was automatically deleted one year after its creation due to a misconfiguration in how it was created. When it was created, there was a bug in the creation script which passed a null value."
That's pretty amazing. Not due to a cascading failure from someone changing a config deep inside of a system that caused a bunch of unintended effects, just someone who messed up writing a shell script?
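To make the failure mode concrete, here is a minimal sketch of how a null parameter can silently turn into a fixed expiry, and how explicit validation would reject it instead. This is not Google's actual tooling: `create_lenient`, `create_strict`, `term_days`, `no_expiry` and the 365-day default are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

DEFAULT_TERM_DAYS = 365  # hypothetical default, for illustration only


@dataclass
class PrivateCloud:
    name: str
    expires_on: Optional[date]  # None means "no automatic expiry"


def create_lenient(name: str, term_days: Optional[int], today: date) -> PrivateCloud:
    """Footgun: a missing (None) term silently falls back to a fixed default."""
    days = term_days if term_days is not None else DEFAULT_TERM_DAYS
    return PrivateCloud(name, today + timedelta(days=days))


def create_strict(name: str, term_days: Optional[int], today: date,
                  no_expiry: bool = False) -> PrivateCloud:
    """Safer: the caller must state the term or explicitly opt out of expiry."""
    if no_expiry:
        if term_days is not None:
            raise ValueError("pass either term_days or no_expiry, not both")
        return PrivateCloud(name, None)
    if term_days is None:
        raise ValueError("term_days is required unless no_expiry=True")
    if term_days <= 0:
        raise ValueError("term_days must be positive")
    return PrivateCloud(name, today + timedelta(days=term_days))
```

With the lenient version, `create_lenient("prod", None, date(2023, 5, 1))` quietly returns a resource that expires a year later; the strict version raises `ValueError` on the same input, so the mistake surfaces at creation time rather than a year on.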
Creating stuff with 1yr (implicit) expiry by default is just a delayed footgun tbh
"deletion of the Private Cloud subscription triggered deletion across both geographies"
It's called single point of failure, and it's the nightmare of everyone who was ever in charge of safety.
The instant cascading worldwide deletion upon closing or deleting a subscription sounds like a recipe for disaster. Why not mark it for deletion and delete say... a day or a week later?
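A rough sketch of what "mark it for deletion and delete later" could look like as a small state machine: the subscription stops serving right away, but data is only purged once a grace period has passed. Everything here (`Subscription`, `State`, `GRACE_PERIOD`, the seven-day window) is a hypothetical illustration, not how any particular cloud provider implements it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

GRACE_PERIOD = timedelta(days=7)  # hypothetical retention window


class State(Enum):
    ACTIVE = "active"
    PENDING_DELETE = "pending_delete"  # turned off, data retained
    PURGED = "purged"                  # data irreversibly removed


@dataclass
class Subscription:
    name: str
    state: State = State.ACTIVE
    purge_after: Optional[datetime] = None

    def request_delete(self, now: datetime) -> None:
        """Stop serving immediately, but keep data until the grace period ends."""
        if self.state is State.ACTIVE:
            self.state = State.PENDING_DELETE
            self.purge_after = now + GRACE_PERIOD

    def restore(self, now: datetime) -> bool:
        """Undo a pending delete; returns False once the window has closed."""
        if self.state is State.PENDING_DELETE and now < self.purge_after:
            self.state = State.ACTIVE
            self.purge_after = None
            return True
        return False

    def sweep(self, now: datetime) -> None:
        """Periodic job: purge only after the grace period has elapsed."""
        if self.state is State.PENDING_DELETE and now >= self.purge_after:
            self.state = State.PURGED
```

The outage starts at `request_delete` either way; what the grace period buys is that `restore` is a cheap state flip rather than a rebuild from backups.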
From personal experience, as a customer who once did something stupid: Google Cloud does soft deletes. But you need to reach out to support fast enough. And really, if you deleted something important and discovered it only the next day, and not within minutes, you're having a bigger issue that a soft delete won't solve.
It’s a good question. That said, unless there are compliance or fallback concerns, I would prefer a service that burns my data on departure.
No, that's the naive view
Because in the case of a compromise or unauthorized access, that's exactly what you don't want to happen.
> No, that's the naive view
No, not really. That's pretty basic stuff. You would do well to read up on the shared responsibility model. Customers are responsible for setting up their own infrastructure, and platform/service providers are only responsible for the services they manage. Even then, stuff like persisted data is still recoverable by design.
But you are absolutely responsible for the service you put together. This has been a basic principle for around two decades. Infrastructure-as-code tools have been ubiquitous for over a decade.
Either mark-for-delete has the same impact as deleting, in terms of shooting all the Cloud resources associated with the subscription, at which point the outage still happens but maybe the recovery is smoother, or you've just delayed the inevitable by a week because no one will look at it unless there is actual impact.
You just turn it all off. So yes, the disruption is the same, but restoration is much smoother. Much easier said than done - that has to be baked into every service, and there would certainly be a cost from it that would have to be passed along to everyone.
> The instant cascading worldwide deletion upon closing or deleting a subscription sounds like a recipe for disaster.
I don't agree. What do you expect to happen when you explicitly delete your user account? Do you expect your systems to remain in operation for a week? That itself would be a major risk and liability, as your whole infrastructure would still be up even though you cut your access to it.
Also, isn't your whole infrastructure expected to be automatically deployed with IaC? The notable exception is data, which is already soft deleted and recoverable through customer support.
All in all, where do you expect the customer's responsibility to end and the cloud provider's to start? The shared responsibility model is covered by any intro course in no uncertain terms.