> The key seems to be truly independent failure domains - something NORAD achieved through physical separation that's harder to replicate in cloud environments.
It's not hard to achieve separation in cloud environments. It's just expensive, and for most, the need doesn't justify the expense. Every once in a while you do run into surprises: it turns out all the clouds in (location) are located in the same building; or you paid for two separate fibre strands, but a backhoe has determined that both strands run in the same bundle; or your two redundant providers of whatever turn out to use each other to provide service.
I'm always in favour of simplifying things, and in our case complexity breeds further complexity. We are making software recursively: building software on software on software on software, and this has more inherent complexity than hardware alone. Why do we do this? Because it lets anyone sit in their bedroom and build something in no time at all, compared to the large groups of people with specialised, expensive hardware it would have required in the 1960s.
With more complexity comes more interesting failure scenarios. Simple redundancy handles problems where components outright fail, but software never really ... fails, in that sense. It always performs exactly to its design – but sometimes the design is inadequate. Replication and majority voting do not handle design flaws or interaction errors, which become increasingly common the more software-ified a product is.
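A minimal sketch of this point, as a hypothetical example (the function and values are invented for illustration): three "redundant" replicas all run the same flawed code, so a majority vote unanimously passes the wrong answer through. The voter only masks faults that hit replicas independently; a shared design flaw hits all of them at once.

```python
from collections import Counter

def convert_to_cents(price_str: str) -> int:
    # Shared (flawed) design: int() truncates instead of rounding, and
    # 19.99 * 100 is 1998.999... in binary floating point, so we get 1998.
    return int(float(price_str) * 100)

def majority_vote(results):
    # Return the value produced by the majority of replicas,
    # or fail if no value has a strict majority.
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority")
    return value

# Three replicas of identical code vote unanimously on the wrong answer.
# The fault is in the shared design, not in any one component, so
# replication masks nothing.
replicas = [convert_to_cents, convert_to_cents, convert_to_cents]
print(majority_vote([f("19.99") for f in replicas]))  # -> 1998, not 1999
```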
Thus we need more complicated ways to build fault tolerance into systems where we have to assume that the design is wrong somewhere, and we cannot just replicate ourselves out of the problem.