In safety-critical systems, we distinguish between accidents (actual loss: lives, equipment, and so on) and hazardous states. The equation is

hazardous state + environmental conditions = accident

Since we can only control the system, and not its environment, we focus on preventing hazardous states, rather than accidents. If we can keep the system out of all hazardous states, we also avoid accidents. (Trying to prevent accidents while not paying attention to hazardous states amounts to relying on the environment always being on our side, and is bound to fail eventually.)

One such hazardous state we have defined in aviation is "less than N minutes of fuel remaining when landing". If an aircraft lands with less than N minutes of fuel on board, it would only have taken bad environmental conditions to make it crash, rather than land. Thus we design commercial aviation so that planes always have N minutes of fuel remaining when landing. If they don't, that's a big deal: they've entered a hazardous state, and we never want to see that. (I don't remember if N is 30 or 45 or 60 but somewhere in that region.)
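
As a toy sketch of what such a check might look like in code (the names and the value of N here are placeholders, not anything from a real flight-planning system):

    N_MINUTES = 30  # placeholder; I don't remember the exact figure

    def entered_hazardous_state(fuel_remaining_at_landing_min: float) -> bool:
        # Flag the hazardous state ("landed with less than N minutes of fuel"),
        # not the accident.
        return fuel_remaining_at_landing_min < N_MINUTES

    # Every True here gets investigated, whether or not anything bad followed,
    # because the remaining distance to an accident is up to the environment.
    print(entered_hazardous_state(22.0))  # True -> investigate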

For another example, one of my children loves playing around cliffs and rocks. Initially he was very keen on promising me that he wouldn't fall down. I explained the difference between accidents and hazardous states to him in children's terms, and he slowly realised that he cannot control whether or not he has an accident, so it's a bad idea to promise me that he won't have one. What he can control is whether or not bad environmental conditions can lead to an accident, and he does that by keeping out of hazardous states. In this case, the hazardous state would be standing within a child-height of a ledge when there is nobody below ready to catch him. He can promise me to avoid that, and that satisfies me a lot more than a promise not to fall.

If you haven't done so: please write a book. Aim it towards software professionals in non-regulated industries. I promise to buy 50 to give to all of my software developing colleagues.

As for 'N', for turboprops it is 45, for jets it is 30.

I want to write more about this, but it has been a really difficult subject to structure. I gave up halfway through this article, for example, and never published it – I didn't even get around to editing it, so it's mostly bad stream of consciousness stuff: https://entropicthoughts.com/root-cause-analysis-youre-doing...

I intend to come back to it some day, but I do not think that day is today.

Just started reading the linked text after reading your comment and I agree, this is high quality education, and enjoyable. It's an art, really. Thank you for sharing your work and please keep it up.

Just a thought I had while reading your introduction: this is applicable even to running a successful business model. I'm honestly having trouble even putting it into words, but you have my analytical mind going now at a very late hour... Thanks!

Ok. I am impressed with your ability to take such complex subjects and make them plain; you are delivering very high quality here. The subject is absolutely underserved in the industry as far as I'm aware, and I would love to have a book that I can hand out to people working on software in critical infrastructure and life sciences to get them up to speed. The annoying thing is that software skills are valued much higher than the ability to accurately model the risks, because risk is only seen as a function of small choices standing by themselves. A larger, overall approach is very often what's called for, and it would help to have a tool in hand both to make that case and to give the counterparty the vocabulary and the required understanding of the subject in order to have a meaningful conversation.

Edit: please post your link from above as a separate submission.

Write it as a children's book. A literal ELI5.

(Knowing, of course, that it will still be read mainly by engineers. But that's the charm.)

I have a rather over-confident five year old, so would LOVE that book right now.

Your writing is good, please keep at it. I think it would help a lot if you made it clearer whether you're talking about root-cause analysis for software, aviation, other things, or generically.

Also, your train of thought runs pretty deep; the bulleting runs out of steam and gets visually confusing, and with the article's table of contents on the right-hand side you're only using less than 50% of the screen width. I'd suggest numbered/lettered lists, section headings, and using the full screen width.

Thanks, I would buy your book. But I understand the effort necessary all too well.

If he aims it toward five-year-olds, the way he explained it to his child, I bet it would be even more applicable to our profession.

Having spent some time with my five year old nieces and nephews, sometimes I wonder if five year olds could run companies better.

(note: obviously sarcastic but kids really do have some amazing insights that we forget when trying to chase KPIs or revenue)

See also: various points in the Evil Overlord list[0]. Selected examples:

    #12: One of my advisors will be an average five-year-old child. Any flaws in my plan that he is able to spot will be corrected before implementation.
    #60: My five-year-old child advisor will also be asked to decipher any code I am thinking of using. If he breaks the code in under 30 seconds, it will not be used. Note: this also applies to passwords.
    #74: When I create a multimedia presentation of my plan designed so that my five-year-old advisor can easily understand the details, I will not label the disk "Project Overlord" and leave it lying on top of my desk.
[0] https://tvtropes.org/pmwiki/pmwiki.php/Main/EvilOverlordList

I'd never seen that list before but it's hilarious!

Seconded.

That being said: I have, for some years now, started to read air accident board reports (depending on your locale, they may be named slightly differently). They make for a fascinating read, and they have made me approach debugging and postmortems in a more structured, more holistic way. They should be freely available from your transportation safety board's website (NTSB in America, BFU in Germany, ...)

Google’s SRE STPA starts with a similar model. I haven’t read the external document, but my team went through this process internally and we considered the hazardous states and environmental triggers.

https://sre.google/stpa/teaching

Disclaimer: currently employed by Google, this message is not sponsored.

Seconded! This was an extremely well written and well thought out explanation of this idea. Would love to read more along these lines.

(Will now be checking out your blog.)

Also check out the RISKS Digest:

https://catless.ncl.ac.uk/Risks/

> Trying to prevent accidents while not paying attention to hazardous states amounts to relying on the environment always being on our side, and is bound to fail eventually.

The reason they had less than 30 minutes of fuel was that the environment wasn't on their side. They started out with a normal amount of reserve, and then things went quite badly; the reserve was sufficient, but only just.

The question then is, how much of an outlier was this? Was this a perfect storm that only happens once in a century and the thing worse than this that would actually have exhausted the reserve only happens once in ten centuries? Or are planes doing this every Tuesday which would imply that something is very wrong?

This is why staying out of hazardous conditions is a dynamic control problem, rather than a simple equation or plan you can set up ahead of time.

There are multiple controllers interacting with the system (the FADEC computer in the engines, the flight management computer in the plane, pilots, ground crew, dispatchers, air traffic controllers, the people at EASA drafting regulations, etc.), trying to keep it outside of hazardous conditions. They do so by observing the state the system and the environment are in ("feedback"), running simulations of how it will evolve in the future ("mental models"), and making adjustments to the system ("control inputs") to keep it outside of hazardous conditions.
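
As a very rough sketch of one such controller's loop (purely illustrative; the names and numbers are invented, and real controllers are people and certified avionics, not a few lines of Python):

    # Observe (feedback), project forward (mental model), adjust (control input)
    # to stay out of the hazardous region. Everything here is made up.
    from dataclasses import dataclass

    FINAL_RESERVE_MIN = 30.0  # placeholder threshold: minutes of fuel at landing

    @dataclass
    class Observation:               # feedback from the system and environment
        fuel_min: float              # minutes of fuel on board
        time_to_landing_min: float   # estimate including current weather and holding

    def projected_fuel_at_landing(obs: Observation) -> float:
        """Mental model: naive projection of fuel remaining at touchdown."""
        return obs.fuel_min - obs.time_to_landing_min

    def control_input(obs: Observation) -> str:
        """Pick an adjustment *before* the hazardous state is entered."""
        if projected_fuel_at_landing(obs) < FINAL_RESERVE_MIN:
            return "divert to the alternate now"
        return "continue as planned"

    print(control_input(Observation(fuel_min=55, time_to_landing_min=40)))
    # -> "divert to the alternate now" (55 - 40 = 15 < 30)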

Whenever the system enters a hazardous condition, there was something that made these controllers insufficient. Either someone had inadequate feedback, or inadequate mental models, or the control inputs were inoperative or insufficient. Or sometimes an entire controller that ought to have been there was missing!

In this case it seems like the hazard could have been avoided any number of ways: ground the plane, add more fuel, divert sooner, be more conservative about weather on alternates, etc. Which control input is appropriate and how to ensure it is enacted in the future is up to the real investigators with access to all data necessary.

-----

You are correct that we will never be able to set up a system where all controllers are able to keep it out of hazardous states perfectly. If that were possible we would never have any accidents at all – we would only have intentional losses that are calculated to be worth the revenue from additional efficiency.

But by adopting the right framework for thinking about this ("how do active controllers dynamically keep the system out of hazards?") we can do a pretty good job of preventing most such problems. The good news is that predicting hazardous states is much easier than predicting accidents, so we can actually do a lot of this design up-front without first having an accident happen and then learning from it.
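
A lot of that up-front work can be as mundane as a table: for each hazard, which controllers are responsible, what feedback they need, and which control inputs they have. Something like this (entries invented, just to show the shape):

    # Invented entries, only to show the shape of an up-front hazard analysis.
    hazard_analysis = [
        {
            "hazard": "less than final-reserve fuel remaining at landing",
            "controllers": ["dispatcher", "pilots", "flight management computer"],
            "feedback": ["fuel quantity indication", "weather at destination and alternates"],
            "control_inputs": ["load extra fuel", "divert earlier", "declare minimum fuel"],
        },
        {
            "hazard": "aircraft closer together than separation minima",
            "controllers": ["air traffic control", "TCAS", "pilots"],
            "feedback": ["radar returns", "transponder reports"],
            "control_inputs": ["vectors", "altitude changes", "resolution advisories"],
        },
    ]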

> This is why staying out of hazardous conditions is a dynamic control problem

I don't think this philosophy can work.

If you can't control whether the environment will push you from a hazardous state into a failure state, you also can't control whether the environment will push you from a nonhazardous state into a hazardous state.

If staying out of hazardous conditions is a dynamic control problem requiring on-the-fly adjustment from local actors, exactly the same thing is true of staying out of failure states.

The point of defining hazardous states is that they are a buffer between you and failure. Sometimes you actually need the buffer. If you didn't, the hazardous state wouldn't be hazardous.

But the only possible outcome of treating entering a hazardous state as equivalent to entering a failure state is that you start panicking whenever an airplane touches down with less than a hundred thousand gallons of fuel.

My understanding is that the SOP for low fuel is that you need to declare a fuel emergency (i.e., "Mayday Mayday Mayday Fuel") once you reach the point where you will land with only reserve fuel left. The point OP was making is that the entire system of fuel planning is designed so that you should never reach the Mayday stage as a result of something you can expect to happen eventually (such as really bad weather). If you land with only reserve fuel, it is normally investigated like any other emergency.

Flight plans require you to look at the weather reports of your destination before you take off and pick at least one or two alternates that will let you divert if the weather is marginal. The fuel you load includes several redundancies to deal with different unexpected conditions[1] as well as the need to divert if you cannot land.
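
Roughly, the planned fuel load is the sum of several components. The sketch below is only indicative; the exact rules and percentages vary by regulator and operation type (see [1] for one regulator's guidelines):

    # Indicative only; actual requirements vary by regulator and operation.
    def planned_fuel_kg(taxi, trip, alternate, final_reserve, contingency_frac=0.05):
        """Sum of the usual fuel-planning components, all in kg."""
        contingency = contingency_frac * trip  # buffer for unexpected en-route conditions
        return taxi + trip + contingency + alternate + final_reserve

    # Example: the plan has to cover the trip, a diversion to the alternate,
    # a final holding reserve, and a contingency margin on top.
    print(planned_fuel_kg(taxi=200, trip=4000, alternate=900, final_reserve=1100))
    # -> 6400.0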

There have been a few historical cases of planes running out of fuel (and quite a few cases of planes landing with only reserve fuel), and usually the root cause was a pilot not making the decision to go to an alternate airport soon enough or not declaring an emergency immediately -- even with very dynamic weather conditions you should have enough fuel for a go-around, holding, and going to an alternate.

[1]: https://www.casa.gov.au/guidelines-aircraft-fuel-requirement...

Landing at an alternate location is significantly more expensive, so I assume Ryanair put pressure on its pilots to avoid that…?

We'll find out in the investigation, but "get-there-itis" is a very common condition amongst pilots and can lead to them delaying making decisions (such as going to alternates) so it's possible that this happened without explicit (or implicit) pressure from management.

That being said, the fact that (AFAICS) they first tried to divert to a closer airport where the weather was similar rather than an alternate with clear weather was probably one of the causes of this event.

That's very enlightening. I'm casually interested in traffic safety and road/junction designs from the perspective of a UK cyclist, and there's a lot to be learnt from the safety culture/practices of the aviation industry. I typically think in terms of "safety margins" whilst cycling (e.g. if a driver pulls out of a side road in front of me, how quickly can I avoid a collision by swerving or braking). I can imagine that hazardous states can be applied to a lot of the traffic behaviour at junctions.

Well said, will think about adopting this attitude with my child, seems very helpful