Hacker News

>Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.

>“A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained. This approach simplifies the complex task of the triplex “voting” mechanism that compares results.
>
>Instead of comparing three answers to find a majority, the system uses a priority-ordered source selection algorithm among healthy channels that haven’t failed-silent. It picks the output from the first available FCM in the priority list; if that module has gone silent due to a fault, it moves to the second, third, or fourth.
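
As a rough illustration of the priority-ordered selection described in the quote, here is a minimal Python sketch. The function name `select_source` and the use of `None` to mark a silent channel are assumptions made for this example, not details from the article.

```python
from typing import Optional, Sequence


def select_source(outputs: Sequence[Optional[float]]) -> Optional[float]:
    """Return the output of the first channel that has not failed silent.

    `outputs` is listed in priority order; None marks a silent channel.
    """
    for value in outputs:
        if value is not None:
            return value
    return None  # every channel is silent


# Hypothetical example: the first FCM has gone silent, so the second is used.
fcm_outputs = [None, 0.42, 0.42, 0.42]
selected = select_source(fcm_outputs)
```

Note that no outputs are compared across channels here: the selector trusts whichever healthy channel comes first, which is exactly why the question below matters.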

One part that seems omitted from the explanation is what happens if both CPUs in a pair, for whatever reason, perform the same erroneous calculation and their results match. How will that source be silenced without comparing its results against other sources?

I initially found this odd too. However, I think the catastrophic failure probability is the same as the prior system, and presumably this new design offers improvements elsewhere.

Under the 3-voting scheme, if 2 machines have the same identical failure -- catastrophe. Under the 4 distinct systems sampled from a priority queue, if the 2 machines in the sampled system have the same identical failure -- catastrophe. In either case the odds are roughly P(bit-flip) * P(exact same bit-flip).
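
To put rough numbers on that comparison, here is a back-of-envelope sketch. It assumes faults are independent between CPUs, and the values of `p` and `q` are made up for illustration; none of it comes from the article.

```python
def pair_undetected(p: float, q: float) -> float:
    """Both CPUs of one self-checking pair err (p each) and the errors match (q)."""
    return p * p * q


def triplex_undetected(p: float, q: float) -> float:
    """Approximate chance that some two of three voters err identically.

    There are three possible pairs among three voters; for small p the
    overlap between those cases is negligible, so they are simply summed.
    """
    return 3 * p * p * q


# Hypothetical per-cycle numbers, chosen only to show the scaling.
p = 1e-9  # chance a single CPU computes a wrong result
q = 1e-6  # chance a second wrong result is bit-for-bit identical

print(f"self-checking pair: {pair_undetected(p, q):.1e}")
print(f"triplex voting:     {triplex_undetected(p, q):.1e}")
```

Under these assumptions the two figures differ only by a small constant factor, which fits the "roughly the same" intuition.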

The article only hints at the improvements of such a system with the phrasing "simplifies the complex task", and I'm guessing this may reduce synchronization overhead or make the system easier to parallelize. But this is a pretty big guess, to be fair.

guai888 18 hours ago

These CPUs are typically implemented as lockstep pairs on the same die. In a lockstep architecture, both CPUs execute the same operations simultaneously and their outputs are continuously compared. As a result, the failure rate associated with an undetected erroneous calculation is significantly lower than the FIT rate of an individual CPU.

Put another way, the FIT (Failure in Time) value for the condition in which both CPUs in a lockstep pair perform the same erroneous calculation and still produce matching results is extremely small. That is why we selected and accepted this lockstep CPU design.
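
A software-level sketch of the self-checking idea, for illustration only: real lockstep cores compare outputs in hardware on every cycle, and the names `self_checking_pair`, `healthy`, and `upset` are hypothetical.

```python
from typing import Callable, Optional

Lane = Callable[[float], float]


def self_checking_pair(lane_a: Lane, lane_b: Lane, x: float) -> Optional[float]:
    """Run both lanes on the same input; return None (silent) on any mismatch."""
    a = lane_a(x)
    b = lane_b(x)
    return a if a == b else None


def healthy(x: float) -> float:
    return 2.0 * x


def upset(x: float) -> float:
    # Stand-in for a radiation-induced wrong result.
    return 2.0 * x + 1.0


ok = self_checking_pair(healthy, healthy, 3.0)    # lanes agree, value passes through
silent = self_checking_pair(healthy, upset, 3.0)  # mismatch, the pair goes silent
missed = self_checking_pair(upset, upset, 3.0)    # identical error slips through
```

The last case is the undetected failure discussed in this thread: the comparison only catches errors that differ between the two lanes.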

CubicalOrange 15 hours ago

the probability of a simultaneous cosmic-ray bit flip in 2 CPUs, in the same bit, is ridiculously low; they are probably more likely to get hit by a stray asteroid propelled by a solar flare.

but still, murphy's law applies really well in space, so who knows.

randomNumber7 11 hours ago

For errors due to radiation the probability is extremely low, since it would need to flip the same bit at the same time in two different places.

sippeangelo 11 hours ago

Then why 8 instead of 3?

randomNumber7 8 hours ago

They know their developers and engineers suck almost as hard as their management decisions so they added some more redundancy.

anordin95 6 hours ago

alfons_foobar 15 hours ago

I wondered about this as well.

OTOH, consider that the "pick the majority from 3 CPUs" approach that seems to have been used in earlier missions (as mentioned in the article) would fail the same way if two CPUs computed the same erroneous result.

FabHK 16 hours ago

Indeed. It seems like system 1 and 2 could fail identically, 3, 4, 5, 6, 7, 8 are all correct, and as described the wrong answer from 1 and 2 would be chosen (with a "25% majority"??).

themafia 18 hours ago

In the Shuttle they would use command averaging. All four computers would get access to an actuator which would tie into a manifold which delivered power to the flight control surface. If one disagreed then you'd get 25% less command authority to that element.
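
A toy model of that averaging, assuming each of the four channels contributes an equal share of the total command and that a dissenting channel contributes nothing. This is only a sketch of the arithmetic, not the Shuttle's actual hydraulics, and `net_command` is a made-up name.

```python
from typing import Sequence


def net_command(channel_commands: Sequence[float]) -> float:
    """Average the channels, each contributing an equal share of authority."""
    return sum(channel_commands) / len(channel_commands)


nominal = net_command([1.0, 1.0, 1.0, 1.0])   # all four channels agree
degraded = net_command([1.0, 1.0, 1.0, 0.0])  # one channel contributes nothing
```

With one of four channels dropping out, the net command falls from 1.0 to 0.75, which is the 25% loss of authority described above.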

JumpCrisscross 15 hours ago

> In the Shuttle they would use command averaging

I think the Shuttle, operating only in LEO, had more margin for error. Averaging a deep-space burn calculation is basically the same as killing the crew.

Cthulhu_ 11 hours ago

Sure, but these maneuvers aren't done in real time and aren't as time-sensitive; a burn is calculated and triple-checked well in advance. If there was an error, there's always time to correct it.

In the case of moon landings, the only truly time-critical maneuvers are the ones right before landing... and unfortunately, a lot of fairly recent moon probes have failed due to incorrect calculations, sensor measurements, logic errors, etc.

themafia 15 hours ago

The GNC loop runs several times per second. The desired output will consequently be increased by the working computers to achieve the target. The computer does not "dead reckon" anything.
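
To illustrate why a closed loop tolerates reduced authority, here is a minimal simulation. It assumes a simple proportional controller driving a first-order plant; the gain, time step, and function name `simulate` are invented for this sketch and say nothing about the real GNC software.

```python
def simulate(authority: float, steps: int = 200, gain: float = 2.0,
             dt: float = 0.1, target: float = 1.0) -> float:
    """Drive a state toward `target`; only `authority` of each command takes effect."""
    x = 0.0
    for _ in range(steps):
        command = gain * (target - x)   # recomputed every cycle from the current error
        x += authority * command * dt   # a degraded actuator delivers only part of it
    return x


full = simulate(authority=1.0)
degraded = simulate(authority=0.75)

# Early in the run the degraded system lags, but it is heading to the same target.
early_full = simulate(authority=1.0, steps=5)
early_degraded = simulate(authority=0.75, steps=5)
```

Because the command is recomputed from the measured error on every cycle, the loop keeps pushing until the target is reached; losing a quarter of the authority slows convergence in this model rather than changing where it ends up.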

Travelling through Max-Q in Earth atmosphere on ascent is far more dangerous.

JumpCrisscross 14 hours ago

> Travelling through Max-Q in Earth atmosphere on ascent is far more dangerous

Fair enough. I don't know enough about Orion's architecture to guess at propellant reserves, and how life-or-death each burn actually is.