No hardware failure is considered? No cosmic rays flipping bits? No soft or hard real-time guarantees are discussed? What about indeterminate operations that can fail such as requesting memory from some operating system dynamically?
I'm asking because I thought high integrity systems are generally evaluated and certified as a combination of hardware and software. Considering software alone seems pretty useless.
Specifications that are formally verified can definitely cover real-time guarantees, behaviour under error returns from operations like allocation, and similar things. Hardware failures can be accounted for in hardware verification, which is much like software verification: specification + hardware design = verified design; if the spec covers it, the verification guarantees it.
Considering software alone isn't pretty useless, nor is having the guarantee that "inc x = x - 1" will always go from an Int to an Int, even if it's not "fully right" at least trying to increment a string or a complex number will be rejected at compile time. Giving up on any improvements in the correctness of code because it doesn't get you all the way to 100% correct is, IMO, defeatist.
(Giving up on it because it has diminishing returns and isn't worth the effort is reasonable, of course!)
Hardware verification doesn't prevent hardware failures. There is a reason RAM comes with ECC. It's not because RAM designers are too lazy to do formal verification. Even with ECC RAM, bit flips can still happen if multiple bits flip at the same time.
There are also things like CPUs taking the wrong branch that occasionally happen. You can't assume that the hardware will work perfectly in the real world and have to design for failure.
Well of course hardware fails, and of course verification doesn't make things work perfectly. Verification says the given design meets the specification, assumptions and all. When the assumptions don't hold, the design shouldn't be expected to work correctly, either. When the assumptions do hold, formal verification says the design will work correctly (plus or minus errors in tools and materials).
We know dynamic RAM is susceptible to bit-flip errors. We can quantify the likelihood of it pretty well under various conditions. We can design a specification to detect and correct single bit errors. We can design hardware to meet that specification. We can formally verify it. That's how we get ECC RAM.
CPUs are almost never formally verified, at least not in full. Reliability engineering around systems too complex to verify, too expensive to engineer to never fail, or that might operate outside of the safe assumptions of their verified specifications, usually means something like redundancy and majority-rules designs. That doesn't mean verification plays no part. How do you know your majority-rules design works in the face of hardware errors? Specify it, verify it.
Designing around hardware failure in software seems cumbersome to insane. If the CPU can randomly execute arbitrary code because it jumps to wherever, no guarantees apply.
What you actually do here is consider the probability of a cosmic ray flip, and then accept a certain failure probability. For things like train signals, it's one failure in a billion hours.
> Designing around hardware failure in software seems cumbersome to insane.
Yet for some reason you chose to post this comment over TCP/IP! And I'm guessing you loaded the browser you typed it in from an SSD that uses ECC. And probably earlier today you retrieved some data from GFS, for example by making a Google search. All three of those are instances of software designed around hardware failure.
But you must drive a line somewhere.
If "a cosmic ray could mess with your program counter, so you must model your program as if every statement may be followed by a random GOTO" sounds like a realistic scenario software verification should address, you will never be able to verify anything ever.
I agree, you definitely won't be able to verify your software under that assumption; you need some hardware to handle it, such as watchdog timers (when just crashing and restarting is acceptable) and duplex processors like some Cortex-R chips. Or TMR.
An approach that has been taken for hardware in space is to have 3 identical systems running at the same time.
Execution continues while all systems are in agreement.
If a cosmic ray causes a bit-flip in one of the systems, the system not in agreement with the other two takes on the state of the other two and continues.
If there is no agreement between all 3 systems, or the execution ends up in an invalid state, all systems restart.
>Designing around hardware failure in software seems cumbersome to insane
I mean there are places to do it. For example ZFS and filesystem checksums. If you've ever been bit by a hard drive that says everything is fine but returns garbage you'll appreciate it.
Yet, big sites like Google or TikTok constantly deal with hardware failures everyday while keeping their services and apps running.
Even then you have other physical issues to consider. This is one of the things I love about the Space Shuttle. It had 5 computers for redundancy during launch and return. You obviously don't want to put them all in the same place so you spread them out among the avionics bays. You also obviously don't want them all installed in the same orientation so you install them differently with respect to the vehicles orientation. You also have a huge data network that requires redundancy and you take all the same steps with the multiplexers as well.
The best example on the Shuttle were the engine control computers. Each engine had two controllers, primary and backup, each with its own set of sensors in the engine itself and each consisting of a lock-step pair of processors. For each engine, the primary controller would use processors built by one supplier, while the backup would use processors of the same architecture but produced by an entirely different supplier (Motorola and TRW).
Today, even fairly standard automotive ECUs use dual-processor lock-step systems; a lot the the Cortex-R microcontrollers on the market are designed around enabling dual-core lock-step use, with error/difference checking on all of the busses and memory.
But what do you do when the primary and backup disagree?
This question is posed in Segal’s Law:
A man with a watch knows what time it is. A man with two watches is never sure.
https://en.wikipedia.org/wiki/Segal%27s_law
Requiring specialized hardware seems overly strict now that we can handle such things at a higher level via something like the fault tolerant lambda calculus.
Nowhere does the article claim that:
Formal verification is simply a method of ensuring your code behaves how you intend.Now, if you want to formally verify your program can tolerate any number of bits flip on any variables at any moment(s) in time, it will happily test this for you. Unfortunately, assuming presently known software methods, this is an unmeetable specification :)
For portable libraries and apps, there's only so much you can do. However, there are some interesting properties you can prove assuming the environment behaves according to a spec.
The article does consider hardware failure, yes.
Side channels? Is best out of 2 sufficient or is best out of 3 necessary?
From https://news.ycombinator.com/context?id=39938759 re: s2n-tls:
> [ FizzBee, Nagini, Deal-solver, Dafny; icontract, pycontracts, Hoare logic, DbC Design-by-Contract, invariants, parallelism and concurrency and locks, io latency, pass by reference in distributed systems, "FaCT: A DSL for Timing-Sensitive Computation" and side channels [in hw and software] https://news.ycombinator.com/item?id=38527663 ]
There are so many things to consider;
/? awesome-safety https://westurner.github.io/hnlog/#search:awesome-safety :
awesome-safety-critical: https://awesome-safety-critical.readthedocs.io/en/latest/
Hazard (logic) https://en.wikipedia.org/wiki/Hazard_(logic)
Hazard (computer architecture); out-of-order execution and delays: https://en.wikipedia.org/wiki/Hazard_(computer_architecture)
Soft error: https://en.wikipedia.org/wiki/Soft_error
SEU: Single-Event Upset: https://en.wikipedia.org/wiki/Single-event_upset
And then cosmic ray and particle physics