> Whenever I read about formal specs it always seems to me like “write the same tests just in a different way”, or worse, “write the same implementation but in a different way”. [...] Can anyone enlighten me?
A big difference is that formal methods allow you to use the "for all" quantifier.
For example, you might write a unit test that says "foo('abc') returns a string with no trailing whitespace".
But with formal methods, you can prove that "for any input x, foo(x) returns a string with no trailing whitespace".
This is a trivial example, but you could imagine something more complicated, such as "for any program P, compile(P) has the same behavior as P".
Of course, you have to define what "has the same behavior as" means!
>Of course, you have to define what "has the same behavior as" means
And that's really my issue, for example when you define "has no trailing whitespace", you are basically writing a piece of the implementation. Cover all behaviors, and you have basically re-implemented the function, no?
In other words, if I have the full formal spec of f(), isn't that the same thing as having f()?
> In other words, if I have the full formal spec of f(), isn't that the same thing as having f()?
In some cases, however quite often, the spec is much simpler. For instance, it's easy to say that after running sort on some list, that the result is sorted. However, it is very hard to come up with the algorithm to do that from the specification. Sometimes that is even a point. Bubble sort, quick sort, tim sort, we can go on and on. There's a huge number of different sorts that computer science have discovered over the years. They all should have the same result and so you should be able to prove they do the same thing. However, in the real world there are often reasons you would prefer one to another despite all meeting the same spec.
Another obvious example are cryptographic hash functions: if you have a function f(s) = h, you can trivially specify a function inverse_f(h) = s st f(s) = h, and if you can infer a non-brute-force algorithm for that, you’ve just inferred a cryptographical weakness!
I did some of this stuff in college and what bugged me was that the spec actually ended up more complex than the code and it had bugs.
That was a long time ago and they said that formal methods were the future back then, too.
It's possible for that to happen but probably means either the function is too trivial or you're missing some abstraction in the spec
It's also possible it's not the future of programming.
It never was. It's useful but extremely slow.
This makes a lot of sense, thank you!
I think the key is that, while you may think you have a full formal spec of f(), you actually do not. You have a program written in some language, and that language has its own spec, and the language is compiled to asm which has its own spec, and the asm executes on an architecture that has its own spec, and so on.
So when you write a function like:
You might think you have "fully specified" hypot, but this is far from true! You have said nothing about what registers will be used, for example. This is not a problem; quite the opposite. It's the whole point of using high-level languages: they let you focus on what you care about. A spec is just a program in a very-high-level language.Exactly.
You can prove that f(a,b) always returns a+b.
And then you overflow the int on that machine.
Oops.
I am not sure that the perspective you have taken is the same as what I understood from the parent post; what I took from it is that things like registers, memory locations, ways to implement square root, and so on, are all _implementation choices_ that are not important properties of the specification. You specify that the hypotenuse is the square root of the sum of squares, but whether square root is implemented using Newtonian approximation or a fast inverse square root is irrelevant; whether your first argument is passed in a register or on the stack is irrelevant.
Often, things like resource usage are not specified: running time, memory consumption, etc, aren't relevant enough to appear in a behavioural specification.
If your spec says "f(a, b) returns a + b", but it's just a high-level document you can use to help guide your implementation, integer overflow is just one of many ways your implementation might be inconsistent with the specification. It's still likely that the existence of a formal specification you reference during implementation means that more edge cases have been considered ahead of time than if you just had an informal spec.
If, on the other hand, you prove it but it turns out not to be true (ie, you overflow integers), your proof is wrong. If a machine verified your proof and gave you a big thumbs up, your machine verification is wrong.
If, in Idris, I write "f : (a : Nat) -> (b : Nat) -> (c : Nat * c = a + b)", then I cannot compile an implementation for which I can't show a proof that the result is _always_ the addition of a and b, for all a and b, unbounded by anything but the resources at hand with which to run the program. An implementation subject to integer overflow won't compile.
Or, I could write "f : (a : Bits32) -> (b : Bits32) -> (c : Bits32 * c = a + b)" and implement something where , but then modulo arithmetic on overflow is _part of the specification_, because "+" in there is doing the heavy lifting of being defined as addition modulo 2*32 already; by specification, 4 billion plus 4 billion is ~3.7 billion.
Perhaps I have misunderstood or was unclear.
Everything the parent comment mentioned were implementation details that did not affect the correctness of the code.
I just wanted to point out that there are implementation details that DO affect the correctness of the code.
And, of course programs need to run on multiple architectures. So it's hard to do what people seemed to be talking about in this thread and verify code just from the source code.
If you have the luxury of proving the correctness of the CPU, compiler and OS, that should be a big win. Otherwise, it seems to just be another type of testing. Still useful, but calling it verified or proven seems a bit much.
From my perspective, it seems more to be writing another, more complicated program, with more opportunities for bugs, and seeing if the results agree
Well, ideally you proof should've failed because your axioms should be aware that arithmetic operators in the programming language of your choice are operating modulo some large power of two, and thus don't produce correct results for some particular inputs.
Admittedly, this one peeves a lot. Remember when Java's binary search and mergesort (and implementations in many other languages' standard libraries) turned out to have a bug of this kind [0]? Admittedly, the proof was informal, but if you are trying to prove some properties of a program X written in a certain programming language Y, you can't "just assume" that the semantics of Y are different from what they are, right? The integers do overflow in Java, that's explicitly stated in the Java's spec... and that means that a lot of even the most simplistic code has some very non-obvious correctness preconditions, which most of the times, I believe, are simply hoped to be true.
[0] https://research.google/blog/extra-extra-read-all-about-it-n...
Non trailing whitespace means the string doesn't end with a space. But foo is a function that converts an AST to a string, that's totally different. Or it's a function that loops until \0 and changes all spaces to +
The spec should be a summary of what the impl is supposed to do. You'd want more than just doesn't end with whitespace of course.
> Non trailing whitespace means the string doesn't end with a space
No. And this is a great example of the problems with specifications. You still have to write a spec. And this, too, is subject to bugs.
What's wrong with the statement above is there are 17 space characters in Unicode and another eight whitespace characters, like newline.
If you try to verify that something ends in whitespace, you have to make sure you have the right definition.
Not picking on parent poster! It's just a great example of the fact that you can verify, but if what you are verifying is wrong, it doesn't help.
> there are 17 space characters in Unicode and another eight whitespace characters, like newline.
And of course, those 25 characters don't include ZERO WIDTH {SPACE,NON-JOINER,JOINER,NON-BREAKING SPACE} and WORD JOINER, which gives you yet another 5 arguably "it's kinda space, right" codepoints which definitely should not be trailing in any reasonable text string.
Unicode never ceases to amaze me.
> But with formal methods, you can prove that "for any input x, foo(x) returns a string with no trailing whitespace".
Isn't that essentially property testing?
Formal methods allow you to prove that it works for all inputs, and not just for the small subset that will be sampled by property testing
It’s a proof, not a successful experiment.
Property testing is stochastic, which may be fine, but only gives you a statistically (hopefully) high chance of discovering a problem. If you use something like SPARK/Ada, you can actually embed a proof in the code so that you actually know that the code is correct (for what you've proven). PBT scales better than embedding proofs, though, and is highly effective in practice along with fuzzing.
Unless you literally try every possible combination of inputs (which is usually infeasible), property testing can't give you mathematical guarantees about correctness. You can think of it as a halfway house between classic testing and formal verification:
Classic testing: A human comes up with some concrete example inputs for which they know the "right answers" (corresponding outputs). They write code that runs the code under test, gets its actual outputs, and compares them to the desired outputs.
Property testing: A human comes up with a precise way of randomly generating concrete example (input, desired output) pairs. They write some code to describe how to generate the pairs, often using a declarative DSL that describes only constraints on the inputs and outputs, with the understanding that anything not expressly forbidden is permitted, like "The input can be any list of between 0 and 100 integers each between -500 and 500" and "Every integer in the input must appear the same number of times in the output". They then write some more code (often a single line) to ask the computer to use this "spec" to randomly generate, say, 1000 such pairs, or as many pairs as can be checked in 1s. The computer generates the pairs itself, runs the code under test on each input and and checks its output matches the desired output.
Formal verification: A human comes up with a spec that typically describes conditions that must hold for all (input, output) pairs. This may look very similar to, or even exactly like the DSL used for property testing, though in general there are other conditions that can be expressed that cannot be checked with property testing even in principle -- for example, checking that the program always eventually terminates. The main difference is that the code under test is never actually run; instead, the computer analyses the source code itself to attempt mathematically prove that the stated conditions hold. How to actually accomplish this is a field of active research, but one basic approach is called "symbolic execution". To greatly simplify, if we forget about loops and conditionals for a moment, the idea is that we can write down things we know must be true after each statement executes, based on the things we knew must be true before it executed. So for example if x is a variable initially containing any integer (and we ignore overflow) then after the line
runs, we know that x >= 0. To handle conditionals like the prover "forks" into two cases: One in which we know for certain that x > 50, one in which we know for certain that x <= 50. At the end of the if statement it then has the task of recombining what is known about the two cases. In this example, the first case lets us conclude that x = 42 by the end, while the second case lets us conclude that x <= 50 by the end, so it could conclude that x <= 50 either way by the time execution reaches something_afterwards(x). Handling loops is trickier but generally involves looking for invariants.