I actually want to push back a bit on this. I don’t have the context on what the author has done, but proof engineering is a bit different from software engineering. In particular, if you get definitions right and if your plan actually makes sense mathematically (i.e. the proof is possible using your planned breakdown), you have a much more linear sense of progress. You don’t have “last minute surprises” which are only uncovered when the program runs. You do have surprises with modeling (hence my caveats above). But there’s also a much clearer sense that if the “contours” of the proof plan are correct, filling them in is much more parallelizable and more straightforward than with programming.

I think this is partially due to the language being pure (so there are no surprising side effects, global state, or “time”) and partially due to the “if it typechecks, you’re done” nature of proofs. Also, mathematics is (unsurprisingly) composable — more so than run-of-the-mill software development.
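To make the “if it typechecks, you’re done” point concrete, here’s a toy Lean example (made up for illustration, nothing to do with the formalization the article describes). Once the kernel accepts the term, there is no separate “run it and see what happens” step:

```lean
-- Toy example: once this typechecks, the theorem is verified.
-- There is no test suite or runtime behavior left to worry about.
theorem my_append_assoc {α : Type} (xs ys zs : List α) :
    (xs ++ ys) ++ zs = xs ++ (ys ++ zs) := by
  induction xs with
  | nil => rfl                                -- base case holds by computation
  | cons x xs ih => simp [List.cons_append, ih]  -- peel off the head, apply the IH
```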

The article contradicts this:

> There’s often a push-pull between these tasks. To give a recent example, suppose I want to prove that a function is associative (task 4). I realize that this theorem is false—oops—because of the underlying type that I've chosen (task 5). Maybe this is a problem I can avoid by decomposing other theorems differently (task 3). But in this case, I realize it’s a hard requirement, and I need to replace the current type with something else—for example, a finite map—and then refactor the whole formalization accordingly (task 2). I also need to check whether my theory is mathematically correct, or if I’ve made some conceptual mistake which renders important theorems unprovable (task 1).

> My most frustrating experiences with Lean have rarely involved proofs of individual theorems. More often, some previous decision has unexpectedly blocked me, and I’m facing a choice between several different complex refactors to fix it. This is what happened with the example above—in the end, I decided to rewrite everything with finite maps, which meant rewriting several definitions and proofs.

> “if it typechecks, you’re done” nature of proofs.

This resonates with me.

Software engineering is more art than math -- you have to deal with corner cases, exceptions, human interactions. So much of code encodes culture, preferences, storytelling, etc. It's a kind of "procedural epistemology": you are codifying -- without any guarantee of correctness -- the behaviors that you know.

Pure math doesn't work like that -- it is more declarative, built on a structure of correct statements. (Numerical math is a different animal -- closer to engineering -- but proofs are pure math.) It's far more limited, restrictive, and rigorous than imperative code.

It's like writing in Haskell. If it compiles in Haskell, the result might not be useful in real life, but there's a degree of consistency that languages like Python cannot enforce.

But even Haskell, for practical reasons, only aims at consistency rather than correctness. Proof assistants like Lean and Coq take the Curry-Howard correspondence all the way.
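To illustrate what "going all the way" means, here's a toy sketch (not tied to any particular formalization): under Curry-Howard, a proposition is a type and a proof is a program inhabiting it, so the implication below and the ordinary function next to it are literally the same term.

```lean
-- Curry-Howard in miniature: the proof of the implication and the plain
-- function-application combinator are the same term, at Prop vs Type.
theorem modus_ponens (p q : Prop) : (p → q) → p → q :=
  fun f hp => f hp

def apply' {a b : Type} : (a → b) → a → b :=
  fun f x => f x
-- Haskell's ($) :: (a -> b) -> a -> b has the same shape, but Haskell only
-- checks that the types line up; Lean's total logic makes it a real proof.
```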

I don't really see there being much of a difference between the modeling surprises and the last-minute surprises only uncovered when the program runs. If the gist of what you're saying is that an experienced proof engineer should have fewer modeling surprises than an experienced software engineer has last-minute code-execution surprises, then I can sort of agree. But the "contours" of a proof you're describing are basically analogous to decomposing an application into individual components (modules/functions) -- and if you do this and test your components, you should expect your software to have no "last-minute surprises" either, apart from modeling failures.

I guess the point I mostly disagree with is the feasibility of what you describe here:

> In particular, if you get definitions right and if your plan actually makes sense mathematically (i.e. the proof is possible using your planned breakdown), you have a much more linear sense of progress.

This presupposes that you know how the proof should be written, which means that you not only have an understanding of the proof mathematically (whatever your definition of "mathematically" is), but you ALSO know how to warp that understanding into a machine proof. The author notes that the "getting the definitions right" part was slower than writing out the definitions themselves, for a variety of reasons which ultimately meant they had to carefully comb through the definitions to make sure there were no logic errors.

So I don't really disagree with your overall point (which is that the author can probably be trusted to say they are actually at 50%), but I think you oversell the assurances granted by formal verification when compared to software engineering.

It’s not necessarily smooth sailing for sure. But in my (limited!) experience, there is a noticeable difference, so that’s what I wanted to note.

There indeed can be a gap between “it works in my head” and convincing the machine to do it, and sometimes the gap is wide. For example, the type-theoretic way of looking at things (as customary in Lean) might not match the idioms in the paper proof, or some crucial lemmas known to be true might never have been formalized. Or the tactics to prove some piece are getting too slow. Or you got the definitions wrong, or they’re annoying to deal with. So that is its own set of difficulties.

Even with that, the nature of difficulties feels different from much of software work to me. In software, I find that “divide and conquer” rarely works and I need to “grow” the program from a primitive version to a more sophisticated one. But in proofs, “divide and conquer” actually sort of works — you gain some assurances from making a high-level proof with holes in the middle, then filling in those holes with more intermediate steps, and so on. You can actually work “top down” to a meaningful extent more than in software.

To put it another way, with proofs, you can work top-down and bottom-up at the same time, making real progress while a bunch of steps are stubbed out. In programming, “stubbing out” non-trivial parts of the program is often annoying or impossible, and you can only make progress on a smaller (but complete) piece of it, or a less ambitious (but still complete) piece of it. This is because you can’t run an incomplete program, and there’s a set of knowledge you only gain from running it. But you absolutely can check an incomplete proof, with arbitrary granularity of incompleteness.
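To make the “holes” workflow concrete, here’s a toy sketch (the lemma names and statements are trivial stand-ins made up for illustration, not from any real development): the high-level argument typechecks immediately, while the stubbed-out pieces can be filled in later, in any order.

```lean
-- Hypothetical top-down skeleton: each `sorry` is a hole. Lean checks that
-- the overall plan fits together even though the pieces aren't proved yet.
theorem step_one (n : Nat) : n + 0 = n := by
  sorry  -- to be filled in later (or by someone else, in parallel)

theorem step_two (n : Nat) : 0 + n = n := by
  sorry  -- another independent hole

theorem main_goal (n : Nat) : (n + 0) + (0 + n) = n + n := by
  rw [step_one, step_two]  -- the top-level argument already typechecks
```

Lean will loudly warn that these declarations depend on `sorry`, but that’s the point: an incomplete proof is still checkable, at whatever granularity of incompleteness you like.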

So I agree with you it’s not necessarily linear. But it’s also not exactly like software, and analogies from software work don’t apply 1:1.

Your comment basically perfectly describes why Scala is a wonderfully pleasant language to work with. I could review code and know that if it had the right high level steps and if it compiled, it was probably right. When writing code you figure out the high level steps and then write whatever you need to make it type check. If you don't go out of your way to do something perverse, it's probably correct.

> You don’t have “last minute surprises”

Not sure about that. One of the steps in your plan may be surprisingly hard to prove. You're going to solve the easier steps in your plan first and then may get bogged down at 98%.

Sure, which is why the words you copied are followed by these in my comment:

>… which are only uncovered when the program runs. You do have surprises with modeling (hence my caveats above)

If you already have a paper proof (which is the case in the original post), and you know it’s correct, then you already know all steps are provable — but if they aren’t translated correctly into the theorem prover’s model, or if some details don’t translate well, or if you’re missing some existing result that hasn’t been formalized, you can get stuck. AI doesn’t make that worse though.

> If you already have a paper proof (which is the case in the original post), and you know it’s correct

You may think it's correct until you realize that your argument transitively depends on some obscure results from the 70s that nobody had ever questioned but that turned out to be flawed. Probably salvageable, but now you suddenly need to prove something completely different that you might not even understand very well. There's an interview with Kevin Buzzard somewhere where he talks about how that happened to them (couldn't find a reference rn).

To formalize: After a few days 90% works, after a few more days 90% of the remaining 10% works, after a few more days... Long story short, it never works.