This article’s framing and title are odd. The author, in fact, found no bugs or errors in the proven code. She says so at the end of the article:

> The two bugs that were found both sat outside the boundary of what the proofs cover. The denial-of-service was a missing specification. The heap overflow was a deeper issue in the trusted computing base, the C++ runtime that the entire proof edifice assumes is correct.

Still an interesting and useful result to find a bug in the Lean runtime, but I’d argue that doesn’t justify the title. Or the claim that “the entire proof edifice” is somehow shaky.

It’s important to note that this is the Lean runtime that has a bug, not the Lean kernel, which is the part that actually does the verification (aka proving). [1] So it’s not even immediately clear what this bug would really apply to, since obviously no one’s running any compiled Lean code in any kind of production hot path.

[1] https://lean-lang.org/doc/reference/latest/Elaboration-and-C...

Repeating myself, when we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.

If a buffer overflow causes the system to be exploited and all your bitcoins to be stolen, I don't think the fact that the bug being in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified as free of bugs.

Second, there was a bug in the code. Maybe not a functional correctness bug, but I, along with many and most end users, would consider a crashing program buggy. Maybe we just have different tastes or different standards on what we consider an acceptable level of software quality.

W.r.t people running Lean in production, you'd be surprised...

We're not speaking about bugs in a verified system so much as writing articles making specific claims about that. Surely if we're at the level of precision of formal verification, it's incumbent upon us to be precise about the nature of a problem with it, no? "Lean proved this program correct and then I found a bug" heavily implies a flaw in the proof, not a flaw in the runtime (which to my mind would also be a compelling statement, for the reasons you describe).

Or it implies a bug in the specification. The spec differing from the intent is a frequent source of bugs, it doesn't matter what language the spec is written in. Most people have experience with (or have seen news stories about) specification bugs in natural-language specifications: legal loopholes!

This is the biggest risk with the rejuvenated interest in formal proof. That LLMs can generate proofs is useful. Proof assistants that can check them (Lean/FStar/Isabelle/...) similarly so.

But it just moves the question to whether the theorems covered in the proof are sufficient. Underlying it all is a simple question:

Does the system meet its intended purpose?

To which the next question is:

What is the intended purpose?

Describing that is the holy grail of requirements specification. Natural language, behaviour-driven development, test-driven development and a host of other approaches attempt to bridge the gap between implicit purpose and explicit specification. Proof assistants are another tool in that box.

It's also one of the key motivators for iterative development: putting software in front of users (or their proxies) is still the primary means of validation for a large class of systems.

None of which is implied criticism of any of those approaches. Equally, none completely solves the problem. There is a risk that formal proofs, combined with proof assistants, are trumpeted as "the way" to mitigate the risk that LLM-developed apps don't perform as intended.

They might help. They can show that code is correct with respect to some specification, and that the specification is self-consistent. They cannot prove that the specification is complete with regards its intended purpose.

Sorry, I'm not sure I follow. We are talking about bugs in a verified system, that is, in this case, a verified implementation of a zlib-based compression tool. Did it have bugs? yes. Several in fact. I'd recommend reading the article for a detailed listing of the bugs in the tool.

You give the answers to this in the artcle:

> The most substantial finding was a heap buffer overflow! but, not in lean-zip's code, but in the Lean runtime itself. (emphasis mine)

> The OOM denial-of-service is straightforward: the archive parser was never verified. (me again)

"Lean proved this program correct" vs "I found bugs in the parts that Lean didn't prove was correct". The failure was not in Lean's proof (which as I said is heavily implied by the title), but in ancillary or unverified code.

Do I misunderstand how Lean works? I am by no means an expert (or even an amateur) on Lean or formal systems in general. Surely the first class of bug could be found in any code that uses the framework, and the second class of bug could happen to any system that isn't proven? Does that make sense? Otherwise where's the boundary? If you find a bug in the OS, does that mean Lean failed somehow? How about the hardware? If your definition of a 'formally verified system' goes beyond the code being verified and the most important thing is whether or not you can make it crash, then the OS and the hardware are also part of the system.

Of course bugs are important to users regardless of the cause, but the audience for your article seems to be software engineers and as a software engineer, your article was interesting and you found something useful, but the title was misleading; that's all I'm saying.

Further to my earlier reply - a more succinct way to look at it might be:

- When they fix the run time, bug A goes away. So the proof still holds and the zlib code is still correct.

- When they add a system of proofs for the parser and modify that, then bug B goes away. So the proof still holds and the zlib code is still correct; and now more of the library is proven correct.

The formulation of the title is "I was told X but that's not true"... but that's not true. You were told X, and X is true, but you found Y and Z which are also important.

What is the program?

There are two different answers to this question, and which one is "correct" depends entirely on the context of who is asking it.

1. It's the code that is specific to this program that sits above the run-time layer (internal view, that most programmers would take).

2. It's the code in the binary that is executed (external view, that most users would take).

The key question does not seem to be "was the proof correct", rather "did the proof cover everything in the program". The answer depends on whether you are looking at it from the perspective of a programmer, or a user. Given the overly strong framing that the article is responding to - highlighting the difference in this way does seem to be useful. The title is correct from the perspective that most users would take.

Yes but, without wishing to be snarky, did you read the article? There is no program as such, in either sense - the announcement from Lean only mentions "a C compression library" (zlib). Not only that, but since we're talking about formal verification, a programmer would likely understand that that is about proving a bounded, specific codebase at source code level, and not operating on a binary along with its associated dependencies (again caveat my limited understanding of these things).

My feeling is that if you told the average non-technical user that some person/organisation had produced a formally verified version of a C compression library, you would likely get a blank look, so I think it's reasonable to assume that both Lean's intended audience, and the audience of the blog post linked here, correspond with number 1. in your list.

The article describes fuzzing the library, this execution requires a program to be compiled. Typically fuzzing involves a minimal harness around the payload (a single call into the library in this case). There is clearly a bug in this program, and it does not exist in the minimal harness. It must be in the library code, which was covered by the proof.

The bounded, specific codebase that you refer to is typically the library *and all of its dependencies*, which in this case includes the Lean runtime. This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations. In this case the original gushing claim that everything was verified is incorrect and premature. The article seems like a good exposition of why.

Thank you, I understand what fuzzing is; that test harness was presumably provided either by the blog post author or generated by Claude somehow, and therefore would not have been part of the proven code, nor part of the original claim by the Lean devs. That's what I meant by saying there is no program as such.

> The bounded, specific codebase that you refer to is typically the library and all of its dependencies, which in this case includes the Lean runtime.

How does that work? I thought the main idea is to write code in the Lean language which has some specific shape conducive to mathematical analysis, along with mathematical proofs that operate on that code. How then does a system like this handle a third party dependency? I've searched around and I can't find any information about how it works. I assumed that the boundary of the proof was the source code - surely they can't also be proving things like, say, DirectX?

> This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations.

The difficulty is not in explosions of computational complexity due to problems of incompleteness, decidability, the halting problem, those kinds of things? As I said this is not something I know much about but it's surprising to me if 'analysing all the code' is really the difficult bit.

There are standard convex assumptions to handle incompleteness, decidability etc, i.e. the results are an over-approximation that terminates. Picking a approximation that is precise enough in the properties that you care about is part of the challenge, but it is an in-band problem. There are no hard edges between the theory and reality.

As with most engineering problems the out-of-band issues tend to be the hardest to solve. Models of the part underneath the interesting part need to be complete/accurate enough to make the results useful. Compare it to crypto where people do not usually try to break the scheme - they try to break the specific implementation of the scheme because the weakest points will be at the interface between the theoretical construction and the actual concrete instantiation of the device that it will run on.

Gentle reminder about this excerpt from HN Guidelines:

> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".

Noted, thank you

Say you study some piece of software. And it happens that it has an automated suite of tests. And it happens that some files aren't covered by the test suite. And you happen to find a bug in one of those files that were not covered.

Would you publish a blog post titled "the XXX test suite proved there was no bug. And then I found one"?

It would be a bit silly, right?

A test suite never proves anything except that the code works for the test cases in it.

But yes, it would be equivalent to stating "No tests failed but I found a bug" and one would correctly deduce that test coverage is insufficient.

These were my thoughts as well and it's nothing new, I think, regarding testing altogether:

- testing libraries (and in this case - language itself) can have bugs

- what is not covered by tests can have bugs

Additionally would add that tests verify the assumptions of coder, not expectations of the business.

To give benefit to the author - I'd read the article as: having tests for given thing ensures that it does the thing that you built the tests for. This doesn't mean that your application is free of bugs (unless you have 100% coverage, can control entire state of the system, etc) nor that it does the right thing (or that it does the thing the right way)

I like to differentiate between coverage by lines and semantic coverage: sometimes you need to exercise a single line multiple times to get full semantic coverage, and better semantic coverage usually beats larger line coverage for detecting problems and preventing regressions.

I think mutation testing helps in defining semantic coverage, if I understand what you're saying

You're technically right, but what things are versus how they're promoted or understood by most people (rightfully or not) often diverges, and therefore such "grounding" articles are useful, even if the wording addresses the perceived rather than the factual reality.

By way of analogy, if there was an article saying "I bought a 1Tb drive and it only came with 0.91 terabits", I think if you started explaining that technically the problem is the confusion between SI vs binary units and the title should really be "I bought a 0.91 terabit drive and was disappointed it didn't have more capacity", technically you'd be right, but people would rightfully eyeroll at you.

To be clear, I think the article was fine and the author did some useful work (finding a bug in the runtime of a supposedly provably correct system is indeed a valuable contribution!). I don't agree that it's pedantic to explain why the title feels like a bait-and-switch (and thus does a disservice to the article itself). It's just a bit of feedback for future reference.

I take some comfort from being technically correct though; it's widely accepted by all of us pedants that that is the best kind of correct ;-)

> when we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.

I agree, but it’s not fair to imply that the verification was incorrect if the problem lies elsewhere.

This is a nice example of how careful you have to be to build a truly verified system.

But is fair to state that the verification was *incomplete*, which is what the article does.

The problem was in Lean though, so it seems fair.

> Repeating myself, when we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.

Yes, and that would be relevant if this was a verified software system. But it wasn't: the system consisted of a verified X and unverified Y, and there were issues in the unverified Y.

The article explicitly acknowledges this: "The two bugs that were found both sat outside the boundary of what the proofs cover."

the good news I guess are

1/ lean-zip is open source so it's much easier to have more Claude's eyes looking at it

2/ I don't think Claude could prove anything substantial about the zip algorithm. That's what lean is for. On the other side, lean could not prove much about what's around the zip algorithm but Claude can be useful there.

So in the end lean-zip is now stronger!

> I don't think the fact that the bug being in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified as free of bugs.

Reminds of what some people in the Rust community do: they fight how safe this is or not. I always challenge that the code is composed of layers, from which unsafe is going to be one. So yes, you are righ to say that. Unsafe means unsafe, safe means safe and we should respect the meaning of those words instead of twisting the meaning for marketing (though I must say I heard this from people in the community, not from the authors themselves, ever).

Totally agreed. For instance that's why sel4 just throws the whole binary at the proof engine. That takes any runtime (minimal in sel4's case) and compiler (not nearly as minimal) out of the TCB.

Hi Kiran, thanks for following up. FWIW, I enjoy your blog and your work. And I do think it was a valuable bug you found; also nice to see how quickly Henrik fixed it.

Say more about people running Lean in production. I haven’t run into any. I know of examples of people using Lean to help verify other code (Cedar and Aeneas being the most prominent examples), but not the actual runtime being employed.

I took a quick scan of lean-lang.org just now, and, other than the two examples I mentioned, didn’t see a single reference to anything other than proving math.

I’m sure you’re in the Lean Zulup, based on what you’ve been up to. Are you seeing people talk about anything other than math? I’m not, but maybe I’m missing it.

Yes, here's a concrete example: https://github.com/leanprover/SampCert This is an implementation of a verified sampler, in lean. Not an embedding in some other language. The implementation itself is in lean, and a python ffi is used to call into the verified implementation. I don't know if AWS is big enough for your standards, but here is at least one example. Besides that, I'm more reporting on the general vibe I have observed from numerous talks at AI4maths workshops at Neurips, at the DARPA AI4Math ExpMath kickoff, etc. People are considering Lean as a serious programming language. Maybe that's surprising to the mathematicians, but as a PL person, I find the language really nicely designed and I can understand why people want to write in it.

You’re right, I should have been more careful in my reference to AWS. No need to be snarky about it.

Let me rephrase: aside from that one example from a couple years ago, I haven’t seen any examples of production code written in Lean. I’d be very interested in being proven wrong, this isn’t something I desire, just what I’ve observed. Have you seen any others?

More generally, you implicitly make a good point: writing important libraries in Lean and calling in from another language is probably the most likely use case. So not programs / apps / binaries written in Lean, but small critical components.

> if the software you were running was advertised as formally verified as free of bugs.

Nobody should be advertising that. Even ignoring the possibility of bugs in the runtime, there could also be bugs in the verifier and bugs or omissions in the specification. Formally verified never means guaranteed to be free of bugs.

As quoted in the article itself, please take it up with the chief architect of the Lean FRO:

> ... converted zlib (a C compression library) to Lean, passed the test suite, and then proved that the code is correct.

> Not tested. Proved. For every possible input. lean-zip

Yeah well he shouldn't. That looks like slop tbf.

Well then formally verify the language system. I’m not sure what the confusion is. They didn’t say the whole system is formally verified.

Yeah, the title made me think the author found a bug in the Lean kernel, thus making an invalid proof pass Lean's checks. The article instead uncovers bugs in the Lean runtime and lean-zip, but these bugs are less damning than e.g. the kernel, which must be trusted to be correct, or else you can't trust any proof in Lean.

When the Lean runtime has bugs, all Lean applications using the Lean runtime also have those bugs. I can’t understand people trying to make a distinction here. Is your intent to have a bug free application or to just show the Lean proof kernel is solid?? The latter is only useful to Lean developers, end users should only care about the former!

The intent is to have a proof of some proposition. The Lean runtime crashing doesn't stop the lean-zip developers from formally modelling zlib and proving certain correctness statements under this model. On the other hand, the Lean kernel having a bug would mean we may discover entire classes of proofs that were just wrong; if those statements were used as corollaries/lemmas/etc. for other proofs, then we'd be in a lot of trouble.

When I see a title transitioning from "Lean said this proof is okay" to "I found a bug in Lean", I'm intuitively going to think the author just found a soundness (or consistency) issue in Lean.

There are no Lean applications other than Lean. This is an important point most of the comments are missing. Lean is for proving math. Yes, you can use it for other things; but no, no one is.

Still good to have found, but drawing conclusions past “someone could cheat at proving the continuum hypothesis” isn’t really warranted.

A missing specification in the proof of lean-zip, a lean component, is a real problem to the philosophy and practice of software verification.

To illustrate, let's say you want to verify a "Hello world" program. You'd think a verification involves checking that it outputs "Hello, world!".

However, if a contractor or AI hands you a binary, what do you need to verify? You will need to verify that it does exactly print "Hello, world!", no more, no less. It should write to stdout not stderr. It shouldn't somehow hold a lock on a system resource that it can't clean up. It cannot secretly install a root-kit. It cannot try to read your credentials and send it somewhere. So you will need to specify the proof to a sufficient level of detail to capture those potential deviations.

Broadly, with both bugs, you need to ask a question: does this bug actually invalidate my belief that the program is "good"? And here you are pulling up a fact that the bug isn't found in the Lean kernel, which makes an assumption that there's no side-effect that bleeds over the abstraction boundary between the runtime and the kernel that affects the correctness of the proof; that safety guarantee is probably true 99.99% of the time - but if the bug causes a memory corruption, you'd be much less confident in that guarantee.

If you're really serious about verifying an unknown program, you will really think hard "what is missing from my spec"? And the answer will depend on things that are fuzzier than the Lean proof.

Now, pragmatically, there many ways a proof of correctness adds a lot of value. If you have the source code of a program, and you control the compiler, you can check the source code doesn't have weird imports ("why do I need kernel networking headers in this dumb calculator program?"), so the scope of the proof will be smaller, and you can write a small specification to prove it and the proof will be pretty convincing.

All in all, this is a toy problem that tells you : you can't verify what you don't know you should verify, and what you need to verify depends on the prior distribution of what the program is that you need to verify, so that conditional on the proof you have, the probability of correctness is sufficiently close to 1. There's a lesson to learn here, even if we deem Lean is still a good thing to use.

> A missing specification in the proof of lean-zip, a lean component, is a real problem to the philosophy and practice of software verification.

Every time someone makes this point, I feel obliged to point out that all alternatives to software verification have this exact same problem, AND many, many more.

I found the list of articles on this site amusing:

https://kirancodes.me/posts/log-who-watches-the-watchers.htm...

You can see the clickbaitiness increases over time.

Looks like a normal distribution about the chaos mean to me. I appreciate its not everyones cup of tea, but I like this style of writing.

To me, saying that there is a bug in the lean runtime means lean-zip has a bug is like saying a bug in JRE means a Java app that uses the runtime has a bug, even though the Java app code itself does have a bug. It seems like the author is being intentionally misleading about his findings.

No. It would be like finding a memory unsafe caused bug in a Java application that is due to a bug in the JRE. That would absolutely warrant a title like “I found memory unsafe bug in my Java code” when everyone expects Java code to be memory safe, which is analogous to the article in question.

I do not think you are completely grasping what you are talking about (what is a 'memory unsafe bug'?). Even in the example you give, that title would be literally wrong, as there will be no bug in your Java code; there would be a bug in the execution due to a deviation in the runtime executing your program.

I think it's ambiguous and fair game for the idea of answering the question "if we write programs in this manner, will there be exploitable bugs?

>I think it's ambiguous and fair game for the idea of answering the question "if we write programs in this manner, will there be exploitable bugs?

You're strawmanning the original authors' argument. The creator of lean-zip said that they proved there are no implementation bugs in the lean-zip program. A bug in lean-runtime does not contradict this claim.

> obviously no one’s running any compiled Lean code in any kind of production hot path

Ignorant question: why not? Is there an unacceptable performance penalty? And what's the recommended way in that case to make use of proven Lean code in production that keeps the same guarantees?

Yes, it isn’t performant. Lean isn’t a language for writing software, though you technically can; it’s a language for proving math.

Where are you coming up with this from? This is awfully confident for a fact you seem to have conjured up without evidence. As far as I am aware, Lean is interested in being used as a programming language (see: https://lean-lang.org/functional_programming_in_lean/) and people are deploying Lean in production: https://docs.aws.amazon.com/clean-rooms/latest/userguide/dif...

You’re right, there is that one example. Feels like we’re in exception that proves the rule territory. But I’d be very interested in being proven wrong! This isn’t a desire of mine, just what I’ve seen. Do you have other examples?

Also, part of my confidence comes from both having been a professional programmer for decades, across many languages, and also having programmed in Lean. It’s a great language for math, perhaps the best choice right now. But as a general purpose language it’s incredibly quirky.

> It’s important to note that this is the Lean runtime that has a bug, not the Lean kernel, which is the part that actually does the verification (aka proving). [1] So it’s not even immediately clear what this bug would really apply to

Well, Lean is written in Lean, so I am pretty sure a runtime bug like this could be exploited to prove `False`. Yes, since the kernel is written in C++, technically it's not the part affected by this bug, but since you cannot use the Lean kernel without the Lean compiler, this difference does not matter.

I think this must be wrong. If the kernel is free of bugs then it's not going to pass a proof of false no matter what the Lean compiler gets up to.

Sure, but are you worried about someone cheating on their arXiv submission by exploiting a buffer overflow? It’s a real bug, it’s just not very important.

> This article’s framing and title are odd.

It's called clickbait.

Misalignment of trust is a category of bug in my book; right next to logic error.

“‘Proven’ code turns out buggy, because of a bug outside the proven specification” is relatively common. It happened in CompCert: https://lobste.rs/s/qlrh1u/runtime_error_formally_verified