Sorry, I'm not sure I follow. We are talking about bugs in a verified system, that is, in this case, a verified implementation of a zlib-based compression tool. Did it have bugs? yes. Several in fact. I'd recommend reading the article for a detailed listing of the bugs in the tool.

You give the answers to this in the artcle:

> The most substantial finding was a heap buffer overflow! but, not in lean-zip's code, but in the Lean runtime itself. (emphasis mine)

> The OOM denial-of-service is straightforward: the archive parser was never verified. (me again)

"Lean proved this program correct" vs "I found bugs in the parts that Lean didn't prove was correct". The failure was not in Lean's proof (which as I said is heavily implied by the title), but in ancillary or unverified code.

Do I misunderstand how Lean works? I am by no means an expert (or even an amateur) on Lean or formal systems in general. Surely the first class of bug could be found in any code that uses the framework, and the second class of bug could happen to any system that isn't proven? Does that make sense? Otherwise where's the boundary? If you find a bug in the OS, does that mean Lean failed somehow? How about the hardware? If your definition of a 'formally verified system' goes beyond the code being verified and the most important thing is whether or not you can make it crash, then the OS and the hardware are also part of the system.

Of course bugs are important to users regardless of the cause, but the audience for your article seems to be software engineers and as a software engineer, your article was interesting and you found something useful, but the title was misleading; that's all I'm saying.

Further to my earlier reply - a more succinct way to look at it might be:

- When they fix the run time, bug A goes away. So the proof still holds and the zlib code is still correct.

- When they add a system of proofs for the parser and modify that, then bug B goes away. So the proof still holds and the zlib code is still correct; and now more of the library is proven correct.

The formulation of the title is "I was told X but that's not true"... but that's not true. You were told X, and X is true, but you found Y and Z which are also important.

What is the program?

There are two different answers to this question, and which one is "correct" depends entirely on the context of who is asking it.

1. It's the code that is specific to this program that sits above the run-time layer (internal view, that most programmers would take).

2. It's the code in the binary that is executed (external view, that most users would take).

The key question does not seem to be "was the proof correct", rather "did the proof cover everything in the program". The answer depends on whether you are looking at it from the perspective of a programmer, or a user. Given the overly strong framing that the article is responding to - highlighting the difference in this way does seem to be useful. The title is correct from the perspective that most users would take.

Yes but, without wishing to be snarky, did you read the article? There is no program as such, in either sense - the announcement from Lean only mentions "a C compression library" (zlib). Not only that, but since we're talking about formal verification, a programmer would likely understand that that is about proving a bounded, specific codebase at source code level, and not operating on a binary along with its associated dependencies (again caveat my limited understanding of these things).

My feeling is that if you told the average non-technical user that some person/organisation had produced a formally verified version of a C compression library, you would likely get a blank look, so I think it's reasonable to assume that both Lean's intended audience, and the audience of the blog post linked here, correspond with number 1. in your list.

The article describes fuzzing the library, this execution requires a program to be compiled. Typically fuzzing involves a minimal harness around the payload (a single call into the library in this case). There is clearly a bug in this program, and it does not exist in the minimal harness. It must be in the library code, which was covered by the proof.

The bounded, specific codebase that you refer to is typically the library *and all of its dependencies*, which in this case includes the Lean runtime. This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations. In this case the original gushing claim that everything was verified is incorrect and premature. The article seems like a good exposition of why.

Thank you, I understand what fuzzing is; that test harness was presumably provided either by the blog post author or generated by Claude somehow, and therefore would not have been part of the proven code, nor part of the original claim by the Lean devs. That's what I meant by saying there is no program as such.

> The bounded, specific codebase that you refer to is typically the library and all of its dependencies, which in this case includes the Lean runtime.

How does that work? I thought the main idea is to write code in the Lean language which has some specific shape conducive to mathematical analysis, along with mathematical proofs that operate on that code. How then does a system like this handle a third party dependency? I've searched around and I can't find any information about how it works. I assumed that the boundary of the proof was the source code - surely they can't also be proving things like, say, DirectX?

> This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations.

The difficulty is not in explosions of computational complexity due to problems of incompleteness, decidability, the halting problem, those kinds of things? As I said this is not something I know much about but it's surprising to me if 'analysing all the code' is really the difficult bit.

There are standard convex assumptions to handle incompleteness, decidability etc, i.e. the results are an over-approximation that terminates. Picking a approximation that is precise enough in the properties that you care about is part of the challenge, but it is an in-band problem. There are no hard edges between the theory and reality.

As with most engineering problems the out-of-band issues tend to be the hardest to solve. Models of the part underneath the interesting part need to be complete/accurate enough to make the results useful. Compare it to crypto where people do not usually try to break the scheme - they try to break the specific implementation of the scheme because the weakest points will be at the interface between the theoretical construction and the actual concrete instantiation of the device that it will run on.

Gentle reminder about this excerpt from HN Guidelines:

> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".

Noted, thank you

Say you study some piece of software. And it happens that it has an automated suite of tests. And it happens that some files aren't covered by the test suite. And you happen to find a bug in one of those files that were not covered.

Would you publish a blog post titled "the XXX test suite proved there was no bug. And then I found one"?

It would be a bit silly, right?

A test suite never proves anything except that the code works for the test cases in it.

But yes, it would be equivalent to stating "No tests failed but I found a bug" and one would correctly deduce that test coverage is insufficient.

These were my thoughts as well and it's nothing new, I think, regarding testing altogether:

- testing libraries (and in this case - language itself) can have bugs

- what is not covered by tests can have bugs

Additionally would add that tests verify the assumptions of coder, not expectations of the business.

To give benefit to the author - I'd read the article as: having tests for given thing ensures that it does the thing that you built the tests for. This doesn't mean that your application is free of bugs (unless you have 100% coverage, can control entire state of the system, etc) nor that it does the right thing (or that it does the thing the right way)

I like to differentiate between coverage by lines and semantic coverage: sometimes you need to exercise a single line multiple times to get full semantic coverage, and better semantic coverage usually beats larger line coverage for detecting problems and preventing regressions.

I think mutation testing helps in defining semantic coverage, if I understand what you're saying