Presumably the idea is that an agent generates a Lean4 specification against which the software is measured.

But then the Lean4 specification effectively becomes the software artifact.

And we're sort of back to square 1. How do you verify a Lean4 spec is correct (and that it describes what needs to be built in the first place) without human review?
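To make the worry concrete, here is a minimal Lean 4 sketch of a spec that type-checks and is still wrong. All names (`Sorted`, `badSort`, `SortSpec`) are hypothetical and made up for illustration, and it assumes a recent Lean 4 toolchain with no Mathlib. The spec only asks for sorted output, so a function that discards its input satisfies it.

```lean
-- Hypothetical names throughout; core Lean 4 only.

-- A list is sorted if every adjacent pair is in order.
def Sorted : List Nat → Prop
  | [] => True
  | [_] => True
  | a :: b :: rest => a ≤ b ∧ Sorted (b :: rest)

-- A "sort" that throws its input away.
def badSort (_ : List Nat) : List Nat := []

-- This proof is accepted: the output of `badSort` really is always sorted.
theorem badSort_sorted (xs : List Nat) : Sorted (badSort xs) := by
  simp [badSort, Sorted]

-- What was probably intended: the output must also contain the same
-- elements as the input, with the same multiplicities.
def SortSpec (sort : List Nat → List Nat) : Prop :=
  ∀ xs : List Nat, Sorted (sort xs) ∧ ∀ n : Nat, (sort xs).count n = xs.count n
```

Nothing in the tooling flags the gap between `Sorted` and `SortSpec`; someone has to notice that the first one does not say what was meant.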

> And we're sort of back to square 1.

Specifications are smaller than the full code, just as high-level code is smaller than the functionally equivalent assembly. As we ascend the abstraction ladder, the amount of reading a human needs to do decreases. I don't think this should really count as "back to square 1".

That has always been the perceived promise of higher-abstraction software specs: automated code generation from something higher-level, thus making programmers increasingly obsolete.

  binary => hexadecimal instructions
  hexadecimal => assembly language
  assembly => portable, "high-level" languages (C, FORTRAN, COBOL, etc.)
  HLLs => 3GLs (BASIC, C++, Pascal, Java, C#, JavaScript, etc.)
  3GLs => 4GLs/DSLs/RADs and "low-code/no-code"[0]

Among the RADs is Microsoft Visual Basic, which along with WinForms and SQL was supposed to make business programmers nearly obsolete, but instead became a new onramp into programming.

In particular, I'd like to highlight UML, which was supposed to make most programming obsolete by auto-generating code from object-oriented class diagrams.[1] The promise was that "business domain experts" could model their domain via visual UML tooling, and the codegen would handle it from there. In practice, UML-built applications became maintenance nightmares.

In every one of these examples, the artifact that people made "instead of programming" became the de facto programming language, needing to be maintained over time, abstracted, updated, consumed behind APIs, etc. -- and programmers had to be called in to manage the mess.

It's interesting that a Lean4 spec can be auto-generated, then used to generate code. My question is: what do you do when you have (a) consumers depending on a stable API, and (b) requests for new features? Maybe hand the job to Claude Code or a human developer with a suite of unit tests to guarantee API compatibility, but at that point we're back to an agent (LLM or human) doing the work of programming, with the Lean4 spec as the programming language being updated and maintained.

[0] https://en.wikipedia.org/wiki/Fourth-generation_programming_...

[1] https://news.ycombinator.com/item?id=26934795

You're touching on the fundamental "who watches the watchmen" problem in formal verification. But I think the framing slightly misses the key asymmetry: reviewing a Lean4 spec is dramatically easier than reviewing the implementation it constrains.

A formal spec in Lean is typically 10-50x shorter than the code it proves correct. More importantly, Lean's type checker is itself a small, trusted kernel (~10k lines) that has been scrutinized by the PL community for years. So you're not trusting the agent — you're trusting the kernel.

The practical workflow isn't "agent writes spec + code." It's: (1) a human writes the spec (the hard creative part), (2) an agent generates a proof that the code satisfies the spec, and (3) the Lean kernel mechanically checks the proof. The agent can hallucinate all it wants in step 2; if the proof doesn't typecheck, it gets rejected deterministically.

The real bottleneck is step 1: writing good specs requires domain expertise. But that's exactly where humans should stay in the loop. It's a much better division of labor than reviewing thousands of lines of generated code.
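A toy version of that division of labor might look like the following. This is only a sketch with hypothetical names (`MaxSpec`, `myMax`), assuming a recent Lean 4 toolchain where `omega` ships in core; real specs are far larger.

```lean
-- Step 1 (human): the spec. `f` must return one of its two arguments,
-- and the result must be at least as large as both.
def MaxSpec (f : Nat → Nat → Nat) : Prop :=
  ∀ a b, a ≤ f a b ∧ b ≤ f a b ∧ (f a b = a ∨ f a b = b)

-- The implementation under test (could be agent-written).
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- Step 2 (agent): a candidate proof.
-- Step 3 (kernel): checked when this file is compiled.
theorem myMax_meets_spec : MaxSpec myMax := by
  unfold MaxSpec
  intro a b
  unfold myMax
  by_cases h : a ≤ b
  · rw [if_pos h]; omega
  · rw [if_neg h]; omega

-- No proof of `MaxSpec (fun a _ => a)` can be accepted, however
-- plausible it looks, because that statement is false
-- (take a = 0, b = 1).
```

The reviewer only has to read `MaxSpec`. The proof script can be as ugly as the agent likes, since it is either accepted or it isn't.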

Does that mean your production code is written in Lean? Or do you translate code from some other language into Lean to verify it?

Also a very good question, btw. People do both. For some projects Lean is expressive and performant enough to use on its own (or to call into via the reverse FFI); other projects use a model of a real programming language like Rust. The disadvantage of the latter is that the Lean model of Rust has to be trusted.
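For the reverse FFI route, a rough sketch of the Lean side is below. The function and symbol names (`satAdd`, `sat_add_u32`) are hypothetical; the `@[export]` attribute is the real mechanism for giving a compiled Lean definition a fixed C symbol name.

```lean
-- Hypothetical example: a saturating 32-bit add, exported under a
-- stable C symbol so a C or Rust host can link against it.
@[export sat_add_u32]
def satAdd (a b : UInt32) : UInt32 :=
  -- Compare in `Nat` so the overflow check itself cannot wrap around.
  if a.toNat + b.toNat < UInt32.size then a + b else 0xFFFFFFFF
```

On the C side this would be declared as `uint32_t sat_add_u32(uint32_t, uint32_t);`, since `UInt32` is passed unboxed. The host also has to initialize the Lean runtime and the module before calling it. Any theorem proved about `satAdd` is about the same definition that gets compiled, though the compiler and runtime themselves still sit in the trusted base.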

Do you know if there are some resources or examples of this? Especially actual production stuff, not just side projects or proofs of concept?