Buried about 2/3rds of the way through the article: “I’m about 50% of the way through the formalization plan…”
As soon as I saw this, I became a skeptic. Anyone who has used AI tools has seen cases where the first 80% of a project came together like a bolt of lightning, but the last 20% is next to impossible for the AI to accomplish, even if it doesn’t seem more complex than the rest of the code. AI just struggles as the code gets larger and the requests get more specific. Anyway, never trust a programmer to accurately judge the “50%” mark. Once they are at 90%, they will need another 90% to cross the finish line.
I actually want to push back a bit on this. I don’t have the context on what the author has done, but proof engineering is a bit different from software engineering. In particular, if you get definitions right and if your plan actually makes sense mathematically (i.e. the proof is possible using your planned breakdown), you have a much more linear sense of progress. You don’t have “last minute surprises” which are only uncovered when the program runs. You do have surprises with modeling (hence my caveats above). But there’s also a much clearer sense that if the “contours” of the proof plan are correct, filling them in is much more parallelizable and more straightforward than with programming.
I think partially this is due to the language being pure (so there’s no surprising side effects, global state, or “time”) and partially due to the “if it typechecks, you’re done” nature of proofs. Also mathematics is (unsurprisingly) composable — more than run-of-the-mill software development.
> “if it typechecks, you’re done” nature of proofs.
This resonates with me.
Software engineering is more art than math -- you have to deal with corner cases, exceptions, human interactions. So much of code encodes culture, preferences, storytelling, etc. It's kind of a "procedural epistemology". You are codifying -- without correctness -- the behaviors that you know.
Pure Math doesn't work like that -- it is more declarative and is built on a structure of correct statements. (numerical math is a different animal -- it's closer to engineering -- but proofs are pure math). It's way more limited and restrictive and rigorous than imperative code.
It's like writing in Haskell. If it compiles in Haskell, the result might not be useful in real life, but there's a degree of consistency that languages like Python cannot enforce.
But even Haskell for practical reasons only aims at consistency rather than correctness. Proof assistants like Lean and Coq go all the way in Curry-Howard correspondence.
I don't really see there being much of a difference between the modeling surprises and last-minute surprises only uncovered when the program runs. If the gist of what you're saying is that an experienced proof engineer should have fewer modeling surprises than an experience software engineer has last-minute code execution surprises, then I can sort of agree. But the "contours" of a proof you're describing are basically analogous to decomposing an application into individual components (modules/functions) --- and if you do this and test your components you should expect your software to not have any "last minute surprises" except for modeling failures as well.
I guess the point I mostly disagree with is the feasibility of what you describe here:
> In particular, if you get definitions right and if your plan actually makes sense mathematically (i.e. the proof is possible using your planned breakdown), you have a much more linear sense of progress.
This presupposes that you know how the proof should be written, which means that you not only have an understanding of the proof mathematically (whatever your definition of "mathematically" is), but you ALSO know how to warp that understanding into a machine proof. The author notes that the "getting the definitions right" part is slower than them writing out the definitions themselves for a variety of reasons which ultimately mean they had to carefully comb through the definitions to make sure there were no logic errors.
So I don't really disagree with your overall point (which is that the author can probably be trusted to say they are actually at 50%), but I think you oversell the assurances granted by formal verification when compared to software engineering.
It’s not necessarily smooth sailing for sure. But in my (limited!) experience, there is a noticeable difference, so that’s what I wanted to note.
There indeed can be a gap between “it works in my head” and convincing the machine to do it, and sometimes the gap is wide. For example, type-theoretic way of looking a things (as customary in Lean) might not match the idioms in the paper proof, or some crucial lemmas known to be true might have never been formalized. Or the tactics to prove some piece are getting too slow. Or you got the definitions wrong or they’re annoying to deal with. So that is its own set of difficulties.
Even with that, the nature of difficulties feels different from much of software work to me. In software, I find that “divide and conquer” rarely works and I need to “grow” the program from a primitive version to a more sophisticated one. But in proofs, “divide and conquer” actually sort of works — you gain some assurances from making a high-level proof with holes in the middle, then filling in those holes with more intermediate steps, and so on. You can actually work “top down” to a meaningful extent more than in software.
To put it another way, with proofs, you can work top-down and bottom-up at the same time, making real progress while a bunch of steps are stubbed out. In programming, “stubbing out” non-trivial parts of the program is often annoying or impossible, and you can only make progress on a smaller (but complete) piece of it, or a less ambitious (but still complete) piece of it. This is because you can’t run an incomplete program, and there’s a set of knowledge you only gain from running it. But you absolutely can check an incomplete proof, with arbitrary granularity of incompleteness.
So I agree with you it’s not necessarily linear. But it’s also not exactly like software, and analogies from software work don’t apply 1:1.
Your comment basically perfectly describes why Scala is a wonderfully pleasant language to work with. I could review code and know that if it had the right high level steps and if it compiled, it was probably right. When writing code you figure out the high level steps and then write whatever you need to make it type check. If you don't go out of your way to do something perverse, it's probably correct.
Not sure about that. One of the steps in your plan may be surprisingly hard to prove. You're going to solve the easier steps in your plan first and then may get bogged down at 98%.
Sure, which is why the words you copied are followed by these in my comment:
>… which are only uncovered when the program runs. You do have surprises with modeling (hence my caveats above)
If you already have a paper proof (which is the case in the original post), and you know it’s correct, then you already know all steps are provable — but if they aren’t translated correctly into the theorem prover’s model, or if some details don’t translate well, or if you’re missing some existing result that hasn’t been formalized, you can get stuck. AI doesn’t make that worse though.
> If you already have a paper proof (which is the case in the original post), and you know it’s correct
You may think it's correct until you realize that your argument transitively depends on some obscure results from the 70s that nobody had ever questioned but that turned out to be flawed. Probably salvageable, but now you suddenly need to prove something completely different that you might not even understand very well. There's an interview with Kevin Buzzard somewhere where he talks about how that happened to them (couldn't find a reference rn).
To formalize: After a few days 90% works, after a few more days 90% of the remaining 10% works, after a few more days.... long story short, it never works.
Seeing this in consulting/ contract dev too. People come to me with “80% done” but in reality it’s the 80/120 rule, the last 20% takes 120% of the time the project should have taken.
> I’m about 50% of the way through the formalization plan
> I think formalizing using AI was way slower than if I’d done it by hand.
combined with their aversion to learning something hard alone:
> I found learning Lean annoying and difficult. So I started looking for shortcuts.
The author failed utterly according to their own definition of success.
Moreover, how will they know they have completed the task if they don't want to fully understand what they've 'written'?
The paper presents a false trichotomy:
(1) go to grad school and learn to do it yourself
(2) hope one of the few theorem proving experts got interested in your problem
(3) give up.
and the magical fourth:
(4) Look Just Try Claude Code (at just $100 per month!)
"got interested" is so utterly passive, I've got a fifth option.
(5) Meet other people who share your interests on discord and forums and chat about the problems you are trying to solve. Make them interested! Nerd snipe them! Make friends, meet colleagues and learn new things while having a genuine human experience, for free!
My experience points to that until the last sorry is eliminated the math might look like it will fit together, but yet not ultimately all fit together.
> Anyone who has used AI tools has seen cases where the first 80% of a project came together like a bolt of lightning, but the last 20% is next to impossible for the AI to accomplish
Offtopic: doing the inverse works better.
If you rely on AI to write 80% of your code, you end up with code you don't understand, so the last 20% is really difficult. You can't even guide the AI correctly at that point.
I actually do the opposite. I write 80% of my code, have an LLM read it, and it helps me iterate faster on the last 20% (which are mostly tedious things like unit tests, corner cases, optimizations, etc). This works surprisingly well because now the AI has a structure to start from, and it's only doing incremental code completion. As a human, I'm also not that great at detecting corner cases -- this is where an LLM really helps because it has seen patterns I've never seen.
In fact, I use coding agents to combine multiple sources of information (not just code) like design docs, LaTeX, etc. and I ask the LLM to detect if the code actually implements the math in the LaTeX correctly. Claude actually does a pretty decent job at that.
I don't really know how to answer that, it's a highly fragmented legacy codebase that is responsible for many millions in revenue.
I do know that given my experiences though it's time to sound the alarms as many others have.
We're clearly in an emergency situation, it's frustrating to see this ignored due to various mechanisms some of which are huge geopolitical incentives to pump out propaganda that ai is safe and we shouldn't slam the brakes, which would cause the us to lose to China or vice versa.
Every nation state actor is trying tirelessly to attenuate panic but it's here, its really happening.
I will quote (approximately) Simon Peyton Jones: "writing programs is not hard. Specifying what a program should do is hard."
Prove what? The author is well versed in the problem of specifications as the aliens-demand-gcc-correctness-or-else post. I also enjoyed the other post where they say "no one cares about correctness" and "people care about compliance".
It is safe to say, most people are not experts and will never become experts.
Being proficient in the business of coming up with suitable specification that can be implemented and getting assurance that an implementation meets the spec will most likely need all the kind of formal training for humans that the AI hype industry assures us is no longer necessary.
So it doesn't much help that LLMs are good at manipulating the tokens when used by experts, at least not in a big way. It can be hoped that they change cost-benefit balance in good ways, but the bigger problems of the industry will simply persist.
What would need to change for a more fundamental shift is getting a lot more people to understand the value of specifications. This is an education problem and paradoxically I could see AI helping a lot... if only there was enough interest in using AI to help humans become experts.
This change is happening as we speak. My work repo is now sprouting with markdowns primarily intended for agentic consumption, but it’s all dense jargon communicating requirements and important design decisions in a way which is useful to me as a software engineer. The toil of writing these specs from low level code is much lower today, because LLMs also help with that immensely.
In my workplace happens this, but in a bad way. There's a push to make use of AI as much as we can to "boost productivity", and the one thing people don't want to do is write documentation. So what ends up happening is that we end up with a bunch of AI documentation that other AIs consume but humans have a harder time following because of the volume of fluff and AI-isms. Shitty documentation still exists and can be worse than before...
Other than humans getting apoplectic at the word "delve" and — emdashes, can you explain and give some examples or say more about how AI-isms hurt readability?
Having encountered this spread across our orgs greenfield codebases which made heavy use of AI in the last 90 days: Restating the same information in slightly different formats, with slightly different levels of detail in several places, in a way that is unnecessary. Like a "get up and running quickly" guide in the documentation which has far more detail than the section it's supposed to be summarizing. Jarringly inconsistent ways of providing information within a given section (a list of endpoints and their purposes, followed by a table of other endpoints, followed by another list of endpoints). Unnecessary bulleted lists all over the places which could read more clearly as single sentences or a short paragraph. Disembodied documentation files nested in the repos that restate the contents of the README, but in a slightly different format/voice. Thousands of single line code comments that just restate what is already clear to the reader if they just read the line it's commenting on. That's before getting into any code quality issues themselves.
I've noticed AI generated docs frequently contain bulleted or numbered lists of trivialities, like file names - AI loves describing "architecture" by listing files with a 5 word summary of what they do which is probably not much more informative than the file name. Superficially it looks like it might be useful, but it doesn't contribute any actually useful context and has very low information density.
A piece of information, or the answer to a question, could exist in the documentation but is not in a format that's easily readable to humans. You ask the AI to add certain information, and it responds with "I already added it". But the AI doesn't "read" documents the way humans read.
For instance, say you need urgent actions from other teams. To this effect you order an AI to write a document and you give it information. The AI produces a document following it's own standard document format with the characteristic AI fluff. But this won't work well, because upon seeing the urgent call for action the teams will rush to understand what they need to do, and they will be greeted by a corporate-pr-sounding document that does not address their urgent needs first and foremost.
Yes, you could tell the AI how to make the document little by little... but at that point you might as well write it manually.
There is a place for informal, prose specs, and I can easily agree that more people are nowadays describing their programs in English.
The context here is formal specs though - adequately and precisely capturing the intended meaning (semantics) in a way that lends itself to formal verification.
Interactive theorem proving is interactive because proof search is intractable; a human articulates the property they want to prove and then performs the "search". Apart from the difficulties of getting to that proof, it can also happen that all goes through and you realize that the property is not exactly what you wanted.
I think you're missing a more internecine debate within software correctness/formal methods circles than the more general one about "writing good/correct software". Deductive theorem proving was all the rage in the 1970s, to the point that prominent people (Tony Hoare, Dijkstra) believed that it's the only way to write correct software. Over the years, the focus has shifted to other methods [1], from model-checking to unsound methods such as concolic testing and even property-based testing and code reviews, which have proven more effective than the deductive proof proponents had envisioned (Tony Hoare admitted his mistake in the nineties).
The problem with deductive proofs is that it's all-or-nothing, i.e. either you prove the theorem or you don't, and it's much harder to turn a knob to trade off cost for confidence and vice-versa than with other methods. Consequently, theorem proving has not benefitted from automation as much as it's alternatives, and is now viewed as having a too low bang-for-the-buck. But if LLMs can help automate deductive theorem proving, maybe that particular approach, which has gradually become less popular within formal methods circles but still has its fans, could make a comeback.
Of coure, specification is equally important in all methods, but the particular method of theorem proving has been disadvantaged compared to other formal and semi-formal methods, and I think that's what the author wishes could change.
[1]: Deductive theorem proving may be more prominent on HN than other formal methods, but as with most things, HN is more representative of things that spark people's imagination than of things that work well. I'm not counting that as a strike against HN, but it's something to be aware of. HN is more Vogue/GQ than the Times, i.e. it's more news about what people wish they were doing than what they're actually doing.
> The problem with deductive proofs is that it's all-or-nothing, i.e. either you prove the theorem or you don't, and it's much harder to turn a knob to trade off cost for confidence and vice-versa than with other methods.
This is not really true though. Sound type checking is a kind of deductive proof, but it's practically usable because it establishes very simple properties and is generally "local", i.e. it follows the same structure as program syntax.
Well, that's true for simple type systems. Dependent types are not so practically usable. You're absolutely right that simple types prove some "simple" properties, but we know what characterises them: they're inductive (or composable) invariants, i.e. P(a ∘ b) ⇔ P(a) ∧ P(b), for some program composition operator ∘ (I'm papering over some important but nuanced details). It's just that most program correctness properties we're interested in are not inductive, so the "proving power" of simple types is not strong (another way of describing it is that it's not powerful enough to express propositions with interleaved universal and existential quantifiers).
So when we talk about formal methods, we're usually interested in those that can express and/or "verify" (to some degree of confidence), non-composable properties and propositions that involve interleaved quantifiers.
>writing programs is not hard. Specifying what a program should do is hard.
But these two things are kinda the same thing. Writing a program is mostly just specifying what it should do in a very specific, unambiguous way. That’s why we have programming languages and don’t use English.
A PRD, a design doc and the code itself are kind of all the same thing, but with increasing levels of specification.
If you’ve ever played with formal specifications at some point you probably had the realization “wait a minute, I’m just writing another implementation of the program!” and in a way you are.
Spoken language descriptions of how to do something are incredibly underdetermined. Many people are likely familiar with the meme of following exact instructions on making a PB&J where a phrase like “put a knife in your hand and spread the peanut butter” can result in stabbing yourself and rubbing peanut butter on the wound. More than half of my job is helping other people know or otherwise determining what they really want a program to do in the first place.
"The program should accept one parameter as a number, and output this number multiplied by two" is a spec.
input a
print a*2
is a program.
But: why do that? why multiply by two? what does two represent? what do the input and output mean in the real world? what part is likely to change in the near future? etc.
Wrting the program isn't harder than writing the spec. Analyzing the problem and coming up with solutions is the hard part.
(And yes, LLMs are getting better at this everyday, but I think they still have a very long way to go.)
> What I’ve found is that given a tool that can detect mistakes, the agent can often correct them
This is the most important line of the entire article
When iterating on a Manifesto for AI Software Development (https://metamagic.substack.com/p/manifesto-for-ai-software-d...) over the last two years the key attribute more than any other that I found was empirical validation. While AI (and humans) are not able to accurately judge their own work when we give AI (and human) the ability to do empirical validation its success skyrockets. This might be intuitive, but there are still papers testing that this applies to AI too. While reaching to have the AI write unit tests I've been embracing fuzzing because then AI can't cheat with bonus tests. The idea of reaching back to school and using interactive theorem proving didn't even cross my mind and now that it has been presented it is a whole paradigm shift on how to push my AI use forward so it can work even more autonomously.
AI can iterate at speeds humans can't. When it has even basic empirical validation (building the code, running tests) it removes the human from the loop. Move that to using fuzzing (such as with golang) and you get way better coverage and way better progress before a human has to intervene. So it isn't a surprise that interactive theorem proving is a perfect match for AI.
It is interesting how this same lesson plays out elsewhere, earlier in the article
> Why is ITP so hard? Some reasons are quite obvious: interfaces are confusing, libraries are sparse, documentation is poor, error messages are mysterious. But these aren’t interesting problems
Remember when llvm got really good c++ error messages and it was life changing? High quality error messages means we could find/fix the error fast and iterate fast. These are actually the MOST interesting problems because it enable the user to learn faster. When a user has high success they will use a product again and again. High quality error messages in all tools will enable Claude code to be able to work longer on problems without human intervention, make less mistakes and overall work faster.
While error messages should always be good a new question that really hammers this home is "When AI encounters this error message, can it fix the problem?"
It’s actually extremely good bang for buck now to invest more time in various linters or static checking that enforce more correctness / project-specific rules.
Not only does the AI help you write the linter in 10% of a time - linters are usually fairly short scripts that are easy to test in practice, so the happy path for AI).
But asking the AI “make this linter pass” after you’ve e.g. done some extensive refactors, is very effective and time-saving.
I think everyone who has heard of interactive theorem provers saw that this is going to be a match made in heaven.
LLMs reduce implementation effort but pile on massive review effort. Interactive theorem provers reduce review effort but the implementation is much more work.
Together you get the best of both worlds at the speed of machine thought.
Combining generative ML (and other heuristic methods) with formal methods is the most promising way forward for systems design. Without formal methods (and other constraints such as limiting systems complexity) we are going to vibe-code ourselves to chaos.
> I find this very surprising, and you probably should too.
Why? My first assumption would be that Claude Code would be exceptionally good at writing Lean proofs, for many of the reasons given in the article. It's been on my todo list for a while to try out some of my coding workflows on proof writing. I'm glad OP did.
But why does he find this surprising? That wasn't laid out.
EDIT: To be clear, I'm talking about this bit:
> I think this is part of why Claude Code is surprisingly good at theorem proving. Lean is very strict in the programs it will accept. This makes writing Lean arduous for humans, but it means an AI agent gets detailed feedback.
I first encountered this in writing Rust programs, first with Cursor (using anthropic models), and now with Claude Code. Things have gotten better, but initially Rust was not very well represented in the training sets used, and the agents would be horrendous beginner mistakes. Writing Javascript-like code with Rust-ish syntax, and expecting it to work.
But the rustc compiler gives very good errors, with context and suggestions for fixing. In two or three iterations it would fix all the mistakes, without any need for me to get involved. And because Rust's type system and safety guarantees are so strict, by the time it gets it to compile it _probably_ works.
It's been my assumption since that experience (my first experience using a coding agent), that it would be very good at writing machine-checked proofs.
I’ve been working with gpt-5-high recently and while it’s slow as molassess it’s also pretty darn impressive in what kinds of problems it can solve. I’ve given it coding tasks which Claude fumbled and it dealt with them given the same set of tools. It was honestly impressive; it feels like it’s on the verge of unlocking a new capability by being reliable enough.
I wouldn’t be surprised if the author was taken off guard by gpt-5 if he was already impressed by Claude code.
Among other general advice, my CLAUDE.md insists Claude prove to me each unit of change works as expected, and I'm usually just hoping for it to write tests and convince me they're actually running and passing. A proof assistant seems overkill here, and yet Claude often struggles to assemble these informal proofs. I can see the benefit of a more formal proof language, along with adding a source of programmatic feedback, compared to open-ended verbal proof.
"Overkill" of course is an editorial word, and if you know about https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspon... then you know many statically typed programming languages are essentially proof assistants, where the proof goal is producing well-typed programs. LLMs are already quite good at interacting with these programming language proof assistants, as you can see any time a competent LLM interacts with the Rust borrow checker, for example.
I've worked with Claude Code/Codex, and they're amazing, but I have a question.
I have never worked with tools such as Lean, but if an LLM writes a proof with Lean, how can one be sure that the code itself (in the proof) is correct? Or is Lean so strict that this is unlikely to happen?
If it compiles, it typechecks. If it typechecks, the proof is correct. This is a consequence of the Curry-Howard correspondence [0].
From a pure mathematician's standpoint, the content of a proof is kind of (waving big hands here) irrelevant. It's more that the existence of a proof implies the theorem is true, and thus can be used to prove other theorems. Because of this "proof-irrelevance", a great foil to an LLM's propensity to hallucinate is something like Lean. In Lean, and in theorem provers oriented towards classical mathematics, the statement of the theorem being proved is what matters most. And due to Curry-Howard, the statement is equivalent to the signature/declaration of the function, which is what the human ought to verify, while Lean will verify the body.
The thing I'm stuck on is that, yes, the proof is correct.
But what guarantees do I have that the proof or theorem is actually representative of the program I'm trying to specify?
And beyond that, what guarantees are there that the resulting end program actually adheres to either the specs or what has been proven?
Like it seems the goal here is `English -> Lean -> Code in whatever language` instead of just `English -> Code` but there's no way to be certain the translation step on either side of Lean has actually been done correctly.
I've done a lot of experiments around this so far this year and theorem provers can certainly help provide LLMs with more context but there is no tooling to actually verify that the resulting Python fully covers the corresponding Lean that was fed in.
It's often said that formal verification moves bugs from the implementation to the specification, and there's also the problem of proving equality between the formally proven specification and the implementation.
The first problem is just hard, the second can be approached by using a language like F* or Dafny that compiles to executable code. IIRC Amazon has had some success by using Lean for the specification and using fuzzing to test that the implementation matches.
You're using an LLM to translate Lean to Python? Care to tell us more?
If you want to prove stuff about programs then Lean is a very capable tool. But it's intended that you write the program itself in Lean and not in some other language.
I guess if you ask an LLM nicely it will attempt to translate a program that has already been written in Lean to one in Python, but why would you want to do that? I doubt it will run any faster.
> If it compiles, it typechecks. If it typechecks, the proof is correct. This is a consequence of the Curry-Howard correspondence
I think this perhaps the most common misconception of how formal verification works. If the program typechecks, then you have a proof that the program (or often a model of the program) implements the specification. However, you still have to make sure that your _specification_ is correct! This is one of the hardest parts of formal verification!
This neatly sums up my experience with Claude Code. It’s a brilliant tool - but one that often requires a tight leash. The challenge is that you won’t know when or where until you’ve used it extensively for your specific use case.
In the author’s case, producing a formal proof that inspires little confidence feels counterproductive. It’s very likely that we’re missing a key piece here: perhaps what’s needed is a model trained specifically to keep large language models on task, verify their output, and challenge them on our behalf.
That said, deep domain knowledge remains essential. Without it, things can easily get built in odd or unintended ways.
A practical workaround (for typical app development/not formal verification) may be to treat the system as a black box - relying on carefully written specifications or test cases. In many situations, that approach could be more effective than trying to wrestle certainty out of inherently uncertain processes.
I'm not at all surprised that LLMs are good at interactive theorem proving.
But I am surprised that Claude Code is good at it. Maybe I'm just not using it right but when I have a task in a less popular programming language, Gemini 2.5 pro tends to be much better than Claude Opus despite Opus doing better on benchmarks. Better in the sense that Gemini will typically bang out a correct solution with best practices immediately and Opus might take half a day of correcting basic errors and misunderstandings.
If anyone knows what I'm talking about and knows a way to improve this please let me know. Claude still feels to me like an over-eager engineer that is more eager to get thing done fast than to do them correctly. It may be that with a task like theorem proving you can just burn credits and time because there's a clear answer.
I use the Gemini mcp and my CLAUDE.md has instructions to consult Gemini for strategy. Seems to work well. I have the lowest Claude plan so I don’t know how this would work vs Opus, for example.
Separately, I have been meaning to implement a cheating detector — have run into Claude modifying problem statements, adding axioms, etc.
> have run into Claude modifying problem statements, adding axioms, etc.
Same here. I've thought about creating a utility that tells Claude it has to keep going until a test exits with nonzero status. But I'm concerned Claude would just fake everything to make the test pass.
You could do that -- my understanding is that MCPs give Claude less to "think about" than having to use another program correctly, and therefore avoid context clutter. I could be wrong.
This matches my experience with Claude on other projects. Its output is much better when it can work against detailed specifications and tests, and Lean seems to be an extreme case of that.
My issue with formal proofs for code is that it assumes that the code is static and requirements are fixed and won't change. This assumption doesn't hold up in the vast majority of projects.
Surely LLMs can make this more practical but I think many projects suffer from a more mundane problem that they aren't architected properly. Without proper separation of concerns and proper abstraction, you won't get stability in the base modules, which means that you will have to re-evaluate and re-write tests and formal proofs constantly.
How hypocritical of someone preaching the virtues of formal proofs to be using a single example to try to prove the 'for all' case... As if it's feasible for all systems, for all requirements, under all possible financial constraints.
Oh yes, I'm the LLM here.
Maybe you should run your comments through your Coq proof-assistant before you post them!
Buried about 2/3rds of the way through the article: “I’m about 50% of the way through the formalization plan…”
As soon as I saw this, I became a skeptic. Anyone who has used AI tools has seen cases where the first 80% of a project came together like a bolt of lightning, but the last 20% is next to impossible for the AI to accomplish, even if it doesn’t seem more complex than the rest of the code. AI just struggles as the code gets larger and the requests get more specific. Anyway, never trust a programmer to accurately judge the “50%” mark. Once they are at 90%, they will need another 90% to cross the finish line.
I actually want to push back a bit on this. I don’t have the context on what the author has done, but proof engineering is a bit different from software engineering. In particular, if you get definitions right and if your plan actually makes sense mathematically (i.e. the proof is possible using your planned breakdown), you have a much more linear sense of progress. You don’t have “last minute surprises” which are only uncovered when the program runs. You do have surprises with modeling (hence my caveats above). But there’s also a much clearer sense that if the “contours” of the proof plan are correct, filling them in is much more parallelizable and more straightforward than with programming.
I think partially this is due to the language being pure (so there’s no surprising side effects, global state, or “time”) and partially due to the “if it typechecks, you’re done” nature of proofs. Also mathematics is (unsurprisingly) composable — more than run-of-the-mill software development.
> “if it typechecks, you’re done” nature of proofs.
This resonates with me.
Software engineering is more art than math -- you have to deal with corner cases, exceptions, human interactions. So much of code encodes culture, preferences, storytelling, etc. It's kind of a "procedural epistemology". You are codifying -- without correctness -- the behaviors that you know.
Pure Math doesn't work like that -- it is more declarative and is built on a structure of correct statements. (numerical math is a different animal -- it's closer to engineering -- but proofs are pure math). It's way more limited and restrictive and rigorous than imperative code.
It's like writing in Haskell. If it compiles in Haskell, the result might not be useful in real life, but there's a degree of consistency that languages like Python cannot enforce.
But even Haskell for practical reasons only aims at consistency rather than correctness. Proof assistants like Lean and Coq go all the way in Curry-Howard correspondence.
I don't really see there being much of a difference between the modeling surprises and last-minute surprises only uncovered when the program runs. If the gist of what you're saying is that an experienced proof engineer should have fewer modeling surprises than an experience software engineer has last-minute code execution surprises, then I can sort of agree. But the "contours" of a proof you're describing are basically analogous to decomposing an application into individual components (modules/functions) --- and if you do this and test your components you should expect your software to not have any "last minute surprises" except for modeling failures as well.
I guess the point I mostly disagree with is the feasibility of what you describe here:
> In particular, if you get definitions right and if your plan actually makes sense mathematically (i.e. the proof is possible using your planned breakdown), you have a much more linear sense of progress.
This presupposes that you know how the proof should be written, which means that you not only have an understanding of the proof mathematically (whatever your definition of "mathematically" is), but you ALSO know how to warp that understanding into a machine proof. The author notes that the "getting the definitions right" part is slower than them writing out the definitions themselves for a variety of reasons which ultimately mean they had to carefully comb through the definitions to make sure there were no logic errors.
So I don't really disagree with your overall point (which is that the author can probably be trusted to say they are actually at 50%), but I think you oversell the assurances granted by formal verification when compared to software engineering.
It’s not necessarily smooth sailing for sure. But in my (limited!) experience, there is a noticeable difference, so that’s what I wanted to note.
There indeed can be a gap between “it works in my head” and convincing the machine to do it, and sometimes the gap is wide. For example, type-theoretic way of looking a things (as customary in Lean) might not match the idioms in the paper proof, or some crucial lemmas known to be true might have never been formalized. Or the tactics to prove some piece are getting too slow. Or you got the definitions wrong or they’re annoying to deal with. So that is its own set of difficulties.
Even with that, the nature of difficulties feels different from much of software work to me. In software, I find that “divide and conquer” rarely works and I need to “grow” the program from a primitive version to a more sophisticated one. But in proofs, “divide and conquer” actually sort of works — you gain some assurances from making a high-level proof with holes in the middle, then filling in those holes with more intermediate steps, and so on. You can actually work “top down” to a meaningful extent more than in software.
To put it another way, with proofs, you can work top-down and bottom-up at the same time, making real progress while a bunch of steps are stubbed out. In programming, “stubbing out” non-trivial parts of the program is often annoying or impossible, and you can only make progress on a smaller (but complete) piece of it, or a less ambitious (but still complete) piece of it. This is because you can’t run an incomplete program, and there’s a set of knowledge you only gain from running it. But you absolutely can check an incomplete proof, with arbitrary granularity of incompleteness.
So I agree with you it’s not necessarily linear. But it’s also not exactly like software, and analogies from software work don’t apply 1:1.
Your comment basically perfectly describes why Scala is a wonderfully pleasant language to work with. I could review code and know that if it had the right high level steps and if it compiled, it was probably right. When writing code you figure out the high level steps and then write whatever you need to make it type check. If you don't go out of your way to do something perverse, it's probably correct.
> You don’t have “last minute surprises”
Not sure about that. One of the steps in your plan may be surprisingly hard to prove. You're going to solve the easier steps in your plan first and then may get bogged down at 98%.
Sure, which is why the words you copied are followed by these in my comment:
>… which are only uncovered when the program runs. You do have surprises with modeling (hence my caveats above)
If you already have a paper proof (which is the case in the original post), and you know it’s correct, then you already know all steps are provable — but if they aren’t translated correctly into the theorem prover’s model, or if some details don’t translate well, or if you’re missing some existing result that hasn’t been formalized, you can get stuck. AI doesn’t make that worse though.
> If you already have a paper proof (which is the case in the original post), and you know it’s correct
You may think it's correct until you realize that your argument transitively depends on some obscure results from the 70s that nobody had ever questioned but that turned out to be flawed. Probably salvageable, but now you suddenly need to prove something completely different that you might not even understand very well. There's an interview with Kevin Buzzard somewhere where he talks about how that happened to them (couldn't find a reference rn).
To formalize: After a few days 90% works, after a few more days 90% of the remaining 10% works, after a few more days.... long story short, it never works.
AI is very fast at generating working self contained complicated code, which makes it too easy to establish a difficult to finish architecture.
If you don't specify otherwise, yes.
With respect, if vibe coding isn't giving you the results you want, the problem is no longer the AI.
[dead]
Seeing this in consulting/ contract dev too. People come to me with “80% done” but in reality it’s the 80/120 rule, the last 20% takes 120% of the time the project should have taken.
Ah the classic:
"That will be about 40 hours work".
"Well I've done 80% of the work already, so-"
"Oh, then that will be about 80 hours of work".
Christ, you are so correct.
> I’m about 50% of the way through the formalization plan > I think formalizing using AI was way slower than if I’d done it by hand.
combined with their aversion to learning something hard alone:
> I found learning Lean annoying and difficult. So I started looking for shortcuts.
The author failed utterly according to their own definition of success. Moreover, how will they know they have completed the task if they don't want to fully understand what they've 'written'?
The paper presents a false trichotomy:
(1) go to grad school and learn to do it yourself (2) hope one of the few theorem proving experts got interested in your problem (3) give up. and the magical fourth: (4) Look Just Try Claude Code (at just $100 per month!)
"got interested" is so utterly passive, I've got a fifth option. (5) Meet other people who share your interests on discord and forums and chat about the problems you are trying to solve. Make them interested! Nerd snipe them! Make friends, meet colleagues and learn new things while having a genuine human experience, for free!
You’re absolutely right about this.
My experience points to that until the last sorry is eliminated the math might look like it will fit together, but yet not ultimately all fit together.
> Anyone who has used AI tools has seen cases where the first 80% of a project came together like a bolt of lightning, but the last 20% is next to impossible for the AI to accomplish
Offtopic: doing the inverse works better.
If you rely on AI to write 80% of your code, you end up with code you don't understand, so the last 20% is really difficult. You can't even guide the AI correctly at that point.
I actually do the opposite. I write 80% of my code, have an LLM read it, and it helps me iterate faster on the last 20% (which are mostly tedious things like unit tests, corner cases, optimizations, etc). This works surprisingly well because now the AI has a structure to start from, and it's only doing incremental code completion. As a human, I'm also not that great at detecting corner cases -- this is where an LLM really helps because it has seen patterns I've never seen.
In fact, I use coding agents to combine multiple sources of information (not just code) like design docs, LaTeX, etc. and I ask the LLM to detect if the code actually implements the math in the LaTeX correctly. Claude actually does a pretty decent job at that.
The problem is you're thinking of projects in the sense of 0-100.
If you think of projects in the sense of 0-20, 20-40, 40-60, 60-80, 80-100 you realize AI is here to stay and is going to change everything.
WYSIWYG GUI editing is also ‘here to stay’.
80-100 will take orders of magnitude more time (might even be impossible) if you don't understand 0-80 thoroughly.
Nope.
I regularly do 80-100 chunks in a mature codebase. Literally every day at work.
I know it feels good to pretend this isn't happening but it's too dangerous to pretend.
How many LOC?
I don't really know how to answer that, it's a highly fragmented legacy codebase that is responsible for many millions in revenue.
I do know that given my experiences though it's time to sound the alarms as many others have.
We're clearly in an emergency situation, it's frustrating to see this ignored due to various mechanisms some of which are huge geopolitical incentives to pump out propaganda that ai is safe and we shouldn't slam the brakes, which would cause the us to lose to China or vice versa.
Every nation state actor is trying tirelessly to attenuate panic but it's here, its really happening.
Has been happening since chatgpt 1.
I am tired boss.
I will quote (approximately) Simon Peyton Jones: "writing programs is not hard. Specifying what a program should do is hard."
Prove what? The author is well versed in the problem of specifications as the aliens-demand-gcc-correctness-or-else post. I also enjoyed the other post where they say "no one cares about correctness" and "people care about compliance".
It is safe to say, most people are not experts and will never become experts.
Being proficient in the business of coming up with suitable specification that can be implemented and getting assurance that an implementation meets the spec will most likely need all the kind of formal training for humans that the AI hype industry assures us is no longer necessary.
So it doesn't much help that LLMs are good at manipulating the tokens when used by experts, at least not in a big way. It can be hoped that they change cost-benefit balance in good ways, but the bigger problems of the industry will simply persist.
What would need to change for a more fundamental shift is getting a lot more people to understand the value of specifications. This is an education problem and paradoxically I could see AI helping a lot... if only there was enough interest in using AI to help humans become experts.
This change is happening as we speak. My work repo is now sprouting with markdowns primarily intended for agentic consumption, but it’s all dense jargon communicating requirements and important design decisions in a way which is useful to me as a software engineer. The toil of writing these specs from low level code is much lower today, because LLMs also help with that immensely.
In my workplace happens this, but in a bad way. There's a push to make use of AI as much as we can to "boost productivity", and the one thing people don't want to do is write documentation. So what ends up happening is that we end up with a bunch of AI documentation that other AIs consume but humans have a harder time following because of the volume of fluff and AI-isms. Shitty documentation still exists and can be worse than before...
Other than humans getting apoplectic at the word "delve" and — emdashes, can you explain and give some examples or say more about how AI-isms hurt readability?
Having encountered this spread across our orgs greenfield codebases which made heavy use of AI in the last 90 days: Restating the same information in slightly different formats, with slightly different levels of detail in several places, in a way that is unnecessary. Like a "get up and running quickly" guide in the documentation which has far more detail than the section it's supposed to be summarizing. Jarringly inconsistent ways of providing information within a given section (a list of endpoints and their purposes, followed by a table of other endpoints, followed by another list of endpoints). Unnecessary bulleted lists all over the places which could read more clearly as single sentences or a short paragraph. Disembodied documentation files nested in the repos that restate the contents of the README, but in a slightly different format/voice. Thousands of single line code comments that just restate what is already clear to the reader if they just read the line it's commenting on. That's before getting into any code quality issues themselves.
I've noticed AI generated docs frequently contain bulleted or numbered lists of trivialities, like file names - AI loves describing "architecture" by listing files with a 5 word summary of what they do which is probably not much more informative than the file name. Superficially it looks like it might be useful, but it doesn't contribute any actually useful context and has very low information density.
[dead]
A piece of information, or the answer to a question, could exist in the documentation but is not in a format that's easily readable to humans. You ask the AI to add certain information, and it responds with "I already added it". But the AI doesn't "read" documents the way humans read.
For instance, say you need urgent actions from other teams. To this effect you order an AI to write a document and you give it information. The AI produces a document following it's own standard document format with the characteristic AI fluff. But this won't work well, because upon seeing the urgent call for action the teams will rush to understand what they need to do, and they will be greeted by a corporate-pr-sounding document that does not address their urgent needs first and foremost.
Yes, you could tell the AI how to make the document little by little... but at that point you might as well write it manually.
There is a place for informal, prose specs, and I can easily agree that more people are nowadays describing their programs in English.
The context here is formal specs though - adequately and precisely capturing the intended meaning (semantics) in a way that lends itself to formal verification.
Interactive theorem proving is interactive because proof search is intractable; a human articulates the property they want to prove and then performs the "search". Apart from the difficulties of getting to that proof, it can also happen that all goes through and you realize that the property is not exactly what you wanted.
I’m curious if your markdowns are discernible by other engineers? Or are they purely only discernible by you and an LLM?
I think you're missing a more internecine debate within software correctness/formal methods circles than the more general one about "writing good/correct software". Deductive theorem proving was all the rage in the 1970s, to the point that prominent people (Tony Hoare, Dijkstra) believed that it's the only way to write correct software. Over the years, the focus has shifted to other methods [1], from model-checking to unsound methods such as concolic testing and even property-based testing and code reviews, which have proven more effective than the deductive proof proponents had envisioned (Tony Hoare admitted his mistake in the nineties).
The problem with deductive proofs is that it's all-or-nothing, i.e. either you prove the theorem or you don't, and it's much harder to turn a knob to trade off cost for confidence and vice-versa than with other methods. Consequently, theorem proving has not benefitted from automation as much as it's alternatives, and is now viewed as having a too low bang-for-the-buck. But if LLMs can help automate deductive theorem proving, maybe that particular approach, which has gradually become less popular within formal methods circles but still has its fans, could make a comeback.
Of coure, specification is equally important in all methods, but the particular method of theorem proving has been disadvantaged compared to other formal and semi-formal methods, and I think that's what the author wishes could change.
[1]: Deductive theorem proving may be more prominent on HN than other formal methods, but as with most things, HN is more representative of things that spark people's imagination than of things that work well. I'm not counting that as a strike against HN, but it's something to be aware of. HN is more Vogue/GQ than the Times, i.e. it's more news about what people wish they were doing than what they're actually doing.
> The problem with deductive proofs is that it's all-or-nothing, i.e. either you prove the theorem or you don't, and it's much harder to turn a knob to trade off cost for confidence and vice-versa than with other methods.
This is not really true though. Sound type checking is a kind of deductive proof, but it's practically usable because it establishes very simple properties and is generally "local", i.e. it follows the same structure as program syntax.
Well, that's true for simple type systems. Dependent types are not so practically usable. You're absolutely right that simple types prove some "simple" properties, but we know what characterises them: they're inductive (or composable) invariants, i.e. P(a ∘ b) ⇔ P(a) ∧ P(b), for some program composition operator ∘ (I'm papering over some important but nuanced details). It's just that most program correctness properties we're interested in are not inductive, so the "proving power" of simple types is not strong (another way of describing it is that it's not powerful enough to express propositions with interleaved universal and existential quantifiers).
So when we talk about formal methods, we're usually interested in those that can express and/or "verify" (to some degree of confidence), non-composable properties and propositions that involve interleaved quantifiers.
>writing programs is not hard. Specifying what a program should do is hard.
But these two things are kinda the same thing. Writing a program is mostly just specifying what it should do in a very specific, unambiguous way. That’s why we have programming languages and don’t use English.
A PRD, a design doc and the code itself are kind of all the same thing, but with increasing levels of specification.
If you’ve ever played with formal specifications at some point you probably had the realization “wait a minute, I’m just writing another implementation of the program!” and in a way you are.
Spoken language descriptions of how to do something are incredibly underdetermined. Many people are likely familiar with the meme of following exact instructions on making a PB&J where a phrase like “put a knife in your hand and spread the peanut butter” can result in stabbing yourself and rubbing peanut butter on the wound. More than half of my job is helping other people know or otherwise determining what they really want a program to do in the first place.
This sounds true but isn't.
"The program should accept one parameter as a number, and output this number multiplied by two" is a spec.
is a program.But: why do that? why multiply by two? what does two represent? what do the input and output mean in the real world? what part is likely to change in the near future? etc.
Wrting the program isn't harder than writing the spec. Analyzing the problem and coming up with solutions is the hard part.
(And yes, LLMs are getting better at this everyday, but I think they still have a very long way to go.)
You read like Alan Kay's Quora and for that I salute you comrade.
> What I’ve found is that given a tool that can detect mistakes, the agent can often correct them
This is the most important line of the entire article
When iterating on a Manifesto for AI Software Development (https://metamagic.substack.com/p/manifesto-for-ai-software-d...) over the last two years the key attribute more than any other that I found was empirical validation. While AI (and humans) are not able to accurately judge their own work when we give AI (and human) the ability to do empirical validation its success skyrockets. This might be intuitive, but there are still papers testing that this applies to AI too. While reaching to have the AI write unit tests I've been embracing fuzzing because then AI can't cheat with bonus tests. The idea of reaching back to school and using interactive theorem proving didn't even cross my mind and now that it has been presented it is a whole paradigm shift on how to push my AI use forward so it can work even more autonomously.
AI can iterate at speeds humans can't. When it has even basic empirical validation (building the code, running tests) it removes the human from the loop. Move that to using fuzzing (such as with golang) and you get way better coverage and way better progress before a human has to intervene. So it isn't a surprise that interactive theorem proving is a perfect match for AI.
It is interesting how this same lesson plays out elsewhere, earlier in the article
> Why is ITP so hard? Some reasons are quite obvious: interfaces are confusing, libraries are sparse, documentation is poor, error messages are mysterious. But these aren’t interesting problems
Remember when llvm got really good c++ error messages and it was life changing? High quality error messages means we could find/fix the error fast and iterate fast. These are actually the MOST interesting problems because it enable the user to learn faster. When a user has high success they will use a product again and again. High quality error messages in all tools will enable Claude code to be able to work longer on problems without human intervention, make less mistakes and overall work faster.
While error messages should always be good a new question that really hammers this home is "When AI encounters this error message, can it fix the problem?"
This.
It’s actually extremely good bang for buck now to invest more time in various linters or static checking that enforce more correctness / project-specific rules.
Not only does the AI help you write the linter in 10% of a time - linters are usually fairly short scripts that are easy to test in practice, so the happy path for AI).
But asking the AI “make this linter pass” after you’ve e.g. done some extensive refactors, is very effective and time-saving.
I think everyone who has heard of interactive theorem provers saw that this is going to be a match made in heaven.
LLMs reduce implementation effort but pile on massive review effort. Interactive theorem provers reduce review effort but the implementation is much more work.
Together you get the best of both worlds at the speed of machine thought.
Combining generative ML (and other heuristic methods) with formal methods is the most promising way forward for systems design. Without formal methods (and other constraints such as limiting systems complexity) we are going to vibe-code ourselves to chaos.
> we are going to vibe-code ourselves to chaos.
To be fair, we've been doing that even without AI for the past 5 decades.
> I find this very surprising, and you probably should too.
Why? My first assumption would be that Claude Code would be exceptionally good at writing Lean proofs, for many of the reasons given in the article. It's been on my todo list for a while to try out some of my coding workflows on proof writing. I'm glad OP did.
But why does he find this surprising? That wasn't laid out.
EDIT: To be clear, I'm talking about this bit:
> I think this is part of why Claude Code is surprisingly good at theorem proving. Lean is very strict in the programs it will accept. This makes writing Lean arduous for humans, but it means an AI agent gets detailed feedback.
I first encountered this in writing Rust programs, first with Cursor (using anthropic models), and now with Claude Code. Things have gotten better, but initially Rust was not very well represented in the training sets used, and the agents would be horrendous beginner mistakes. Writing Javascript-like code with Rust-ish syntax, and expecting it to work.
But the rustc compiler gives very good errors, with context and suggestions for fixing. In two or three iterations it would fix all the mistakes, without any need for me to get involved. And because Rust's type system and safety guarantees are so strict, by the time it gets it to compile it _probably_ works.
It's been my assumption since that experience (my first experience using a coding agent), that it would be very good at writing machine-checked proofs.
I’ve been working with gpt-5-high recently and while it’s slow as molassess it’s also pretty darn impressive in what kinds of problems it can solve. I’ve given it coding tasks which Claude fumbled and it dealt with them given the same set of tools. It was honestly impressive; it feels like it’s on the verge of unlocking a new capability by being reliable enough.
I wouldn’t be surprised if the author was taken off guard by gpt-5 if he was already impressed by Claude code.
Among other general advice, my CLAUDE.md insists Claude prove to me each unit of change works as expected, and I'm usually just hoping for it to write tests and convince me they're actually running and passing. A proof assistant seems overkill here, and yet Claude often struggles to assemble these informal proofs. I can see the benefit of a more formal proof language, along with adding a source of programmatic feedback, compared to open-ended verbal proof.
"Overkill" of course is an editorial word, and if you know about https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspon... then you know many statically typed programming languages are essentially proof assistants, where the proof goal is producing well-typed programs. LLMs are already quite good at interacting with these programming language proof assistants, as you can see any time a competent LLM interacts with the Rust borrow checker, for example.
I've worked with Claude Code/Codex, and they're amazing, but I have a question.
I have never worked with tools such as Lean, but if an LLM writes a proof with Lean, how can one be sure that the code itself (in the proof) is correct? Or is Lean so strict that this is unlikely to happen?
If it compiles, it typechecks. If it typechecks, the proof is correct. This is a consequence of the Curry-Howard correspondence [0].
From a pure mathematician's standpoint, the content of a proof is kind of (waving big hands here) irrelevant. It's more that the existence of a proof implies the theorem is true, and thus can be used to prove other theorems. Because of this "proof-irrelevance", a great foil to an LLM's propensity to hallucinate is something like Lean. In Lean, and in theorem provers oriented towards classical mathematics, the statement of the theorem being proved is what matters most. And due to Curry-Howard, the statement is equivalent to the signature/declaration of the function, which is what the human ought to verify, while Lean will verify the body.
[0]: https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspon....
The thing I'm stuck on is that, yes, the proof is correct.
But what guarantees do I have that the proof or theorem is actually representative of the program I'm trying to specify?
And beyond that, what guarantees are there that the resulting end program actually adheres to either the specs or what has been proven?
Like it seems the goal here is `English -> Lean -> Code in whatever language` instead of just `English -> Code` but there's no way to be certain the translation step on either side of Lean has actually been done correctly.
I've done a lot of experiments around this so far this year and theorem provers can certainly help provide LLMs with more context but there is no tooling to actually verify that the resulting Python fully covers the corresponding Lean that was fed in.
It's often said that formal verification moves bugs from the implementation to the specification, and there's also the problem of proving equality between the formally proven specification and the implementation.
The first problem is just hard, the second can be approached by using a language like F* or Dafny that compiles to executable code. IIRC Amazon has had some success by using Lean for the specification and using fuzzing to test that the implementation matches.
You're using an LLM to translate Lean to Python? Care to tell us more?
If you want to prove stuff about programs then Lean is a very capable tool. But it's intended that you write the program itself in Lean and not in some other language.
I guess if you ask an LLM nicely it will attempt to translate a program that has already been written in Lean to one in Python, but why would you want to do that? I doubt it will run any faster.
> If it compiles, it typechecks. If it typechecks, the proof is correct. This is a consequence of the Curry-Howard correspondence
I think this perhaps the most common misconception of how formal verification works. If the program typechecks, then you have a proof that the program (or often a model of the program) implements the specification. However, you still have to make sure that your _specification_ is correct! This is one of the hardest parts of formal verification!
Exactly. It does not mean your formal specification is correct (of the code and what it should do).
If it compiles it definitely proves something, which may or may not be what Claude tells you it proves.
This neatly sums up my experience with Claude Code. It’s a brilliant tool - but one that often requires a tight leash. The challenge is that you won’t know when or where until you’ve used it extensively for your specific use case.
In the author’s case, producing a formal proof that inspires little confidence feels counterproductive. It’s very likely that we’re missing a key piece here: perhaps what’s needed is a model trained specifically to keep large language models on task, verify their output, and challenge them on our behalf.
That said, deep domain knowledge remains essential. Without it, things can easily get built in odd or unintended ways.
A practical workaround (for typical app development/not formal verification) may be to treat the system as a black box - relying on carefully written specifications or test cases. In many situations, that approach could be more effective than trying to wrestle certainty out of inherently uncertain processes.
I'm not at all surprised that LLMs are good at interactive theorem proving.
But I am surprised that Claude Code is good at it. Maybe I'm just not using it right but when I have a task in a less popular programming language, Gemini 2.5 pro tends to be much better than Claude Opus despite Opus doing better on benchmarks. Better in the sense that Gemini will typically bang out a correct solution with best practices immediately and Opus might take half a day of correcting basic errors and misunderstandings.
If anyone knows what I'm talking about and knows a way to improve this please let me know. Claude still feels to me like an over-eager engineer that is more eager to get thing done fast than to do them correctly. It may be that with a task like theorem proving you can just burn credits and time because there's a clear answer.
I use the Gemini mcp and my CLAUDE.md has instructions to consult Gemini for strategy. Seems to work well. I have the lowest Claude plan so I don’t know how this would work vs Opus, for example.
Separately, I have been meaning to implement a cheating detector — have run into Claude modifying problem statements, adding axioms, etc.
Okay thanks I'll try that.
> have run into Claude modifying problem statements, adding axioms, etc.
Same here. I've thought about creating a utility that tells Claude it has to keep going until a test exits with nonzero status. But I'm concerned Claude would just fake everything to make the test pass.
Do you need an MCP for that? Why not tell Claude to call the Gemini CLI?
You could do that -- my understanding is that MCPs give Claude less to "think about" than having to use another program correctly, and therefore avoid context clutter. I could be wrong.
I'm curious, are there programs that generate challenges for AI, and why don't we read about them?
(I'm talking about regular programs, not other LLMs or GANs)
This matches my experience with Claude on other projects. Its output is much better when it can work against detailed specifications and tests, and Lean seems to be an extreme case of that.
My issue with formal proofs for code is that it assumes that the code is static and requirements are fixed and won't change. This assumption doesn't hold up in the vast majority of projects.
Surely LLMs can make this more practical but I think many projects suffer from a more mundane problem that they aren't architected properly. Without proper separation of concerns and proper abstraction, you won't get stability in the base modules, which means that you will have to re-evaluate and re-write tests and formal proofs constantly.
This issue seems hallucinated. In the real world we just update our proofs. https://sos-vo.org/node/69931
How hypocritical of someone preaching the virtues of formal proofs to be using a single example to try to prove the 'for all' case... As if it's feasible for all systems, for all requirements, under all possible financial constraints.
Oh yes, I'm the LLM here.
Maybe you should run your comments through your Coq proof-assistant before you post them!