> you can easily check whether it builds and passes tests.
This link was on HN recently: https://spectrum.ieee.org/ai-coding-degrades

  "...recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution."
The trend for LLM-generated code is to build and pass tests but not deliver the functionality needed.

Also, please consider how SQLite is tested: https://sqlite.org/testing.html

The ratio of test code to source code is a mere 590-to-1 (590 LOC of tests per LOC of actual code); it used to be more than 1100.

Here are the notes on the current release: https://sqlite.org/releaselog/3_51_2.html

Notice the fixes there. Despite being one of the most tested pieces of software in the world, if not the most tested, it still contains errors.

  > If you want to go further, you can even require the LLM to produce a machine checkable proof that the software is correct.
Haha. How do you reconcile a proof with actual code?

I've recently seen Opus, after struggling for a bit, implement an API by having it return JSON that includes instructions for a human to manually accomplish the task I gave it.

It proudly declared the task done.

I believe you have used the Albanian [1] version of Opus.

[1] https://www.reddit.com/r/ProgrammerHumor/comments/1lw2xr6/hu...

Recent models have started to "fix" HTML/CSS issues with ugly hacks like !important. The result looks like it works, but the tech debt is considerable.

Still, it's just a temporary hindrance. Nothing a decent system prompt can't take care of until the models evolve.

> Haha. How do you reconcile a proof with actual code?

You can either prove your Rust code correct, or you can use a proof system that lets you extract executable code from the proofs. Both approaches have been done in practice.

Or what do you mean?

Rust code can have arbitrary I/O effects anywhere in it. That precludes using Rust's type system alone to make sure the code does what the spec said.
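A minimal sketch of that point (the function and its message are made up for illustration): a Rust signature that reads as pure says nothing about what the body actually does, so the types alone cannot tie the code to a functional spec.

```rust
// Hypothetical example: the signature looks like a pure function,
// but Rust's type system does not track effects, so the body can
// perform arbitrary I/O without the signature changing at all.
fn add(a: u32, b: u32) -> u32 {
    println!("performing I/O the spec never mentioned"); // arbitrary side effect
    a + b
}

fn main() {
    // the *value* matches the spec; the side effect is invisible to the types
    assert_eq!(add(2, 3), 5);
}
```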

The most successful formally proven project I know of, seL4 [1], did not extract executable code from the proof. They created a prototype in Haskell, mapped it by hand to Isabelle (I believe) to obtain a formal proof, and then recreated the code in C, again manually.

[1] https://sel4.systems/

Not many formal proof systems can extract executable C source.

> Haha. How do you reconcile a proof with actual code?

Languages like Lean allow you to write programs and proofs under the same umbrella.
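For instance (a toy example, not from this thread), a definition and machine-checked theorems about it can live in the same Lean file and go through the same compiler:

```lean
-- a program...
def double (n : Nat) : Nat := n + n

-- ...and proofs about it, checked at compile time
theorem double_eq (n : Nat) : double n = n + n := rfl

theorem double_two : double 2 = 4 := rfl
```

Both proofs here close by definitional unfolding (rfl); the point is only that program and proof share one toolchain.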

Except that Lean lets you circumvent its proof system (the sorry keyword).
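Concretely, sorry makes the checker accept any claim, true or not, emitting only a warning:

```lean
-- this "theorem" is false, yet the file still compiles,
-- with only a "declaration uses 'sorry'" warning
theorem one_eq_two : 1 = 2 := by
  sorry
```

So "the proof checker passed" only means something if you also confirm the development is sorry-free, e.g. by checking with #print axioms that a theorem does not depend on sorryAx.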

Also, consider adding your code to a bigger system written in C++. How would you use Lean to prove the correctness of your code as part of that bigger system?

I mean, it's somewhat moot, as even the formal hypothesis ("what is this proof proving") can be more complex than the code that implements it in nontrivial cases. So verifying that the proof is saying the thing that you actually want it to prove can be near impossible for non-experts, and that's just the hypothesis; I'm assuming the proof itself is fully AI-generated and not reviewed beyond running it through the checker.

And at least in backend engineering, for anything beyond low-level algorithms you almost always want some workarounds: for your customer service department, for engineering during incident response, for your VIP clients, etc. If you're relying on formal proof of some functionality, you've got to create all those allowances in your proof algorithm (and hypothesis) too. And additionally nobody has really come up with a platform for distributed proofs, durable proof keys (kinda), or how to deal with "proven" functionality changes over time.