> The code must pass property-based tests

Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.
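For readers unfamiliar with the term: a property-based test checks invariants over many randomly generated inputs rather than a few fixed examples. A minimal hand-rolled sketch (real libraries like Hypothesis add input shrinking and smarter generation; the function names here are my own):

```python
import random

def check_sort_properties(sort_fn, trials=200):
    """Properties of a correct sort: the output is ordered, and it is a
    permutation (same multiset of elements) of the input."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        ys = sort_fn(xs)
        # Property 1: each element is <= its successor.
        assert all(a <= b for a, b in zip(ys, ys[1:])), f"not ordered: {ys}"
        # Property 2: no elements were added, dropped, or duplicated.
        assert sorted(xs) == sorted(ys), f"not a permutation: {xs} -> {ys}"
    return True

check_sort_properties(sorted)
```

The point of the thread stands, though: the properties themselves are still written by someone, and trusting them is a separate question from trusting the code.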

There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to keep evolving. That's also what happened with Anthropic's failed attempt to have agents write a C compiler (not a trivial task, but far from an exceptionally difficult one). They had thousands of good human-written tests, but the agents couldn't get the software to converge: fixing one bug only created another.

The 'who writes the tests' question is the crux of it. If the same model writes both the code and the tests, you're essentially asking it to find its own blind spots - which by definition it can't.

On the convergence issue with the C compiler - I've hit the same pattern. The root cause in my experience is context accumulation: as the agent iterates on fixes, its context fills up with the history of its own failed attempts, and each new fix is increasingly biased by that history. It ends up chasing its own tail. Two things that helped:

1. Isolating test-writing from code-writing across separate agents with no shared context, so the tests genuinely come from an independent interpretation of the spec.
2. Giving each fix attempt a clean context rather than letting the agent accumulate 20 rounds of 'I tried X and it didn't work.'

The evolvability problem is harder, though. That's less about verification and more about the model having no concept of future requirements. I don't have a good answer for that one yet.
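The isolation pattern described above can be sketched as an orchestration loop. Everything here is hypothetical scaffolding - `call_agent` stands in for whatever model API you use, and `run_tests` is a placeholder - the point is what each call does and doesn't get to see:

```python
def call_agent(role: str, prompt: str) -> str:
    """Hypothetical stand-in for a model call. Each invocation starts
    from an empty context: it sees only the prompt passed here."""
    return f"<{role} output for: {prompt[:40]}>"

def run_tests(code: str, tests: str):
    """Placeholder: run the generated tests against the generated code.
    Returns a failure report string, or None if all tests pass."""
    return None

def build_and_verify(spec: str, max_attempts: int = 3) -> str:
    # Isolation: the tests come from an agent that never sees the
    # implementation, only the spec - an independent interpretation.
    tests = call_agent("test-writer", f"Write tests for this spec:\n{spec}")

    failure_report = None
    for _ in range(max_attempts):
        # Clean context per attempt: only the spec and the *current*
        # failure report, never the history of prior failed fixes.
        prompt = f"Implement this spec:\n{spec}"
        if failure_report:
            prompt += f"\nThe previous attempt failed these tests:\n{failure_report}"
        code = call_agent("coder", prompt)

        failure_report = run_tests(code, tests)
        if failure_report is None:
            return code
    raise RuntimeError("did not converge")
```

The design choice worth noting is that the failure report is the only state that survives between attempts - it summarizes *what* failed without carrying along *how* the previous attempt tried and failed to fix it.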