> The code must pass property-based tests

Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.
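For readers unfamiliar with the term: a property-based test checks invariants over many randomly generated inputs rather than a few fixed examples. A minimal hand-rolled sketch (real libraries like Hypothesis add input shrinking and smarter generation; the function names here are my own):

```python
import random

def check_sort_properties(sort_fn, trials=200):
    """Properties of a correct sort: the output is ordered, and it is a
    permutation (same multiset of elements) of the input."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        ys = sort_fn(xs)
        # Property 1: each element is <= its successor.
        assert all(a <= b for a, b in zip(ys, ys[1:])), f"not ordered: {ys}"
        # Property 2: no elements were added, dropped, or duplicated.
        assert sorted(xs) == sorted(ys), f"not a permutation: {xs} -> {ys}"
    return True

check_sort_properties(sorted)
```

The point of the thread stands, though: the properties themselves are still written by someone, and trusting them is a separate question from trusting the code.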

There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to keep evolving. That's also what happened with Anthropic's failed attempt to have agents write a C compiler (not a trivial task, but far from an exceptionally difficult one). They had thousands of good human-written tests, but the agents couldn't get the software to converge: fixing one bug only created another.

The 'who writes the tests' question is the crux of it. If the same model writes both the code and the tests, you're essentially asking it to find its own blind spots - which by definition it can't.

On the convergence issue with the C compiler - I've hit the same pattern. The root cause in my experience is context accumulation: as the agent iterates on fixes, its context fills up with the history of its own failed attempts, and each new fix is increasingly biased by that history. It ends up chasing its own tail. Two things that helped:

1. Isolating test-writing from code-writing across separate agents with no shared context, so the tests genuinely come from an independent interpretation of the spec.
2. Giving each fix attempt a clean context rather than letting the agent accumulate 20 rounds of 'I tried X and it didn't work.'

The evolvability problem is harder, though. That's less about verification and more about the model having no concept of future requirements. I don't have a good answer for that one yet.
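The isolation pattern described above can be sketched as an orchestration loop. Everything here is hypothetical scaffolding - `call_agent` stands in for whatever model API you use, and `run_tests` is a placeholder - the point is what each call does and doesn't get to see:

```python
def call_agent(role: str, prompt: str) -> str:
    """Hypothetical stand-in for a model call. Each invocation starts
    from an empty context: it sees only the prompt passed here."""
    return f"<{role} output for: {prompt[:40]}>"

def run_tests(code: str, tests: str):
    """Placeholder: run the generated tests against the generated code.
    Returns a failure report string, or None if all tests pass."""
    return None

def build_and_verify(spec: str, max_attempts: int = 3) -> str:
    # Isolation: the tests come from an agent that never sees the
    # implementation, only the spec - an independent interpretation.
    tests = call_agent("test-writer", f"Write tests for this spec:\n{spec}")

    failure_report = None
    for _ in range(max_attempts):
        # Clean context per attempt: only the spec and the *current*
        # failure report, never the history of prior failed fixes.
        prompt = f"Implement this spec:\n{spec}"
        if failure_report:
            prompt += f"\nThe previous attempt failed these tests:\n{failure_report}"
        code = call_agent("coder", prompt)

        failure_report = run_tests(code, tests)
        if failure_report is None:
            return code
    raise RuntimeError("did not converge")
```

The design choice worth noting is that the failure report is the only state that survives between attempts - it summarizes *what* failed without carrying along *how* the previous attempt tried and failed to fix it.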