This is a naïve approach, not just because it uses FizzBuzz, but because it ignores the fundamental complexity of software as a system of abstractions. Testing often involves understanding these abstractions and testing for/against them.
For those of us with decades of experience and who use coding agents for hours per-day, we learned that even with extended context engineering these models are not magically covering the testing space more than 50%.
If you asked your coding agent to develop a memory allocator, it would not also 'automatically verify' the memory allocator against all failure modes. It is your responsibility as an engineer to have long-term learning and regular contact with the world to inform the testing approach.
Exactly. The challenge isn’t getting the LLMs to make sure they validate their own code. It’s getting the LLMs to write the correct code in the first place. Adding more and more LLM-generated test code just obfuscates the LLM code even further. I have seen some really wild things where LLM jumps through hoops to get tests to pass, even when they actually should be failing because the logic is wrong.
The core of the issue is that LLMs are sycophants, they want to make the user happy above all. The most important thing is to make sure what you are asking the LLM to do is correct from the beginning. I’ve found the highest value activity is the in the planning phase.
When I have gotten good results with Claude Code, it’s because I spent a lot of time working with it to generate a detailed plan of what I wanted to build. Then by the time it got to the coding step, actually writing the code is trivial because the details have all been worked out in the plan.
It’s probably not a coincidence that when I have worked in safety critical software (DO-178), the process looks very similar. By the time you write a line of code, the requirements for that line have been so thoroughly vetted that writing the code feels like an afterthought.