For all these "open questions" you might have, it is better to ask the LLM to write a benchmark and actually see the numbers. Why rush? Spend 10 minutes and you will have a decision backed by real feedback from code execution.
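For example, such a micro-benchmark can be as small as the sketch below (assuming Python; the two candidate functions are hypothetical stand-ins for whatever alternative you are unsure about):

```python
import timeit

# Hypothetical candidates for the same task; swap in the real alternatives
# you are debating (two data structures, two library calls, etc.).
def candidate_join(parts):
    return ",".join(parts)

def candidate_concat(parts):
    out = ""
    for i, p in enumerate(parts):
        out += ("," if i else "") + p
    return out

parts = [str(i) for i in range(10_000)]

for name, fn in [("join", candidate_join), ("concat", candidate_concat)]:
    # Best of 5 repeats, 200 calls each, to smooth out noise.
    t = min(timeit.repeat(lambda: fn(parts), number=200, repeat=5))
    print(f"{name}: {t:.4f}s for 200 runs")
```

Ten minutes of this replaces a debate with a number.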

But this is just a small part of a much larger testing activity that needs to wrap the LLM code. I think my main job has moved to 1. architecting and 2. ensuring the tests are well done.

What you don't test is not reliable yet. Looking at code is not testing; it's "vibe-testing" and should be an antipattern: no LGTM for AI code. We should not rely on our intuition alone, because it is not strict enough, and it makes everything slow - we should not "walk the motorcycle".

Ok. I also have the intuition that more tests and formal specifications can help there.
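One concrete way to make a specification executable is a property-based test: a minimal sketch, assuming Python with the Hypothesis library, where `dedupe_keep_order` is a hypothetical stand-in for LLM-written code:

```python
from hypothesis import given, strategies as st

# Hypothetical LLM-written function under test.
def dedupe_keep_order(items):
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# The specification, expressed as properties that must hold for any input,
# rather than a handful of hand-picked examples.
@given(st.lists(st.integers()))
def test_no_duplicates(items):
    result = dedupe_keep_order(items)
    assert len(result) == len(set(result))

@given(st.lists(st.integers()))
def test_preserves_membership_and_order(items):
    result = dedupe_keep_order(items)
    assert set(result) == set(items)
    # Relative order of first occurrences is preserved.
    assert result == [x for i, x in enumerate(items) if x not in items[:i]]
```

Run with pytest and Hypothesis will generate hundreds of inputs, which is exactly the kind of scrutiny a quick LGTM skims over.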

So far, my biggest issue is this: when the produced code is incorrect, with a subtle bug, I feel I have wasted time prompting for something I should have written myself, because now I have to understand the code deeply in order to debug it.

If the test infrastructure is sound, then maybe there is a gain after all, even when the code is wrong.