for "does it run" cases, you can ask the model to try again, give it higher temperature, show it the traceback errors, (and maybe intermediate variables?) or even ask it to break up the problem into smaller pieces and then try to translate that.
For testing, if you use something like QuickCheck (property-based testing), you might find bugs that you wouldn't otherwise find.
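QuickCheck itself is Haskell, but the same idea exists in most languages - e.g. Python's `hypothesis`. A toy example with a round-trip property (the encoder here is just made up for illustration):

```python
# property-based test in the QuickCheck style, using the hypothesis library;
# the function under test is a toy run-length encoder
from hypothesis import given, strategies as st

def rle_encode(s: str) -> list[tuple[str, int]]:
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)

@given(st.text())
def test_roundtrip(s):
    # the property: decoding the encoding gives back the original string,
    # checked against hundreds of generated inputs, not hand-picked cases
    assert rle_decode(rle_encode(s)) == s
```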
When it comes to idiomatic code, I'm not sure - but if we're at the point where GPT is writing code that works, do we really care? As long as the code is split into many small pieces, we can just replace a piece instead of trying to understand and fix it when we can't read it. In fact, maybe there's a better language that is still human-readable but easier for transformers to write and maintain.
For "does it run" I'm not talking about how do we test that it does, but how do we either score or compare two+ options?
> When it comes to idiomatic code, I'm not sure - but if we're at the point where GPT is writing code that works, do we really care?
Yes - it's certainly preferable. You may prefer working over neat, but working and neat beats working-but-insane spaghetti code.
Remember, this is about training the models, not about using them later. How do we tell, during training, which option was better, so we can push the model towards good results?
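I don't have a real answer myself, but one crude way to turn "works" and "neat" into a single comparable number would be: fraction of tests passed minus a small complexity penalty. The 0.1 weight, the AST-node proxy for "spaghetti", and the test format are all made-up assumptions for illustration, not a claim about how anyone actually trains these models:

```python
# crude scoring sketch: reward = pass rate - small penalty for size/complexity
import ast

def run_test(code: str, call: str, expected) -> bool:
    # execute the candidate, then evaluate one test expression against it
    env: dict = {}
    try:
        exec(code, env)              # NOTE: sandbox this in practice
        return eval(call, env) == expected
    except Exception:
        return False

def complexity_penalty(code: str) -> float:
    # crude proxy for "spaghetti": count AST nodes, capped at 1.0
    try:
        n_nodes = sum(1 for _ in ast.walk(ast.parse(code)))
    except SyntaxError:
        return 1.0
    return min(n_nodes / 500.0, 1.0)

def score(code: str, tests: list[tuple[str, object]]) -> float:
    pass_rate = sum(run_test(code, call, want) for call, want in tests) / len(tests)
    return pass_rate - 0.1 * complexity_penalty(code)

# comparing two candidates is then just picking the higher score, or using the
# score difference as a preference signal when training the model
```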