Okay, thanks, I'll try that.
> have run into Claude modifying problem statements, adding axioms, etc.
Same here. I've thought about creating a utility that tells Claude it has to keep going until a test exits with zero status. But I'm concerned Claude would just fake everything to make the test pass.
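Here's a rough sketch of what I have in mind, in case it helps the discussion. It reruns a test command until it succeeds (exit status 0) and aborts if any "protected" file, such as the problem statement or the test itself, changes between attempts. Everything here is an assumption on my part: `run_until_pass`, `fingerprint`, and the `attempt` callback (a stand-in for however you hand control back to Claude) are hypothetical names, not part of any existing tool.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Callable, Dict, Sequence


def fingerprint(paths: Sequence[Path]) -> Dict[str, str]:
    """Return a SHA-256 digest per protected file; a deleted file counts as a change."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest() if p.exists() else "<missing>"
        for p in paths
    }


def run_until_pass(
    test_cmd: Sequence[str],
    protected: Sequence[Path],
    attempt: Callable[[int], None],
    max_attempts: int = 5,
) -> bool:
    """Call `attempt`, then run `test_cmd`, until the test exits with status 0.

    `attempt` is a hypothetical hook where the model gets another try.
    Raises RuntimeError if a protected file was touched during an attempt.
    """
    baseline = fingerprint(protected)
    for i in range(1, max_attempts + 1):
        attempt(i)
        # Refuse to trust the test result if the spec or tests were edited.
        if fingerprint(protected) != baseline:
            raise RuntimeError("protected file was modified during attempt %d" % i)
        if subprocess.run(list(test_cmd)).returncode == 0:
            return True
    return False
```

This only catches edits to the files you list as protected. It wouldn't stop other ways of gaming the test, like adding an axiom in a new file, so the check itself still has to be strict.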