Probably not. Everyone will still need a lot of reasoning tokens and tool calls. Running the tests on every round is tedious, but it has to be done.