That’s a cool paper, but it seems like it produces better debaters, not better content? To truly use RL’s strengths, it would need to be a battle of content (model or world representation), not mere token-level battles.
I’m not sure how that would work at the prediction stage, as language isn’t the problem here.
I think the hypothesis is that "debating" via the right adversarial word game may naturally select for better reasoning skills. There's some evidence for that in the paper: the training (monotonically!) improves the model's performance on seemingly unrelated reasoning benchmarks like ARC. Which is mysterious! But yeah, it's much too early to tell, though IIRC the results have already been replicated, so that's something.
(by the way, I don't think "debating" is the right term for the SPAG game - it's quite subtle, and it isn't about arguing for a point, or rhetoric, or anything like that)