No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And probably use claude data too, with human in the loop and tool feedback.