> By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss.
Isn't this how the Decision Transformer works? I don't see it in the references, so I'll be curious to compare the papers in more depth.
https://arxiv.org/abs/2106.01345
> By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return.
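To make the comparison concrete, here's a toy sketch of how I understand the two framings side by side (my own illustration, not code from either paper; the module names, dimensions, and the binary 0/1 reward are all assumptions): DT conditions action prediction on a desired return, while the quoted RLVR reformulation trains a scorer to predict the outcome reward with cross-entropy.

```python
# Toy contrast of the two framings (my own sketch, not from either paper).
import torch
import torch.nn as nn

# (1) DT-style: predict actions conditioned on the desired return-to-go.
#     A small MLP stands in for the causal transformer over (return, state, action) tokens.
class ReturnConditionedPolicy(nn.Module):
    def __init__(self, state_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, act_dim)
        )

    def forward(self, state, target_return):
        # Action logits given the state and the return we want to achieve.
        return self.net(torch.cat([state, target_return], dim=-1))

# (2) Outcome-reward-as-label: a score function over a candidate response,
#     trained with cross-entropy against the verifiable (assumed binary) reward.
class OutcomeScorer(nn.Module):
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, features):
        # Logit for P(reward = 1) of the candidate response.
        return self.net(features).squeeze(-1)

# Toy training step for each framing.
state, target_return = torch.randn(8, 16), torch.ones(8, 1)
expert_action = torch.randint(0, 4, (8,))
dt_loss = nn.functional.cross_entropy(
    ReturnConditionedPolicy(16, 4)(state, target_return), expert_action
)

features, reward = torch.randn(8, 32), torch.randint(0, 2, (8,)).float()
rlvr_loss = nn.functional.binary_cross_entropy_with_logits(
    OutcomeScorer(32)(features), reward
)
```

So at this toy level the difference seems to be what sits on the label side: DT supervises actions given a target return, while the quoted reformulation supervises the reward itself.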
It has crossed my mind that I haven't seen DT brought up much lately; it seemed really interesting when it was first published, but I haven't read much follow-up work.