> The difficulty of using RL more generally to promote reasoning is that in the general case it's hard to define correctness and therefore quantify a reward for the RL training to use.

Ah, hence the "HF" angle.

RLHF really has a different goal - it's not about rewarding/encouraging reasoning, but rather rewarding outputs that match human preferences, whatever the reason (responses that are more on-point, politer, longer form, and so on).

The way RLHF works is that a relatively small set of A/B preference judgements from actual humans is used to train a preference model, and that preference model is then used to generate the RL rewards for the actual RLHF training.
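
If it helps, here's a minimal sketch of that middle step (the names, dimensions and loss are illustrative, not anyone's actual training code): the preference model is just a scorer trained so that the human-preferred response of each A/B pair gets the higher score, and those scores then become the reward signal for the RL stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceModel(nn.Module):
    """Toy reward head: maps a response embedding to a scalar score."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the score of the human-preferred
    # response (A) above the score of the rejected one (B).
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage: random embeddings stand in for encoded A/B responses.
model = PreferenceModel()
emb_chosen = torch.randn(4, 768)    # embeddings of preferred responses
emb_rejected = torch.randn(4, 768)  # embeddings of rejected responses
loss = preference_loss(model(emb_chosen), model(emb_rejected))
loss.backward()
```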

RLHF has been around for a while and is what tamed base models like GPT-3 into the GPT-3.5 used for the initial ChatGPT, making it behave in a more acceptable way!

RLVR (RL with verifiable rewards) is much more recent, and is the basis of the models that do great at math and programming. If people talk about reasoning models being RL trained, RLVR is normally what's implied, but there's a recent trend of saying RLVR to be more explicit.
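
To make the contrast concrete, here's a toy example of the "verifiable" part (assuming a dataset of math problems with known final answers - nothing here is from a real pipeline): the reward comes from a program that checks the answer, so there's no learned preference model in the loop.

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the last number in the output matches the known answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == reference_answer else 0.0

print(math_reward("... so the total is 42.", "42"))    # 1.0
print(math_reward("I think the answer is 41.", "42"))  # 0.0
```

For code tasks the equivalent check is running the generated program against unit tests; either way the reward is computed, not learned.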