I thought I'd read a lot of these threads this year, and also discussed off-site the use of coding agents and the technology behind them; but this is genuinely the first time I've seen the term "RLVR".

RLVR "reinforcement learning for verifiable rewards" refers to RL used to encourage reasoning towards achieving long-horizon goals in areas such as math and programming, where the correctness/desirability of a generated response (or perhaps an individual reasoning step) can be verified in some way. For example generated code can be verified by compiling and running it, or math results verified by comparing to known correct results.

The difficulty of using RL more generally to promote reasoning is that in the general case it's hard to define correctness and therefore quantify a reward for the RL training to use.

> The difficulty of using RL more generally to promote reasoning is that in the general case it's hard to define correctness and therefore quantify a reward for the RL training to use.

Ah, hence the "HF" angle.

RLHF really has a different goal - it's not about rewarding/encouraging reasoning, but rather rewarding outputs that match human preferences for whatever reason (responses that are more on-point, or politer, or longer form, etc, etc).

The way RLHF works is that a smallish amount of feedback data of A/B preferences from actual humans is used to train a preference model, and this preference model is then used to generate RL rewards for the actual RLHF training.

RLHF has been around for a while and is what tamed base models like GPT 3 into GPT 3.5 that was used for the initial ChatGPT, making it behave in more of an acceptable way!

RLVR is much more recent, the basis of the models that do great at math and programming. If you talk about reasoning models being RL trained then it's normally going to imply RLVR, but it seems there's a recent trend of people calling it RLVR to be more explcit.

> generated code can be verified by compiling and running it

I think this gets to the crux of the issue with LLMs for coding (and indeed 'test orientated development'). For anything beyond a most basic level of complexity (i.e. anything actually useful), code cannot be verified by compiling and running it. It can only be verified - to a point - by skilled human inspection/comprehension. That is the essence of code really, a definition of action, given by humans, to a machine for running with /a prior/ unenumerated inputs. Otherwise it is just a fancy lookup table. By definition then not all inputs and expected outputs can be tabulated, tested for, or rewarded for.

I was talking about the RL training process for giving these models coding ability in the first place.

As far as using the trained model to generate code, then of course it's up to the developer to do code reviews, testing, etc as normal, although of course an LLM can be used to assist writing test cases etc as well.