A cheap DIY way of achieving the same thing as RLHF is to fine-tune the model to append a score to its output every time.

Remember: the reason we need RLHF at all is that we cannot write a loss function for what makes a good answer. There are just too many forms a good answer can take, and "goodness" cannot be computed from next-token probability alone.

So you start by having your vanilla model generate n completions for your prompt. You then manually score them. And then those prompt => (completion, score) pairs become your training set.

Once the model is trained, you may find that you can cheat:

Because if you include the desired score in your prompt, the model will now strive to produce an answer that is consistent with that score.
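Roughly, in (hypothetical) code, assuming you already have some way to sample completions from the vanilla model and to rate them by hand:

```python
# Minimal sketch of the data flow. `sample_fn` and `score_fn` stand in for
# "sample from the vanilla model" and "score by hand"; they are placeholders,
# not a real API.

def build_training_examples(prompts, sample_fn, score_fn, n=4):
    examples = []
    for prompt in prompts:
        for completion in sample_fn(prompt, n):   # n completions per prompt
            score = score_fn(prompt, completion)  # manual rating, e.g. 1-10
            # Fine-tuning target: the completion with its score appended.
            examples.append(f"{prompt}\n{completion}\nSCORE: {score}")
    return examples

def conditioned_prompt(prompt, desired_score=10):
    # The "cheat": state the score you want up front, so the fine-tuned
    # model tries to produce an answer consistent with it.
    return f"{prompt}\nSCORE: {desired_score}\n"
```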

> if you include the desired score in your prompt, the model will now strive to produce an answer that is consistent with that score

But you need one model that generates the score from the answer, and then you fine-tune another model that generates the answer conditioned on the score. In the first case the score comes at the end, in the second at the beginning. That's how the Decision Transformer works too: it constructs a sequence of (return-to-go, state, action) triples, where the return placed before each action conditions which action gets generated.

https://arxiv.org/pdf/2106.01345
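Something like this (all names made up, just to show the ordering flip):

```python
# Model 1: learns to emit the score after the answer.
def scorer_example(prompt, completion, score):
    return f"{prompt}\n{completion}\nSCORE: {score}"

# Model 2: sees the score before the answer, so the score conditions it.
def generator_example(prompt, completion, score):
    return f"{prompt}\nSCORE: {score}\n{completion}"

# Decision Transformer does the analogous thing over trajectories: it flattens
# (return-to-go, state, action) triples into one sequence, and the return token
# placed before each action is what lets you ask for a high-return action at
# inference time.
def decision_transformer_sequence(episode):
    tokens = []
    for return_to_go, state, action in episode:
        tokens += [return_to_go, state, action]
    return tokens
```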

By the same logic you could generate tags, including style, author, venue and date. Some would be extracted from the source document, the others produced with classifiers. Then you flip the order and fine-tune a model that takes the tags before the text. Now you have an LLM you can condition on author and style.
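For example (the tag set and the classifier call are just placeholders for illustration):

```python
# Sketch of tag-conditioned training data; `style_classifier` is an assumed
# helper, not part of any described pipeline.

def tagged_example(doc, style_classifier):
    tags = {
        "author": doc.get("author", "unknown"),  # extracted from the source document
        "venue": doc.get("venue", "unknown"),
        "date": doc.get("date", "unknown"),
        "style": style_classifier(doc["text"]),  # produced by a classifier
    }
    header = " ".join(f"[{k}={v}]" for k, v in tags.items())
    # Tags come before the text, so at inference you fill in the header with
    # the author/style you want and let the model complete the rest.
    return f"{header}\n{doc['text']}"
```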

I had a similar idea for a model that lets you parameterize a performance vs. accuracy trade-off, essentially an imbalanced MoE-like approach where, instead of the "quality score" in your example, you assign a score based on how much computation was used to reach that answer. Then you can dynamically request that different code paths be taken at inference time.
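Something like this, with made-up compute buckets standing in for the quality score:

```python
# Rough sketch of the compute-conditioning variant; the FLOP thresholds and
# bucket names are invented for illustration.

def compute_tagged_example(prompt, completion, flops_used):
    # Replace the hand-assigned quality score with a coarse compute bucket.
    if flops_used < 1e9:
        budget = "LOW_COMPUTE"
    elif flops_used < 1e11:
        budget = "MID_COMPUTE"
    else:
        budget = "HIGH_COMPUTE"
    # Put the bucket in front so it can be requested at inference time.
    return f"[{budget}] {prompt}\n{completion}"
```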

That works in the same way as an actor-critic pair, right? Just all wrapped up in the same network/output?

Not the same; it will get you worse output and is harder to do right in practice.
