I think people are misunderstanding reward functions and LLMs.
LLMs don't actually have a built-in reward system the way, say, a classic RL agent does.
They are trained with one, and under DPO you can even say the trained model defines an implicit one.
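To make the "implicit reward" point concrete: in the DPO paper's framing, the policy and a frozen reference model together define a reward of the form r(x, y) = β·(log π_θ(y|x) − log π_ref(y|x)), up to a prompt-only term that cancels when comparing two completions. Here's a minimal sketch; the log-probabilities and β value are made-up illustrative numbers, not from any real model.

```python
# Sketch of DPO's implicit reward: r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
# The prompt-only partition-function term cancels when comparing two completions,
# so it is omitted here. All numbers below are hypothetical.

def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit DPO reward from sequence log-probs under the policy and reference."""
    return beta * (logp_policy - logp_ref)

# Hypothetical sequence log-probs for a preferred and a rejected completion:
r_chosen = implicit_reward(logp_policy=-12.0, logp_ref=-15.0)    # ~ 0.3
r_rejected = implicit_reward(logp_policy=-18.0, logp_ref=-14.0)  # ~ -0.4

# DPO training pushes the margin r_chosen - r_rejected to grow, which is why
# one can read a reward function off the trained policy after the fact.
print(r_chosen, r_rejected)
```

So there is no reward module sitting inside the network; the "reward" is something you can compute from the policy's log-probs relative to the reference.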