> In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent to human preferences.

https://en.m.wikipedia.org/wiki/Reinforcement_learning_from_...

Note that human preference isn't universal. RLHF is mostly frowned upon by the open-source LLM community, since it typically means aligning the model to the preferences of corporate managers, i.e. tuning for censorship and political correctness to make the model as bland as possible so the parent company doesn't get sued.

For actual reinforcement learning with a feedback loop aimed at increasing overall performance, the current techniques are SPPO and Meta's variant of it [0], which slightly outperforms it. It involves using a larger LLM as a judge, though, so the accuracy of the results is somewhat dubious.
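Roughly, the judge-in-the-loop idea behind these self-play methods looks like the toy sketch below. This is not SPPO's or Meta's actual training code; `generate`, `judge_score`, and `preference_update` are hypothetical stubs standing in for the policy model, the judge model, and the preference-optimization step.

```python
# Toy sketch of a self-play preference loop with an LLM judge.
# All three helper functions are placeholder stubs, not a real library's API.
import random
from typing import List


def generate(policy, prompt: str, n: int = 4) -> List[str]:
    """Sample n candidate responses from the current policy (stub)."""
    return [f"{prompt} -> candidate {i}" for i in range(n)]


def judge_score(judge, prompt: str, response: str) -> float:
    """Ask the (larger) judge LLM to rate a response (stub: random score)."""
    return random.random()


def preference_update(policy, prompt: str, chosen: str, rejected: str) -> None:
    """One optimization step nudging the policy toward `chosen` (stub)."""
    pass


def self_play_round(policy, judge, prompts: List[str]) -> None:
    for prompt in prompts:
        candidates = generate(policy, prompt)
        ranked = sorted(candidates, key=lambda r: judge_score(judge, prompt, r))
        rejected, chosen = ranked[0], ranked[-1]  # worst vs. best per the judge
        # Any bias or noise in judge_score flows straight into training,
        # which is the accuracy concern mentioned above.
        preference_update(policy, prompt, chosen, rejected)


if __name__ == "__main__":
    self_play_round(policy=None, judge=None, prompts=["Explain RLHF briefly."])
```

The key point the sketch illustrates is that the "reward" here is whatever the judge model says it is, so the whole loop is only as good as the judge.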

[0] https://arxiv.org/pdf/2407.19594