Hacker News

I get the point of the article, but I think it sets up a bit of a strawman to get that point across.

Yes, RLHF is barely RL, but you wouldn't use human feedback to drive a Go game unless there was no better alternative. In RL, finding a good reward function is the name of the game; once you have one, you have no reason to prefer human feedback, especially if it is demonstrably worse. So no, nobody would actually "prefer RLHF over RL" given the choice.

But for language models, human feedback IS the ground truth (at least until we find a better, more mathematical alternative). If it weren't, and we had something better, we'd use that. But we don't. So no, RLHF is not "worse than RL" in this case, because there IS no OTHER RL in this case; here, RLHF actually is RL.