Hacker News

Let's say, hypothetically, we do enough RLHF that a model can imitate humans at the highest level, meaning its average output is about as good as a professional researcher's. Then we do more RLHF.

Maybe, by chance, the model produces an output that is a little better than its average; that is, better than professional researchers. This will be ranked favorably in RLHF, so training nudges the model toward outputs like it.

Repeat this process and the model slowly but surely surpasses the best humans.
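
To make the loop concrete, here is a toy sketch in Python. It is not how RLHF is actually implemented; it just models "quality" as a single number and assumes the ranker can always tell which outputs are truly better, even above human level. The function name, the parameters, and the Gaussian noise are all made up for illustration.

```python
import random

def simulate_selection(rounds=50, samples=16, top_k=4, step=0.5, noise=1.0, seed=0):
    """Toy model of the loop: sample outputs, favor the best-ranked, shift toward them."""
    rng = random.Random(seed)
    # 0.0 stands for "professional researcher level" on an arbitrary scale.
    mean_quality = 0.0
    history = [mean_quality]
    for _ in range(rounds):
        # The model's outputs scatter around its current average quality.
        outputs = [rng.gauss(mean_quality, noise) for _ in range(samples)]
        # Assumption: the ranker orders outputs by true quality without error.
        best = sorted(outputs, reverse=True)[:top_k]
        target = sum(best) / top_k
        # Move the average part of the way toward the favored outputs.
        mean_quality += step * (target - mean_quality)
        history.append(mean_quality)
    return history

if __name__ == "__main__":
    history = simulate_selection()
    print(f"start: {history[0]:.2f}, end: {history[-1]:.2f}")
```

In this toy version the average keeps climbing, but only because the perfect-ranker assumption is baked in. Whether human raters can reliably rank outputs that are better than their own work is exactly what the sketch leaves open.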

Is such a scenario possible in practice?