This is partly why we see LLMs "plateauing" on benchmarks. In the LMSYS Arena, for example, LLMs are judged simply on whether the user liked the answer. Truth is secondary in that process, as are many other qualities that humans may not be very good at evaluating. There is a limit to how much capability and value you can get by having LLMs chase RLHF as a reward function. As Karpathy says here, you could even argue it is counterproductive to build a system around human opinion, especially if we want the system to surpass us.
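For concreteness, here is a minimal sketch of the pairwise (Bradley-Terry style) loss that typical RLHF pipelines use to fit a reward model to human preference votes; the names are illustrative, but the point is that the only signal being optimized is which answer a human happened to prefer:

    import torch
    import torch.nn.functional as F

    def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # r_chosen / r_rejected: scalar reward-model scores for the answer the
        # annotator preferred and the one they rejected.
        # Minimizing this only teaches the reward model to rank answers the way
        # human voters did; truthfulness matters only insofar as voters noticed it.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

So the reward the policy is later trained to maximize is, by construction, capped at "what humans tend to like".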

RLHF really isn't the problem as far as surpassing human capability goes: language models trained to mimic human responses are fundamentally not going to do anything other than mimic human responses, regardless of how you fine-tune them toward the specific kinds of responses you do or don't like.

If you want to exceed human intelligence, then design architectures for intelligence, not for copying humans!