While I agree with Karpathy, and I also had a "wut? They call this RL?" reaction when RLHF was presented as a method for training ChatGPT, I'm a bit surprised by the insight he draws, because the same method and insight came out of "Learning from Human Preferences" [1] from none other than OpenAI, published in 2017.

Sometimes judging a "good enough" policy is orders of magnitude easier than formulating an exact reward function, but this is very much domain- and scope-dependent. Trying to estimate a reward function in those situations can often be counterproductive, because the reward might even point your search in the wrong direction. The authors of "The Myth of the Objective" [2] made the same observation with their Picbreeder example (and those authors also happen to work at OpenAI).
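
To make the preference-learning idea concrete: instead of hand-coding a reward, you fit a reward model to pairwise human judgments about which of two behaviours is better. Below is a minimal Bradley-Terry-style sketch in PyTorch; it is not the actual setup from [1], and the RewardModel architecture and the random embeddings standing in for trajectories are placeholders of my own.

    import torch
    import torch.nn as nn

    # Hypothetical reward model: maps a trajectory/response embedding to a scalar score.
    class RewardModel(nn.Module):
        def __init__(self, dim: int = 128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x).squeeze(-1)

    def preference_loss(reward_model, preferred, rejected):
        # Bradley-Terry loss: push the score of the human-preferred sample
        # above the rejected one, without ever writing down an explicit reward.
        r_pos = reward_model(preferred)
        r_neg = reward_model(rejected)
        return -torch.nn.functional.logsigmoid(r_pos - r_neg).mean()

    # Toy usage with random embeddings standing in for trajectory features.
    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    preferred, rejected = torch.randn(32, 128), torch.randn(32, 128)
    loss = preference_loss(model, preferred, rejected)
    opt.zero_grad(); loss.backward(); opt.step()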

When you have a well-defined reward function, no local suboptima, and no cost to rolling out faulty policies, RL works remarkably well. (Alex Irpan described this well in his widely cited blog post [3].)
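
For contrast with the hard cases below, here is roughly what that easy regime looks like: a bare-bones REINFORCE loop on CartPole, where the reward is dense and well-defined and rollouts are essentially free. The environment choice, network, and hyperparameters are arbitrary illustrations of mine, not anything from [3].

    import torch
    import torch.nn as nn
    import gymnasium as gym

    # Minimal REINFORCE sketch on CartPole-v1: dense, well-defined reward and
    # essentially free rollouts, i.e. the regime where plain RL tends to work.
    env = gym.make("CartPole-v1")
    policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

    for episode in range(500):
        obs, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
        # Discounted returns, normalized to reduce gradient variance.
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + 0.99 * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        loss = -(torch.stack(log_probs) * returns).sum()
        opt.zero_grad(); loss.backward(); opt.step()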

The problem is that these are pretty hard requirements to meet for most problems that interact with the real world (as opposed to the internet, an artificial world). Either the local suboptima get in the way (LLMs and text), or the rollout cost does (running a Go game a billion times just to beat humans is currently not a feasible requirement for a lot of real-world applications).

Tangentially, this is also why I suspect LLMs for planning (and understanding the world) in the real world have been lacking. Robotics Transformer and SayCan approaches are cool, but if you look past the fancy demos the performance is indeed lackluster.

It will be interesting to see how these observations, and Karpathy's, hold up against the current humanoid-robot hype, which IMO is partially fueled by a misunderstanding of what LLMs can do, including the point Karpathy makes. (Shameless plug: [4])

[1] https://openai.com/index/learning-from-human-preferences/

[2] https://www.lesswrong.com/posts/pi4owuC7Rdab7uWWR/book-revie...

[3] https://www.alexirpan.com/2018/02/14/rl-hard.html

[4] https://harimus.github.io//2024/05/31/motortask.html