Karpathy is _much_ more knowledgeable about this than I am, but I feel like this post is missing something.
Go is a game that is fundamentally too complex for humans to solve; we've known this since long before AlphaGo. Since humans were not perfect Go players, we didn't use them to teach the model: we wanted the model to be able to beat humans.
I don't see language as comparable. The "perfect" LLM imitates humans perfectly, presumably to the point where you can't tell the difference between LLM-generated text and human-generated text. Maybe it's just as flexible as the human mind too, able to context-switch quickly and swap between formalities, tones, and slang. But the concept of "beating" a human doesn't really make much sense.
AlphaGo and Stockfish can push forward our understanding of their respective games, but an LLM can't push forward the boundaries of language, because it's fundamentally a copycat model. This makes RLHF make much more sense in the LLM realm than in the Go realm.
One of the problems lies in the way RLHF is often performed: presenting a human with several different responses and having them choose one. The goal here is to create the most human-like output, but the process instead creates the outputs humans like the most, which can seriously limit the model. For example, most recent diffusion-based image generators use the same process to improve their outputs, relying on volunteers to select which outputs are preferable. This has led to models that are comically incapable of generating ugly or average people, because the volunteers systematically rate those outputs lower.
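For concreteness, here is a minimal sketch of how that "pick one of several responses" step is commonly turned into a reward model (a Bradley-Terry style pairwise loss). Everything here is illustrative, not any lab's actual pipeline: the "embeddings" stand in for a real LLM backbone.

```python
# Toy preference-based reward modeling: learn to score the response the
# human picked above the one they passed over. Names/shapes are illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Stand-in for an LLM backbone that scores a response embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, response_embedding):
        return self.score(response_embedding).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Hypothetical batch: embeddings of the chosen and the rejected response.
chosen = torch.randn(32, 128)
rejected = torch.randn(32, 128)

# Pairwise loss: push the chosen response's score above the rejected one's.
# Note what this optimizes: "what raters prefer", not "what humans would
# actually write" -- exactly the gap described above.
loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```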
The distinction is that LLMs are not used for what they are trained for in this case. In the vast majority of cases, someone using an LLM is not interested in what some mixture of OpenAI employees' ratings and the average person would say about a topic; they are interested in the correct answer.
When I ask ChatGPT for code, I don't want it to imitate humans; I want it to be better than humans. My reward function should then be code that actually works, not code that merely resembles what humans write.
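As an illustration of what "reward code that actually works" could look like, here is a toy functional-correctness reward: score a generated program by the fraction of test cases it passes. The `solution` name and test cases are hypothetical, and a real system would sandbox the execution.

```python
# Correctness-based reward for generated code, as opposed to a
# similarity-to-human reward. Purely a sketch.
def reward_from_tests(candidate_source: str, test_cases) -> float:
    """Return the fraction of test cases the generated code passes."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # only ever do this in a sandbox
    except Exception:
        return 0.0  # code that doesn't even run gets zero reward
    fn = namespace.get("solution")
    if fn is None:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(test_cases)

# Example: reward a candidate implementation of integer addition.
candidate = "def solution(a, b):\n    return a + b"
print(reward_from_tests(candidate, [((1, 2), 3), ((0, 0), 0)]))  # 1.0
```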
I don't think it is true that the perfect LLM emulates a human perfectly. LLMs are language models whose purpose is to entertain and solve problems. Yes, they do that by imitating human text at first, but that's merely a shortcut to enable them to perform well. Making money by maximizing that goal (entertaining and solving problems) will eventually entail self-training to perform superhumanly on specific tasks. This seems clearly possible for math and coding; which approaches will work for other domains remains an open question.
In a sense, GPT-4 is self-training already, in that it's bringing in money for OpenAI that is being spent on training further iterations. (This is a joke.)
This is a great comment. Another important distinction, I think, is that in the AlphaGo case there's no equivalent to the generalized next-token-prediction pretraining that happens for LLMs (at least I don't think so; this is the part I'm not sure about). For LLMs, RLHF teaches the model to be conversational, but the model has already learned language and how to talk like a human from the next-token-prediction pretraining.
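For reference, the pretraining objective being contrasted with RLHF here is just next-token cross-entropy over human text, with no notion of "preferred" answers. A toy sketch (shapes and vocabulary are arbitrary, and the logits would come from a real model):

```python
# Next-token-prediction pretraining objective, in miniature.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 16, 4
logits = torch.randn(batch, seq_len, vocab_size)             # stand-in model outputs
tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))  # stand-in human text

# Position t is scored on how well it predicts token t+1: plain cross-entropy.
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
```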
Let's say, hypothetically, we do enough RLHF that a model can imitate humans at the highest level. Like, the level of professional researchers on average. Then we do more RLHF.
Maybe, by chance, the model produces an output that is a little better than its average; that is, better than professional researchers. This will be ranked favorably in RLHF.
Repeat this process and the model slowly but surely surpasses the best humans.
Is such a scenario possible in practice?
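Here is a toy simulation of that loop, under the strong (and debatable) assumption that raters can reliably recognize outputs that are better than expert level; the numbers are arbitrary.

```python
# Iterated selection: sample a few outputs, keep the one the rater prefers,
# nudge the model toward it, repeat. The mean quality drifts upward.
import random

quality_mean, quality_std = 0.0, 1.0   # 0.0 = "professional researcher" level
for round_ in range(10):
    samples = [random.gauss(quality_mean, quality_std) for _ in range(4)]
    best = max(samples)                           # the response the rater prefers
    quality_mean += 0.5 * (best - quality_mean)   # shift the model toward it
    print(f"round {round_}: mean quality {quality_mean:.2f}")
```

The selection pressure alone pushes the distribution past its starting point; whether human raters can actually supply that signal once outputs exceed expert level is exactly the open question here.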