> But RL algorithms do implement things like curiosity to drive exploration??
I hadn't read that paper, but yes, using prediction failure as a learning signal (and attention mechanism), the same as we do, is what I had in mind. It seems that to be useful it needs to be combined with online learning ability, so that having explored, one's predictions are better the next time.
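To make the idea concrete, here's a minimal sketch of curiosity-driven exploration in the usual RL framing: the intrinsic reward is the error of a learned forward model, and updating that model online is what makes the bonus fade once a region has been explored. The linear model, `beta`, and the learning rate are purely illustrative, not taken from any particular paper:

```python
import numpy as np

# Illustrative only: intrinsic reward = prediction error of a learned forward model.
# The agent is rewarded for visiting transitions its own model predicts badly, which
# drives exploration; updating the model online makes the bonus shrink once learned.

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2

# Toy linear forward model: predicts next_state from [state, action].
W = np.zeros((state_dim, state_dim + action_dim))
lr, beta = 0.1, 0.5  # beta scales the curiosity bonus (assumed hyperparameter)

def intrinsic_reward(state, action, next_state):
    x = np.concatenate([state, action])
    pred = W @ x
    return beta * float(np.mean((pred - next_state) ** 2))

def update_forward_model(state, action, next_state):
    global W
    x = np.concatenate([state, action])
    err = (W @ x) - next_state
    W -= lr * np.outer(err, x)  # gradient step on squared prediction error

# One fake transition: the curiosity bonus shrinks as the model learns it.
s, a, s_next = rng.normal(size=state_dim), rng.normal(size=action_dim), rng.normal(size=state_dim)
for step in range(5):
    print(f"step {step}: curiosity bonus = {intrinsic_reward(s, a, s_next):.4f}")
    update_forward_model(s, a, s_next)
```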
It's easy to imagine LLMs being extended in all sorts of ad-hoc ways, including external prompting/scaffolding such as "think step by step" and tree search, which help mitigate some of the architectural shortcomings. But I think online learning is going to be tough to add in this way, and it also seems that using the model's own output as a substitute for working memory isn't sufficient to support long-term focus and reasoning. You can try to script intelligence by putting the long-term focus and tree search into an agent, but I think that will only get you so far. At the end of the day a pre-trained transformer really is just a fancy sentence completion engine, and while it's informative how much "reactive intelligence" emerges from this type of frozen prediction, it seems the architecture has been stretched about as far as it will go.
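A rough picture of what that kind of scaffolding amounts to, as a minimal sketch: the model's own earlier output is fed back into the prompt as a stand-in for working memory, while the weights never change. `generate` here is a hypothetical placeholder for whatever completion API is being called, not a real library function:

```python
# Illustrative scaffolding sketch: the "memory" is just text the model produced
# earlier, re-inserted into the prompt. The model's weights are frozen, so nothing
# is learned between steps -- the loop only scripts focus from the outside.

def generate(prompt: str) -> str:
    """Stand-in for a frozen LLM call (hypothetical)."""
    return f"(model output for: {prompt[-40:]!r})"

def agent_loop(task: str, max_steps: int = 3) -> str:
    scratchpad = ""  # the model's own output, used as ersatz working memory
    for step in range(max_steps):
        prompt = (
            f"Task: {task}\n"
            f"Notes so far:\n{scratchpad}\n"
            "Think step by step, then state your next action."
        )
        thought = generate(prompt)
        scratchpad += f"\nStep {step}: {thought}"
    return scratchpad

print(agent_loop("prove the statement"))
```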
I wasn't saying that DeepMind has abandoned RL in favor of LLMs, just that they are using RL in narrower applications than AGI. David Silver, at least, still seemed to think that "Reward is enough" [for AGI] as of a few years ago, although I think most people disagree.
Hmm, well, the reason a pre-trained transformer is a fancy sentence completion engine is that that's what it is trained on: cross-entropy loss on next-token prediction. As I say, if you train an LLM to do math proofs, it learns to solve 4 out of the 6 IMO problems. I feel like you're not appreciating how impressive that is. And that is only possible because of the RL aspect of the system.
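For reference, "cross-entropy loss on next-token prediction" is nothing more exotic than the following; a minimal PyTorch sketch with made-up shapes, just to show the objective a GPT-style model is pre-trained on:

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 sequences, length 8, vocabulary of 100 tokens.
batch, seq_len, vocab = 2, 8, 100
tokens = torch.randint(0, vocab, (batch, seq_len))

# Pretend these are the model's output logits at each position.
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)

# Next-token prediction: the logits at position t are scored against token t+1.
pred = logits[:, :-1, :].reshape(-1, vocab)   # predictions for positions 0..T-2
target = tokens[:, 1:].reshape(-1)            # the tokens that actually came next

loss = F.cross_entropy(pred, target)          # the entire pre-training objective
loss.backward()
print(loss.item())
```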
To be clear, I'm not claiming that you can take an LLM, do some RL on it, and suddenly it can do particular tasks. I'm saying that if you train it from scratch using RL it will be able to do certain well-defined formal tasks.
Idk what you mean about the online learning ability, tbh. The paper uses it in exactly the way you specify: it uses RL to play Montezuma's Revenge and gets better on the fly.
Similar to my point about the inference-time RL ability of the AlphaProof LLM. That's why I emphasized that the RL is done at inference time: each proof it does, it uses to make itself better for the next one.
I think you are taking LLM to mean GPT-style models, and I am taking LLM to mean transformers that output text, which can be trained to do any variety of things.
A transformer, regardless of what it is trained to do, is just a pass-through architecture consisting of a fixed number of layers, with no feedback paths and no memory from one input to the next. Most of its limitations (wrt AGI) stem from the architecture. How you train it, and on what, can't change that.
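A minimal sketch of what I mean by pass-through (purely illustrative toy code, not any real model; the stand-in layers do nothing): each call runs the input once through the same fixed stack of layers and returns, keeping no state between calls:

```python
# Illustrative only: the point is the control flow, not the layer math.
# A fixed stack of layers, applied once per input, with no state carried over.

class ToyTransformer:
    def __init__(self, num_layers: int = 12):
        # Fixed depth, decided at construction time; it never grows or loops.
        self.layers = [lambda h: h for _ in range(num_layers)]  # stand-in layers

    def forward(self, tokens):
        h = tokens
        for layer in self.layers:   # one straight pass, no feedback paths
            h = layer(h)
        return h                    # nothing is remembered after returning

model = ToyTransformer()
out1 = model.forward([1, 2, 3])
out2 = model.forward([4, 5, 6])     # completely independent of the first call
```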
Narrow skills like playing chess (Deep Blue), Go, or doing math proofs are impressive in some sense, but not the same as generality and/or intelligence, which are the hallmarks of AGI. Note that AlphaProof, as the name suggests, has more in common with AlphaGo and AlphaFold than with a plain transformer. It's a hybrid neuro-symbolic approach where the real power comes from the search/verification component. Sure, RL can do some impressive things when the right problem presents itself, but it's not a silver bullet for all machine learning problems, and few outside of David Silver think it's going to be the/a way to achieve AGI.
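To illustrate the division of labour I mean (purely schematic; `propose_candidates` and `verify` are hypothetical stand-ins, not AlphaProof's actual interface): the neural model only suggests candidate steps, while a symbolic verifier inside a search loop decides what actually holds:

```python
# Schematic neuro-symbolic search: a model proposes, a symbolic checker disposes.
# Everything here is a stand-in; it only shows where the search/verification sits.

def propose_candidates(state: str) -> list[str]:
    """Hypothetical neural step: suggest possible next proof steps for a goal."""
    return [f"{state}+lemma_a", f"{state}+lemma_b"]

def verify(state: str) -> bool:
    """Hypothetical symbolic check: does this sequence of steps close the goal?"""
    return state.endswith("lemma_b+lemma_a")

def search(goal: str, depth: int = 2):
    frontier = [goal]
    for _ in range(depth):
        next_frontier = []
        for state in frontier:
            for cand in propose_candidates(state):
                if verify(cand):
                    return cand           # a verified proof path
                next_frontier.append(cand)
        frontier = next_frontier
    return None

print(search("goal"))
```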
I agree with you that transformers are probably not the architecture of choice. Not sure what that has to do with the viability of RL, though.