In terms of energy, it's provably impossible for a next-word predictor that spends a constant amount of energy per token to come up with anything that's not in its training: the model does the same fixed computation for every token, so it can't allocate more compute to harder problems. (I think Yann LeCun came up with this argument?)
It seems to me RL was quite revolutionary (especially for protein folding and AlphaGo), but using a minimal form of it to solve a training (not prediction) problem seems rather like bringing a bazooka to a banana fight.
Using explore/exploit methods to search potential problem spaces might really propel this space forward. But the energy requirements don't favor the incumbents, given how things are currently scaled around the classic LLM format.
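To be concrete about what I mean by "explore/exploit", here's a minimal epsilon-greedy sketch in Python. The candidate set and reward function are toy stand-ins I'm assuming for illustration, not anything tied to an actual LLM training pipeline:

```python
import random

def epsilon_greedy_search(candidates, reward_fn, steps=1000, epsilon=0.1):
    """Balance exploring new candidates against exploiting the best one so far."""
    estimates = {c: 0.0 for c in candidates}  # running mean reward per candidate
    counts = {c: 0 for c in candidates}

    for _ in range(steps):
        if random.random() < epsilon:
            choice = random.choice(candidates)           # explore: try something at random
        else:
            choice = max(candidates, key=estimates.get)  # exploit: reuse the current best
        r = reward_fn(choice)
        counts[choice] += 1
        estimates[choice] += (r - estimates[choice]) / counts[choice]

    return max(candidates, key=estimates.get)

# Toy usage: the "candidates" could be anything from hyperparameters to
# generated programs; here they're just arms with hidden payout probabilities.
hidden = {"a": 0.2, "b": 0.5, "c": 0.8}
best = epsilon_greedy_search(list(hidden), lambda c: float(random.random() < hidden[c]))
print(best)  # most likely "c"
```

The point is just that each evaluation of reward_fn costs real compute, so the exploration budget is where the energy bill shows up.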