There's a vein of research which interprets self-attention as a kind of gradient descent: LLMs have essentially pre-solved indefinitely large 'families' or 'classes' of tasks, and the 'learning' they do at runtime is simply gradient descent (possibly Newton's method) using the 'observations' to figure out which pre-solved instance they are now encountering. This would explain why they fail in such strange ways, especially in agentic scenarios: if the true task is not inside those pre-learned classes, no amount of additional descent can find it once you've converged to the 'closest' pre-learned task. (Some links: https://gwern.net/doc/ai/nn/transformer/attention/meta-desce... )
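To make the 'attention = gradient descent' claim concrete, here's a minimal NumPy sketch of the kind of construction those papers give (e.g. von Oswald et al. 2023): for in-context linear regression, a linear self-attention readout with suitably chosen weights produces exactly the prediction of one gradient-descent step on the in-context examples. The dimensions, step size, and the simplified dot-product form (the Q/K/V matrices folded away) are my illustrative choices, not anything taken from the linked writeups:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, lr = 3, 8, 0.1          # input dim, # in-context examples, GD step size (illustrative)

    # In-context regression task: y_j = w_true . x_j
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))   # context inputs
    y = X @ w_true                # context targets
    x_q = rng.normal(size=d)      # query input

    # (1) Explicit gradient descent: one step on L(w) = 0.5 * sum_j (w.x_j - y_j)^2
    #     starting from w = 0, then predict for the query.
    w0 = np.zeros(d)
    grad = X.T @ (X @ w0 - y)     # = -X^T y at w0 = 0
    w1 = w0 - lr * grad
    pred_gd = w1 @ x_q

    # (2) Linear self-attention with weights chosen so the attention output added to
    #     the query token's y-slot is: delta_y_q = lr * sum_j y_j * (x_j . x_q)
    attn = sum(y_j * (x_j @ x_q) for x_j, y_j in zip(X, y))
    pred_attn = 0.0 + lr * attn   # query token's y-slot starts at 0

    print(pred_gd, pred_attn)     # identical up to floating-point error

The two printed numbers coincide, which is the point: the 'learning' happening in the forward pass is just a descent step over a task family the frozen weights already encode.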
I wonder if this can be interpreted as consistent with that 'meta-learned descent' PoV. If the system is fixed and just cycles through fixed strategies, that is exactly what you'd expect: the descent thrashes around the nearest pre-learned tasks, but it can't change the overall system or add newly solved tasks.