I mostly agree. They are all that you say, but if you think about the conditional distribution that is being learned, nothing in principle prevents us from mapping different contexts to the same responses. It is rather a practical limitation: we don't have sufficient tools for shaping these distributions with any precision. All we can do is throw data at them and hope they generalize to similar contexts.

We have observed situations where agentic LLM traces on verifiable problems, run with deterministic (greedy) decoding, produce either completely correct or completely wrong solutions depending on the minutes on the clock, which were printed as incidental output by some tool the LLM used.
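To make that concrete, here is a minimal sketch of the kind of experiment that exposes it. GPT-2, the prompt, and the timestamp format are stand-ins I made up, and a model this small may or may not actually diverge; the point is only that greedy decoding is deterministic given the context, so an irrelevant token flipping is the only degree of freedom left:

```python
# Toy sketch, not our actual agent setup: GPT-2 stands in for the model,
# and the prompt/timestamp format is invented. With greedy decoding the
# output is a deterministic function of the context, so any change in the
# answer between these two runs is caused solely by the "minutes" token.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for minute in ("14:03", "14:04"):
    prompt = f"[tool output: current time {minute}]\nQ: What is 17 * 23?\nA:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=16, do_sample=False,
                         pad_token_id=tok.eos_token_id)  # greedy decoding
    print(minute, "->", tok.decode(out[0][ids.shape[1]:]))
```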

I think there may be some mild fixes available for current models. For example, it is worrying that the attention mechanism can never fully disregard any token in the input, because the softmax always assigns a weight > 0 everywhere (and the network has no way to set a logit to -infinity). This makes it extremely difficult for the LLM to reliably ignore any part of its context.
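A quick way to see this (a minimal numpy sketch; the logit values are invented):

```python
# Softmax of any finite logits is strictly positive everywhere,
# since exp(x) > 0 for every finite x.
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # standard numerical stabilization
    e = np.exp(z)
    return e / e.sum()

# One "irrelevant" token the model would like to ignore entirely:
attn_logits = np.array([5.0, 4.2, 3.9, -20.0])
w = softmax(attn_logits)
print(w)                # last weight is ~1e-11: tiny, but not zero
print((w > 0).all())    # True: attention mass can shrink but never vanish
```

(In finite-precision arithmetic an extremely negative logit can underflow to an exact zero, but within the logit ranges real networks actually produce, every token keeps a strictly positive share of attention.)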

However, Yann LeCun offers some persuasive arguments that autoregressive decoding has more fundamental limitations, and that we may need something better.