"Next token prediction" is an interface, not an algorithm. A process that "predicts next tokens" can be arbitrarily complex or simple, and arbitrarily capable or incapable of performing a given task.
Saying that an LLM can or can't do something because it's a "token predictor" is a category error. The interface isn't a hard limit.
I'm not sure if it's has any real bearing on real-world performance, but technically next token prediction makes it an online algorithm and they can be provably worse than (good) offline algorithms.
The word "prediction" still holds a lot of weight. LLM's only can predict what has been written. This is a hard limit.
For something like "a hard limit" to hold, LLMs must be restricted to only reproducing existing text. This is utterly false even for base models - their basin seems to be "permutations loosely inspired by existing text".
And that's before all the post-training comes in.
What's the "limit" there?