Didn't thinking tokens resolve the most problematic part of autoregressive models (the first few tokens set constraints the model can't overcome later) and give them a massive advantage over diffusion models by exposing the thinking trace? I can see diffusion models being used as draft models that quickly predict a batch of tokens, with the autoregressive model then deciding whether to keep them or throw them away, speeding it up considerably while keeping thinking traces available.
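To make the draft-then-verify idea concrete, here is a rough sketch of a greedy version of that loop. Everything in it is an assumption for illustration: `toy_draft_block` stands in for a diffusion drafter that emits a block of tokens at once, `toy_target_next` stands in for the autoregressive model's greedy next-token choice, and the names `speculative_step` and `generate` are made up. A real system would verify all drafted positions in one parallel forward pass of the autoregressive model; the loop below checks them one at a time only to keep the logic readable.

```python
from typing import Callable, List

Token = int
DraftFn = Callable[[List[Token], int], List[Token]]   # (prefix, k) -> k proposed tokens
TargetFn = Callable[[List[Token]], Token]             # prefix -> greedy next token


def speculative_step(prefix: List[Token], draft_block: DraftFn,
                     target_next: TargetFn, k: int) -> List[Token]:
    """Draft k tokens, keep the longest prefix the target agrees with,
    and replace the first disagreement with the target's own token."""
    proposal = draft_block(prefix, k)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target overrides the draft here
            break                      # everything drafted after this is discarded
        accepted.append(tok)
    return accepted


def generate(prompt: List[Token], draft_block: DraftFn,
             target_next: TargetFn, k: int, max_new: int) -> List[Token]:
    """Repeat draft-then-verify steps until max_new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_block, target_next, k))
    return out[:len(prompt) + max_new]


# Hypothetical toy models: the target counts upward mod 10; the drafter
# does the same but is wrong whenever the correct token would be 5.
def toy_target_next(prefix: List[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_draft_block(prefix: List[Token], k: int) -> List[Token]:
    last = prefix[-1]
    block = [(last + i) % 10 for i in range(1, k + 1)]
    return [0 if t == 5 else t for t in block]


if __name__ == "__main__":
    print(generate([0], toy_draft_block, toy_target_next, k=4, max_new=8))
```

Each step always yields at least one token the autoregressive model would have picked itself, so in this greedy setup the final output matches what the autoregressive model would produce on its own. The speedup depends entirely on how often the drafter's guesses are accepted.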
The reason I mentioned "purely autoregressive" is that, realistically, I expect hybrid diffusion + autoregressive models to be the first popular diffusion models. I could be wrong, though. Diffusion models also have other tricks, like really easy integration with simple classifiers.
Check out this paper where they use diffusion during inference on the autoencoded prediction of an autoregressive model: https://openreview.net/forum?id=c05qIG1Z2B