The thing is, diffusion models perform somewhat worse than autoregressive models on text. So you lose some quality.
Speed is the big advantage. Autoregressive local inference is mostly memory bound: you generate one token at a time, and for each token you need to load all the weights. MTP (and speculative decoding with a smaller draft model) helps a bit by letting you draft several tokens cheaply and then verify them in parallel with the full model, so you get a few computations out of every memory load. But because the tokens still depend on each other sequentially and invalid drafted tokens have to be discarded, you can only get so much speedup.
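To put rough numbers on that, here's a back-of-envelope sketch. Everything in it is an assumption for illustration: the function names are made up, the 70B / 4-bit / 400 GB/s figures are hypothetical hardware, and the acceptance model (each drafted token accepted independently with a fixed probability) is a simplification, not how any particular runtime measures it.

```python
def decode_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when every generated token
    has to stream all the weights from memory once (memory-bound regime)."""
    weight_gb = params_billion * bytes_per_param  # billions of params * bytes each = GB
    return bandwidth_gb_s / weight_gb


def expected_tokens_per_verify(accept_prob, draft_len):
    """Expected tokens emitted per full-model pass when drafting draft_len tokens.

    Assumes each drafted token is accepted independently with probability
    accept_prob and that the first rejection discards the rest of the draft.
    The verify pass always produces one token itself, hence the k = 0 term.
    """
    return sum(accept_prob ** k for k in range(draft_len + 1))


# Hypothetical setup: 70B model, 4-bit weights (0.5 bytes/param), 400 GB/s memory.
base = decode_tokens_per_second(70, 0.5, 400)
gain = expected_tokens_per_verify(0.7, 4)
print(f"plain decode: ~{base:.1f} t/s")
print(f"with drafting: ~{base * gain:.1f} t/s (ignoring the cost of drafting)")
```

Note how the gain flattens out: with a 0.7 acceptance rate it can never exceed 1 / (1 - 0.7), about 3.3x, no matter how long the draft is, which is the "only so much speedup" part.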
For hosted models, however, you can batch many token generations together, fully utilizing all of the compute while no longer being bottlenecked on memory bandwidth. So they are already operating at close to max efficiency.
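A crude roofline-style sketch of why batching changes the picture. The assumptions are loud ones: weights are loaded once per step and shared by the whole batch, compute is about 2 FLOPs per parameter per token, and KV-cache traffic and attention cost are ignored. The accelerator numbers (3000 GB/s, 1000 TFLOPS) are hypothetical, and the function names are just for illustration.

```python
def step_time_s(batch, params_billion, bytes_per_param, bandwidth_gb_s, compute_tflops):
    """Time for one decode step over a batch of independent sequences.

    The step takes as long as the slower of: streaming the weights once,
    or doing the matmul work for every sequence in the batch.
    """
    load_s = params_billion * bytes_per_param / bandwidth_gb_s
    # Roughly 2 FLOPs per parameter per generated token.
    compute_s = batch * 2 * params_billion * 1e9 / (compute_tflops * 1e12)
    return max(load_s, compute_s)


def batch_tokens_per_second(batch, params_billion, bytes_per_param,
                            bandwidth_gb_s, compute_tflops):
    """Aggregate tokens/s across the whole batch (one token per sequence per step)."""
    return batch / step_time_s(batch, params_billion, bytes_per_param,
                               bandwidth_gb_s, compute_tflops)


# Hypothetical accelerator: 3000 GB/s memory bandwidth, 1000 TFLOPS.
for b in (1, 16, 64, 256, 512):
    tps = batch_tokens_per_second(b, 70, 0.5, 3000, 1000)
    print(f"batch {b:>3}: ~{tps:,.0f} t/s total")
```

Under these assumptions throughput scales linearly with batch size until the compute side takes over (around batch 83 here), after which it's flat. That flat region is where a hosted service wants to sit, and it's also where parallel decoding has nothing left to reclaim.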
So diffusion kind of loses its benefit in hosted models. Sure, maybe you could pay more to get slightly lower-latency responses by running diffusion for one user at a time instead of autoregressive for many in parallel. But given that it also reduces accuracy, it's hard to see where you'd really want that. Unless they're able to bring it up to par with autoregressive, it seems like a bit of a dead end outside of local models, where you're generally just doing one thing at a time.
I'm particularly curious to know how this plays out, and I seriously hope that more labs focus on diffusion models for text usage.
My immediate thought - this performs slightly worse than the autoregressive gemma equivalent, but it may also let me functionally run better models in diffusion variants.
Ex - I can run 70b-120b autoregressive models locally right now, but I get ~5-15t/s, which just isn't fast enough for serious work.
Which caps me at models in the 20-36b range (ex - gemma4), where I can get 100+t/s on the same hardware.
So the question becomes - does the quality drop from a diffusion model outweigh the quality bump from using a larger model?
Because if not... sounds like diffusion models have a lot of space to thrive.
---
Sadly - if they can't be hosted profitably, I question whether this space will actually be explored.
Almost certainly not if things remain as they are. The reason there's been little traction is that the quality gap between diffusion and autoregressive models is pretty stark. I mean, just look at the benchmarks here: large dropoffs, with the hardest benchmarks seeing the largest drops. On top of that, almost all the speed benefits of diffusion models are negated at scale. So this is only attractive for local model development, and almost everyone training local models still cares about pound-for-pound quality and inference efficiency at scale.
It's fast enough that "ask it twice and pick the best" should still come out ahead performance-wise. I don't know how much that would close the quality gap by, but it's worth a play.
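For a feel of the ceiling on "ask twice", here's a tiny sketch. It assumes the samples are independent and that you can always pick the correct one when it exists (a perfect picker), so it's an upper bound and not a prediction. The 0.6 success rate and 300 t/s speed are made-up inputs, not benchmark numbers.

```python
def best_of_n_success(p_single, n=2):
    """Chance that at least one of n independent samples is correct.

    Upper bound on best-of-n quality: assumes independent samples and a
    picker that always recognizes the correct answer when one is present.
    """
    return 1 - (1 - p_single) ** n


def effective_tokens_per_second(tokens_per_second, n=2):
    """Useful output speed when n full answers are generated one after another."""
    return tokens_per_second / n


# Made-up inputs: 60% single-shot success rate, 300 t/s raw generation speed.
print(f"best-of-2 success ceiling: {best_of_n_success(0.6, 2):.2f}")
print(f"effective speed: {effective_tokens_per_second(300, 2):.0f} t/s")
```

In practice the samples are correlated and the picker is imperfect, so the real gain would be smaller than this ceiling.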
This may be the future of local models.