The issues around training diffusion models are well known among researchers. They're unlikely to scale feasibly far beyond the 26B size of DiffusionGemma itself, and their lower quality compared to an equally sized auto-regressive model (the usual one-token-at-a-time flow) is also a matter of broad consensus.
> They're unlikely to scale feasibly far beyond the 26B size of DiffusionGemma itself
I think people used to say the same about the 8B text-diffusion models like LLaDA when they came out. LLaDA2.0 seemingly claims a 100B total / 6.1B active MoE diffusion model (DiffusionGemma is also MoE). Not saying you're wrong about the current consensus, but consensus has a way of changing over time. It might be a bit early to claim it's infeasible to scale them, especially considering that the final artifact is much more suitable for local usage.
Difficulty of scaling is not the only issue. Nobody is going to be particularly invested in scaling an architecture:
- That has consistently proven to be behind its auto-regressive counterparts in quality. Look at the dgemma benchmarks: pretty steep dropoffs, and the more difficult the benchmark, the worse the dropoff. That's not a good look, and it's not like it's some artifact of Google's release. Every dLLM is like this.
- And whose inference benefits are negated at scale. Transformers are still cheaper if you want to serve lots of users.
> "DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"
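To make the quoted claim concrete, here's a back-of-the-envelope roofline sketch. Everything in it is an assumption for illustration: the hardware numbers are made up, the model is treated as a dense 26B with 2-byte weights (MoE sparsity is ignored), and `tokens_per_pass` is a hypothetical stand-in for "tokens produced per forward pass". It also ignores that a diffusion model needs several denoising passes, so treat it as a toy model of the bandwidth-vs-compute tradeoff, not a benchmark.

```python
def tokens_per_second(batch, tokens_per_pass, params=26e9, bytes_per_param=2,
                      bandwidth=1e12, peak_flops=4e14):
    """Toy roofline: one forward pass costs max(memory time, compute time).

    All defaults are illustrative assumptions, not measured values.
    """
    # Weights are read from memory once per pass, regardless of batch size.
    memory_time = params * bytes_per_param / bandwidth
    # Roughly 2 FLOPs per parameter per token processed in the pass.
    compute_time = 2 * params * batch * tokens_per_pass / peak_flops
    return batch * tokens_per_pass / max(memory_time, compute_time)


def parallel_speedup(batch, tokens_per_pass=8):
    """Throughput gain from emitting several tokens per pass vs. one."""
    return tokens_per_second(batch, tokens_per_pass) / tokens_per_second(batch, 1)


if __name__ == "__main__":
    for batch in (1, 64, 512):
        print(f"batch={batch}: speedup ~{parallel_speedup(batch):.2f}x")
```

With these made-up numbers, the speedup is about 8x at batch 1 (memory-bound, so the extra tokens are nearly free) and about 1x at batch 512 (compute-bound, so one-token-at-a-time decoding already saturates the hardware).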
Put yourself in the shoes of all the labs, even the open-source ones. Why would you put much effort into this?
> - And whose inference benefits are negated at scale. Transformers are still cheaper if you want to serve lots of users.
But my entire point is about the reverse of this. The context of what I'm bringing up is single-user scenarios, which is where these diffusion models really make a large difference in performance.
Sure, I agree it's not a good fit for every single use case out there. But after starting to play around with it more closely myself, I think people are dismissing it a bit too quickly, at least if you're interested in running local models on your own hardware.
I don't think you're really getting the point I'm trying to make. Everyone who regularly trains LLMs cares about serving users at scale and about quality per compute invested. It's not just OpenAI or Anthropic or Google. Qwen, DeepSeek, Moonshot, whatever. They all care about it very much and basically can't afford to take a step back in those areas.
Since training models is currently a very expensive procedure, diffusion LLMs are destined to be relegated to the occasional research artifact at best. As things stand, making a serious commitment to dLLMs is basically the equivalent of throwing money into a fire pit. No one has money to waste like that. Things are expensive enough as is.
Alternative architectures that do a much better job of matching transformers in quality have gone nowhere, but you expect one that is worse in basically every way the labs care about to fare differently? I'm not trying to 'dismiss' dLLMs. I'm actually interested in them for much the same reason you are. I'm just stating the factors at play plainly.
Single-user scenarios can also use MTP (multi-token prediction) to make auto-regressive inference more compute-intensive, i.e. to do more useful work per pass over the weights, with no loss of quality.