Feels like the sodium-ion vs lithium-ion battery thing, where one has theoretical benefits but the other has such a head start on commercialization that it'll take a long time to catch up.

Not really. Unlike with physical goods like batteries, the hardware for training a diffusion vs an autoregressive language model is more or less exactly the same.

That said, the lab that did this research (Chris Re and Tri Dao are involved) is run by some of the world's experts in squeezing CUDA and Nvidia hardware for every last drop of performance.

At the API level, the primary difference will be the addition of text infill capabilities for language generation. I also somewhat expect certain types of generation to be more cohesive (e.g. comedy or stories, where you need to think of the punchline or ending first!).

Same with digital vs analog.

Digital came later but beat analog at almost everything?