I think it's the same underlying principle of learning the "joint distribution of things humans have said". Whether you do it autoregressively with LLMs or with diffusion models, you still end up learning this distribution. The insight seems to be the crazy leap that A) this is a valid thing to talk about at all, and B) learning this distribution gives you something meaningful.
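To make the autoregressive half of that concrete: "learning the joint distribution" just means factorizing it with the chain rule, p(x_1..x_n) = prod_t p(x_t | x_<t), and learning the conditionals. Here's a toy sketch in plain Python (the three-sentence "corpus" is obviously made up) showing that multiplying next-token conditionals back together recovers the joint probability of a whole sequence:

```python
from collections import defaultdict

# Toy stand-in for "humanity's collective textual output" (purely illustrative).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# Count next-token frequencies given the full prefix: an exact (overfit)
# autoregressive "model" of this tiny corpus.
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    prefix = ("<bos>",)
    for token in sentence:
        counts[prefix][token] += 1
        prefix = prefix + (token,)

def p_next(prefix, token):
    """Conditional p(token | prefix) estimated from counts."""
    total = sum(counts[prefix].values())
    return counts[prefix][token] / total if total else 0.0

def p_joint(sentence):
    """Chain rule: p(x1..xn) = product over t of p(x_t | x_<t)."""
    prob, prefix = 1.0, ("<bos>",)
    for token in sentence:
        prob *= p_next(prefix, token)
        prefix = prefix + (token,)
    return prob

# Matches the empirical joint distribution: "the cat sat" is 1 of 3 sentences.
print(p_joint(["the", "cat", "sat"]))  # ~0.333
```

A real LLM swaps the count table for a neural net and the toy corpus for a large slice of human text, but the object being estimated is that same joint distribution; a diffusion model parameterizes it differently without changing what's being learned.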
The leap, then, is in transforming the ill-defined objective of "modeling intelligence" into a concrete proxy objective. Note that the task isn't even "the distribution of valid/true things", since validity/truth is hard to define. It's something closer to "the distribution of things a human might say", implemented in the "dumbest" possible way: modeling the distribution of humanity's collective textual output.
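In code, that "dumbest possible" proxy objective is nothing more than maximum likelihood on tokenized text, i.e. the standard next-token cross-entropy loss. A minimal PyTorch sketch (the shapes, the random token ids, and the embedding-plus-linear stand-in for a real Transformer are all just for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of token-id sequences drawn from some text corpus.
batch, seq_len, vocab = 2, 16, 50_000
tokens = torch.randint(vocab, (batch, seq_len))  # stand-in for real tokenized text

# A real model would be an autoregressive Transformer mapping tokens -> logits;
# an embedding plus linear head keeps this runnable.
embedding = torch.nn.Embedding(vocab, 64)
head = torch.nn.Linear(64, vocab)
logits = head(embedding(tokens))                 # (batch, seq_len, vocab)

# The whole proxy objective: maximize the likelihood of the corpus, i.e.
# minimize cross-entropy between the predicted next-token distribution and
# the token that actually came next in the human-written text.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),  # predictions for positions 1..T-1
    tokens[:, 1:].reshape(-1),          # the tokens people actually wrote next
)
print(loss.item())
```

Nothing in that loss mentions truth or validity; it only rewards matching what people actually wrote.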