Current music generators use next-token prediction over audio tokens, like LLMs, rather than the denoising approach used by image generators. [0][1]
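
For anyone unfamiliar with what that means in practice, here is a rough sketch of next-token sampling over discrete audio tokens. It is a toy, not code from either paper: `VOCAB_SIZE`, `toy_logits`, and `generate` are made-up names, and the scoring function is a hand-written stand-in for what would be a trained transformer. A real system would also need an audio codec to turn the tokens back into a waveform, which is omitted here.

```python
import math
import random

VOCAB_SIZE = 8  # hypothetical codebook size, chosen small for illustration


def toy_logits(context):
    """Stand-in for a trained model: score every candidate next token.

    Purely illustrative: it favours the token one step above the last one.
    """
    last = context[-1] if context else 0
    return [2.0 if tok == (last + 1) % VOCAB_SIZE else 0.0
            for tok in range(VOCAB_SIZE)]


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, n_new, rng):
    """Autoregressive loop: sample one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(n_new):
        probs = softmax(toy_logits(tokens))
        next_tok = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(next_tok)  # the new token becomes part of the context
    return tokens


audio_tokens = generate(prompt=[0], n_new=16, rng=random.Random(0))
print(audio_tokens)
```

The point of the sketch is the loop: each token is sampled conditioned on everything generated so far, which is the same shape as LLM text generation. A denoising model would instead start from noise over the whole output and refine all of it over several steps.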

[0] https://arxiv.org/abs/2503.08638 (grep for "audio token")

[1] https://arxiv.org/abs/2306.05284