I don't know about linear models, but this kind of hierarchical modelling is quite a common idea in speech research. For example, OpenAI's Jukebox (2020) [1], which is built on an early neural audio codec (a hierarchical VQ-VAE), encodes audio at three levels that get coarser and coarser. A language model predicts continuations at the coarsest level, upsampler models then fill in the finer levels, and finally a decoder maps the finest level back to audio.
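Very roughly, the coarse-to-fine pipeline looks like this (a toy sketch with stubbed-out models, not Jukebox's real code; the function names, codebook size, and upsampling factor are all made up for illustration):

    import random

    # Toy stand-ins for the learned models -- everything here is
    # a random stub for illustration, not Jukebox's real API.

    def sample_top_prior(n):
        # Autoregressive LM over the coarsest codebook (stubbed as random).
        return [random.randrange(2048) for _ in range(n)]

    def upsample(coarse, factor=4):
        # One upsampler level: predict `factor` finer tokens per coarse
        # token, conditioned on the level above (stubbed as random).
        return [random.randrange(2048) for _ in coarse for _ in range(factor)]

    def decode_to_audio(tokens):
        # Codec decoder: map the finest tokens back to waveform samples
        # (stubbed as a trivial rescale).
        return [t / 2048.0 for t in tokens]

    coarse = sample_top_prior(64)   # coarsest level, generated by the LM
    middle = upsample(coarse)       # middle level
    fine = upsample(middle)         # finest level
    audio = decode_to_audio(fine)   # back to audio

The point is just the structure: the only unconditional model is the one at the coarsest level, and everything below it is conditional on the level above.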
The recent MiMo-Audio groups tokens into "patches" of four timesteps and has the model predict whole patches at a time. [2]
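The patching itself is just grouping consecutive timesteps (a toy illustration; as I understand it, the real model embeds and decodes patches with learned modules rather than a plain reshape):

    PATCH = 4
    tokens = list(range(16))  # a toy audio-token sequence

    # Group consecutive timesteps into patches; the model predicts one
    # patch per step instead of one token, cutting sequence length 4x.
    patches = [tuple(tokens[i:i + PATCH]) for i in range(0, len(tokens), PATCH)]
    # -> [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15)]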
[1] https://arxiv.org/abs/2005.00341
[2] https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audi...