I don't know about linear models, but this kind of hierarchical modelling is quite a common idea in speech research. For example, OpenAI's Jukebox (2020) [1], which is built on an early neural audio codec (a hierarchical VQ-VAE), encodes audio at three levels that get coarser and coarser. A language model predicts continuations at the coarsest level, upsampler models then fill in the finer levels, and finally a decoder maps the finest level back to audio.
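Very roughly, the coarse-to-fine pipeline looks like this (a toy sketch with stubbed-out models, not Jukebox's real code; the function names, codebook size, and upsampling factor are all made up for illustration):

    import random

    # Toy stand-ins for the learned models -- everything here is
    # a random stub for illustration, not Jukebox's real API.

    def sample_top_prior(n):
        # Autoregressive LM over the coarsest codebook (stubbed as random).
        return [random.randrange(2048) for _ in range(n)]

    def upsample(coarse, factor=4):
        # One upsampler level: predict `factor` finer tokens per coarse
        # token, conditioned on the level above (stubbed as random).
        return [random.randrange(2048) for _ in coarse for _ in range(factor)]

    def decode_to_audio(tokens):
        # Codec decoder: map the finest tokens back to waveform samples
        # (stubbed as a trivial rescale).
        return [t / 2048.0 for t in tokens]

    coarse = sample_top_prior(64)   # coarsest level, generated by the LM
    middle = upsample(coarse)       # middle level
    fine = upsample(middle)         # finest level
    audio = decode_to_audio(fine)   # back to audio

The point is just the structure: the only unconditional model is the one at the coarsest level, and everything below it is conditional on the level above.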
The recent MiMo-Audio groups tokens into "patches" of four timesteps and has the model predict whole patches at a time. [2]
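The patching itself is just grouping consecutive timesteps (a toy illustration; as I understand it, the real model embeds and decodes patches with learned modules rather than a plain reshape):

    PATCH = 4
    tokens = list(range(16))  # a toy audio-token sequence

    # Group consecutive timesteps into patches; the model predicts one
    # patch per step instead of one token, cutting sequence length 4x.
    patches = [tuple(tokens[i:i + PATCH]) for i in range(0, len(tokens), PATCH)]
    # -> [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15)]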
[1] https://arxiv.org/abs/2005.00341
[2] https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audi...