Dumb question: can you train a model to predict the next byte of ANOTHER MODEL

So apply this same logic to compressing a bigger model within a smaller model

I know this is absolutely regarded, but humour me please

Not dumb at all. It's a whole field of active research - Speculative Decoding. A recent paper goes one level deeper with Speculative Speculative Decoding - https://arxiv.org/abs/2603.03251

Is model distillation also related?

Oh man awesome! I’m so S-M-R-T

Compression is such an interesting field

If there's any redundancy in the model that can be compressed (parallel to how RLE is used to compress the static Huffman tree in FLATE) that's possible, but it's not necessary if the model is being trained on the input dynamically, like what Bellard's NNCP does.