> This demonstrates a technique whereby a smaller/older/cheaper model has been used to improve the output of a larger model. This is backwards (as far as I understand). The current SOTA technique typically sees enormous/expensive models training smaller cheaper models.

There are two separate aspects here. In this paper they improve the software around the model, not the model itself. What they're saying is that the software improvements carried over to other models, so it wasn't just optimising around model-specific quirks.

What you're describing, where the big expensive model is trained first and then teaches the smaller one, is usually called "distillation". It works by training the smaller LLM to match the teacher's full probability distribution over next tokens at each step, rather than a single sampled token, which is a richer training signal per example and hence faster in practice.
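
To make the contrast concrete, here's a minimal sketch of that kind of distillation objective in PyTorch. The names (`student_logits`, `teacher_logits`, `temperature`) are illustrative, and it assumes both models share the same vocabulary; it's not code from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions with a temperature, then minimise the KL
    # divergence so the student matches the teacher's whole distribution
    # over next tokens, not just the teacher's sampled/argmax token.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean + T^2 scaling follows the usual convention from Hinton et al.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

That per-position distribution matching is where the efficiency comes from: every token position gives the student a full vector of soft targets instead of one hard label.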