Hacker News

Distillation isn't only between different labs.

A lab can train a large model and then distill a smaller model from it that retains most of the useful capability.

I don't know enough to say whether that beats just training the smaller model directly, but I'd bet it's useful at least some of the time. I could easily see it being easier to do the initial pre-training on a larger model and then distill everything useful down into a smaller one, essentially filtering out a lot of noise in the process.
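
For anyone unfamiliar with what "distill" means mechanically, here is a minimal sketch of the classic soft-target formulation: the student is trained to match the teacher's temperature-softened output distribution. I'm assuming that formulation for illustration, not claiming it's what any particular lab does. The function names and the default temperature are made up for this example, and a real setup would use a deep learning framework and usually mix in a loss on the true labels as well.

```python
import math

def log_softmax(logits, temperature=1.0):
    # Divide by the temperature, then subtract the log-sum-exp for stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) over temperature-softened distributions.
    # A higher temperature exposes more of the teacher's relative
    # preferences among the non-top classes.
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # The T^2 factor is the conventional rescaling that keeps gradient
    # magnitudes comparable across temperatures.
    return kl * temperature ** 2

# Hypothetical logits for a single 3-class example.
teacher = [2.0, 0.5, -1.0]
student = [0.1, 0.2, 0.3]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it over the training data pulls the small model toward the large model's behavior.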