Hacker News

rao-v 4 hours ago

It’s really worth distinguishing between old-fashioned student-teacher distillation (i.e. at the level of layers, weights and distributions) and large-scale synthetic dataset creation.

The latter is much better (since you can clean up, review, update responses and filter your datasets).

I suspect nobody is doing real student-teacher distillation: it’s just easier to do a bunch of training on the same giant corpus, then post-train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger, better LLM).
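To make the "filter your datasets" step concrete, here is a minimal sketch of best-of-n filtering for a synthetic corpus. Everything in it is an assumption for illustration: `build_synthetic_dataset`, `toy_generate`, and `toy_score` are hypothetical names, and a real pipeline would call a large model and a real quality check instead of the toy stand-ins.

```python
def build_synthetic_dataset(prompts, generate, score, n_samples=4, min_score=0.5):
    """Sample several responses per prompt, keep the best one if it passes a threshold."""
    dataset = []
    for prompt in prompts:
        # Draw n candidate responses from the (hypothetical) teacher model.
        candidates = [generate(prompt, seed) for seed in range(n_samples)]
        best = max(candidates, key=score)
        # Drop the prompt entirely if even the best candidate is too weak.
        if score(best) >= min_score:
            dataset.append({"prompt": prompt, "response": best})
    return dataset


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_generate(prompt, seed):
    return f"{prompt}: draft {seed}"


def toy_score(response):
    # Pretend later drafts are better; a real scorer might be a verifier or reward model.
    return int(response.rsplit(" ", 1)[-1]) / 3


dataset = build_synthetic_dataset(["q1", "q2"], toy_generate, toy_score)
rejected = build_synthetic_dataset(["q1", "q2"], toy_generate, toy_score, min_score=2.0)
```

The point is only that the filtering happens on text, so responses can be reviewed or replaced before any student training starts.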

girvo an hour ago

> I suspect nobody is doing real student-teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD, quantization-aware distillation). Can’t speak to how widespread it is, though.

rao-v 37 minutes ago

Yes, absolutely! I should have been more specific: I don’t believe people are using it to train 30B models from 300B models (and I’d love to learn that I’m off about this).

ACCount37 an hour ago

A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets: more steering signal per teacher token. And running ultra-large 10T-tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text-only synthetics.
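A rough sketch of the soft-versus-hard distinction, on a toy four-token vocabulary. The logits, the temperature of 2.0, and the function names are all assumptions for illustration, not anyone's actual training setup.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_target_loss(student_logits, target_index):
    """Cross-entropy against one sampled token, which is all that text provides."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the whole vocabulary at a shared temperature."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy example: the teacher thinks tokens 0 and 1 are both plausible.
teacher_logits = [2.0, 1.5, -1.0, -3.0]
student_logits = [0.0, 0.0, 0.0, 0.0]  # untrained student, uniform over 4 tokens

# Hard target: the student only learns "token 0 was emitted".
hard = hard_target_loss(student_logits, target_index=0)
# Soft target: the student also learns token 1 was nearly as good, 2 and 3 were not.
soft = soft_target_loss(student_logits, teacher_logits)
```

The hard loss sees a single index per position; the soft loss compares against every entry of the teacher's distribution, which is the extra signal per teacher token.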

rao-v 35 minutes ago

I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for its best output than to waste time running it on arbitrary output just to train the student.

Though you could argue that perhaps labs just save the per-token distribution and use that during fine-tuning … which starts looking more like student-teacher fine-tuning, if not classic distillation from random weights.

ACCount37 27 minutes ago

Full distributions are a fucking pain to save - at this point just save the hidden states. But there are lossy compression tricks there.
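One common lossy trick is to keep only the top-k entries of each distribution. The sketch below is a guess at what that could look like, not a description of any lab's pipeline; the vocabulary size, hidden width, k, and byte sizes are all assumed numbers chosen only to show the arithmetic.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compress_top_k(probs, k):
    """Keep the k most likely tokens as (index, prob) pairs plus the leftover tail mass."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [(i, probs[i]) for i in top]
    tail_mass = max(1.0 - sum(p for _, p in kept), 0.0)
    return kept, tail_mass


def decompress(kept, tail_mass, vocab_size):
    """Rebuild an approximate distribution, spreading the tail uniformly over dropped tokens."""
    dropped = vocab_size - len(kept)
    fill = tail_mass / dropped if dropped else 0.0
    probs = [fill] * vocab_size
    for index, prob in kept:
        probs[index] = prob
    return probs


# Storage per token position under assumed sizes (all hypothetical).
VOCAB_SIZE = 128_000   # assumed vocabulary size
HIDDEN_DIM = 8_192     # assumed hidden-state width
TOP_K = 64             # assumed number of entries kept
full_bytes = VOCAB_SIZE * 2        # full distribution in fp16
hidden_bytes = HIDDEN_DIM * 2      # final hidden state in fp16
top_k_bytes = TOP_K * (4 + 2)      # int32 token index + fp16 probability per entry

# Round trip on a toy 5-token distribution.
probs = softmax([4.0, 2.0, 0.0, -1.0, -2.0])
kept, tail = compress_top_k(probs, 2)
rebuilt = decompress(kept, tail, len(probs))
```

Under these assumed sizes the full distribution is far larger than the hidden state, which is in turn larger than a top-64 list; the top-k version loses the shape of the tail, which is the lossy part.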
