Hacker News

What you're describing is (pre-)training. Distillation requires richer labels, the probability distribution over tokens (it would be logits rather than probabilities but that's not important). From a chat transcript you can only understand the argmax/most likely token of that distribution (and only if the API allows you to set the temperature to 0). It's not impossible for an API to give you that but they won't if they don't want you distilling their models.

The intuition is that distillation exploits not only the "right" answer but the relationship between answers (what's the second most right answer? the third? etc).