Maybe because distilling small models from bigger ones that you control gives you better small models than fine-tuning from bigger models you don't control?
(I am not claiming this is the case, just stating it as a hypothesis.)