Fun fact: this paper is cited by the Simple Self-Distillation (SSD) paper by Apple [1],[2]. I think it is a poor naming choice, given the very common SSD namesake and the fact that the method belongs to on-policy self-distillation [3]. But then again, according to the authors, their proposed solution is simple because "SSD uses only temperature-shifted samples from the base model and standard cross-entropy training, without privileged context, feedback-conditioned teachers, or auxiliary supervision."
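To make that quoted recipe concrete, here is a minimal sketch of what "temperature-shifted samples from the base model plus standard cross-entropy training" could look like; the model name, prompt, temperature value, and hyperparameters are my own illustrative assumptions, not details from the paper:

    import copy
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; SSD targets code-generation models
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    base = AutoModelForCausalLM.from_pretrained(model_name)
    base.eval()

    # Phase 1: sample completions from the frozen base model at a shifted
    # temperature (the value 0.7 is an assumption, not from the paper).
    prompts = ["def fibonacci(n):"]  # illustrative prompt
    samples = []
    with torch.no_grad():
        for prompt in prompts:
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            samples.append(base.generate(
                ids,
                do_sample=True,
                temperature=0.7,
                max_new_tokens=64,
                pad_token_id=tokenizer.eos_token_id,
            ))

    # Phase 2: fine-tune a copy of the base model on its own samples with
    # plain next-token cross-entropy, no auxiliary losses or teachers.
    student = copy.deepcopy(base)
    student.train()
    opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
    for seq in samples:
        loss = student(seq, labels=seq).loss  # CE computed internally
        loss.backward()
        opt.step()
        opt.zero_grad()

The point of the sketch is just that nothing beyond sampling and vanilla cross-entropy is involved, which matches the authors' "embarrassingly simple" framing.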
The Apple paper also cites another very similar self-distillation paper by a UCLA team. Both cited papers, one by an MIT & ETH team and the other by the UCLA team, propose novel on-policy self-distillation techniques. Interestingly, both teams submitted their papers to arXiv within one day of each other back in January this year [4],[5]. No prize for guessing who actually published the idea first.
IMHO, self-distillation is the future of LLM fine-tuning because it mitigates the catastrophic forgetting of the SFT approach, which can be cumbersome for lightweight fine-tuning as opposed to full post-training of an LLM.
With the advent and proliferation of a plethora of open-source and open-weight LLM foundation models, anyone can fine-tune these models for domain specialization or sub-specialization (medical sub-specialties, law disciplines, branches of architectural practice, etc.) [6]. This fine-tuning can be performed with a minimum of 8 H200 or even 4 H100 GPUs, as reported in the two papers respectively [4],[5]. Let's see if we can replicate that with much cheaper setups consisting of a couple of DGX Sparks, or the latest eight-node DGX Spark arrangement with a total of 1 TB of memory (128 GB x 8) [7],[8].
IMHO, if the results hold up, self-distillation could be the second-best thing to happen to LLMs after the transformer.
[1] Embarrassingly simple self-distillation improves code generation (2026 - 201 comments):
https://news.ycombinator.com/item?id=47637757
[2] Embarrassingly Simple Self-Distillation Improves Code Generation:
https://arxiv.org/abs/2604.01193
[3] Comment on "Embarrassingly simple self-distillation improves code generation":
https://news.ycombinator.com/item?id=47644784
[4] Self-Distillation Enables Continual Learning:
https://arxiv.org/abs/2601.19897
[5] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:
https://arxiv.org/abs/2601.18734
[6] Why domain specific LLMs won't exist: an intuition (2026 - 4 comments):
https://news.ycombinator.com/item?id=47649167
[7] NVIDIA DGX Spark Review The GB10 Machine is so Freaking Cool:
https://www.servethehome.com/nvidia-dgx-spark-review-the-gb1...
[8] BIG AI Cluster Little Power the 8x NVIDIA GB10 Cluster:
https://www.servethehome.com/big-cluster-little-power-the-8x...