Hacker News

>True, but the capabilities and knowledge of that model are also frozen in time, so the value of that model declines over time.

Correction: The capabilities and knowledge of that model can be improved via self-distillation, so the value of that model increases over time.

This is why I think self-distillation is the main way forward, and probably the second-best thing ever to happen to AI/LLMs after the transformer.

With self-distillation, the value of open-weights models will increase over time, since they can be sub-specialized through post-training and fine-tuning.

Please check these very promising recent works and results from MIT/ETH, UCLA and Apple [1], [2], [3]. For example, the MIT/ETH self-distillation approach was demonstrated on a single H200 GPU. Apple's approach is so simple that it's called Simple Self-Distillation (SSD), pun intended.

[1] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[2] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[3] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193