From Jan 2026.

This is very interesting:

"Empirical Validation. While we cannot verify these theoretically, we evaluate each empirically. We use the Qwen-2.5-7B-Instruct model (Hui et al., 2024) as the base policy and the ToolAlpaca dataset (Tang et al., 2023). In this benchmark, the model receives a tool-API specification and a user request, and must identify the correct tool call. Without demonstrations, the base model solves only 42% of examples. When provided with the appropriate demonstration c for each prompt x , the teacher achieves a 100% success rate. To further test reward proximity, we manually inspected 50 teacher reasoning traces. In all cases, not only were the final tool calls correct, but the intermediate chain-of-thought was valid and semantically grounded. This suggests that the teacher is reconstructing a correct reasoning process rather than merely copying the expert output. These observations provide evidence for the first requirement, that the demonstration-conditioned model behaves as an optimal policy."

Both title and abstract feel a little too confident, which ironically makes me more skeptical rather than less.

I find the choice of the words "enable" in the title and "establishing" at the end of the abstract to be particularly jarring.

Fun fact: this paper is cited by the Simple Self-Distillation (SSD) paper from Apple [1],[2]. I think SSD is a poor name, both because of the very common SSD namesake and because the method belongs to the on-policy self-distillation family [3]. Then again, according to the authors, their proposed solution is simple because "SSD uses only temperature-shifted samples from the base model and standard cross-entropy training, without privileged context, feedback-conditioned teachers, or auxiliary supervision."

The Apple paper also cites another very similar self-distillation paper by a UCLA team. Both cited papers, one by the MIT & ETH team and the other by the UCLA team, propose a novel on-policy self-distillation technique. Interestingly, both teams submitted their papers to arXiv within one day of each other back in January this year [4],[5]. No prize for guessing who actually published the idea first.

IMHO, self-distillation is the future of LLM fine-tuning because it mitigates the catastrophic forgetting of the SFT approach, which can be cumbersome for lightweight fine-tuning as opposed to full post-training of an LLM.

With the advent and proliferation of a plethora of open-source and open-weight LLM foundation models, anyone can fine-tune these models for domain specialization or sub-specialization (medical sub-specialties, law disciplines, branches of architecture practice, etc.) with minimal resources of 8 or even 4 H200 GPUs [6]. Heck, let's see if we can replicate that with a much cheaper arrangement consisting of a couple of DGX Sparks, or the latest eight-node DGX Spark cluster with a total of 1 TB RAM (128 GB x 8) [7],[8].

IMHO, if the results are valid, self-distillation could be the second best thing to happen to LLMs after the transformer.

[1] Embarrassingly simple self-distillation improves code generation (2026 - 201 comments):

https://news.ycombinator.com/item?id=47637757

[2] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193

[3] Comment on "Embarrassingly simple self-distillation improves code generation":

https://news.ycombinator.com/item?id=47644784

[4] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[5] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[6] Why domain specific LLMs won't exist: an intuition (2026 - 4 comments):

https://news.ycombinator.com/item?id=47649167

[7] NVIDIA DGX Spark Review: The GB10 Machine is so Freaking Cool:

https://www.servethehome.com/nvidia-dgx-spark-review-the-gb1...

[8] BIG AI Cluster, Little Power: The 8x NVIDIA GB10 Cluster:

https://www.servethehome.com/big-cluster-little-power-the-8x...

Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

Gemini tells me it's the probability of the next token for an LLM. Okay then.

The policy is how you select your actions -- in this case, the next token. It can be random, but it doesn't have to be. "Deterministically choose the best action" is a valid policy (we would call it the greedy policy), as long as you have some other means of injecting stochasticity so the model explores the space. Uniform random is also a valid policy, as is always selecting the same token (it obviously wouldn't be very performant, and would defeat the purpose here, but it might be fine in, for example, a multi-armed bandit scenario). Most of the time, the policy is a parameterized distribution, and we want to learn the model parameters that maximize some measure of success (the reward component).
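To make that concrete, here's a minimal sketch in plain PyTorch of the two policies mentioned above (toy logits and illustrative names, nothing from the paper):

    import torch

    logits = torch.tensor([2.0, 0.5, -1.0])  # scores for 3 possible "tokens"

    # Greedy policy: deterministically pick the highest-scoring action.
    def greedy_policy(logits):
        return int(torch.argmax(logits))

    # Stochastic policy: sample from the softmax distribution,
    # sharpened or flattened by a temperature.
    def sampling_policy(logits, temperature=1.0):
        probs = torch.softmax(logits / temperature, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))

    print(greedy_policy(logits))         # always 0
    print(sampling_policy(logits, 0.7))  # usually 0, sometimes not

Both are functions from a state (here, just the logits) to an action, which is all "policy" means.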

Off-policy versus on-policy refers to what data the model is trained on. On-policy training is where the training data is collected by the policy. Off-policy training is where the data was collected by a different sampling process (e.g. we have a standard dataset that we're going to use for supervised training).
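In toy form (pure Python; the tiny vocabulary and names are just illustrative, not anything from the paper):

    import random

    random.seed(0)

    # Toy "policy": picks the next token from a tiny vocabulary.
    def policy(context):
        return random.choice(["a", "b", "c"])

    # Off-policy data: produced by some other process,
    # e.g. a fixed human-written dataset.
    off_policy_batch = [("prompt1", "a"), ("prompt2", "c")]

    # On-policy data: produced by rolling out the current policy itself.
    on_policy_batch = [(p, policy(p)) for p in ("prompt1", "prompt2")]

The training update can be identical in both cases; what differs is which distribution the (prompt, continuation) pairs came from.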

It’s quite common these days to treat an LLM as a policy in the sense that it takes as a “state” the previous context, and its task is to choose a continuation, as an “action”. It gets a “reward” from a reward model that was trained on human preferences, or from a verifiable source, such as passing test cases.
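Here's a self-contained toy of that state/action/reward loop, where the "model" is just a random linear map and the reward is a made-up verifiable check (all names are mine, not the paper's):

    import torch

    torch.manual_seed(0)
    VOCAB = 16

    # Toy "LM": a random linear map from the last token to next-token
    # logits. It stands in for any autoregressive model.
    W = torch.randn(VOCAB, VOCAB)

    def rollout(prompt_ids, max_new_tokens=8):
        ids = prompt_ids.clone()
        for _ in range(max_new_tokens):
            logits = W[ids[-1]]                    # state: the context so far
            probs = torch.softmax(logits, dim=-1)
            action = torch.multinomial(probs, 1)   # action: the next token
            ids = torch.cat([ids, action])
        return ids

    # A verifiable "reward": did the episode ever emit token 3?
    episode = rollout(torch.tensor([0]))
    reward = 1.0 if (episode == 3).any() else 0.0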

This framing has been in active use for several years, as it’s the framing that enables RLHF and RLVR. RLHF itself is quite old, I think since the original ChatGPT.

What is this comment? It’s an RL paper, these are standard RL terms

It's a comment. On Hacker News. Not the RL subreddit, or whatever. I'm just amazed at the jargon. I'm sure it's useful, but one could just call it model output.

> one could just call it model output.

That would be incorrect. My other reply attempts to address this.

But the probability vector is the output of the LLM, no?

Gemini didn't really say that exactly, did it? Because it's oversimplified to the point of being wrong.

“Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens. It's what a model’s behavior looks like when viewed through an RL lens.

The paper discusses “on-policy” and “off-policy” training, which is central to its idea.

Off-policy training is what happens in standard supervised fine-tuning (SFT): the model is trained on examples that were produced independently of the model. This means that the examples have a different distribution than what the model produces. This can have a negative effect on previously learned capabilities.

On-policy training (in this context) uses data generated by the model itself. It samples the model’s own outputs, scores them against whatever objective is being trained for, and updates the model based on those scores. This reinforces aspects of the model's own pretrained behavior, so it is a "gentler" way to change the model's behavior. The authors claim that this reduces "catastrophic forgetting" and other negative consequences of SFT.
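To make the contrast concrete, here's a toy sketch of the two update styles on a one-layer stand-in "LM" in PyTorch. The reward and all names are illustrative placeholders, not the paper's actual method:

    import torch

    torch.manual_seed(0)
    VOCAB, DIM = 8, 4
    lm = torch.nn.Linear(DIM, VOCAB)        # stand-in for a language model
    opt = torch.optim.SGD(lm.parameters(), lr=0.1)
    state = torch.randn(DIM)                # stand-in for an encoded context

    # Off-policy (SFT): push toward a target token from an external
    # dataset, regardless of what the model itself would have produced.
    external_target = torch.tensor(5)
    loss = torch.nn.functional.cross_entropy(lm(state), external_target)
    opt.zero_grad(); loss.backward(); opt.step()

    # On-policy: sample from the model's own distribution, score the
    # sample, and reinforce it in proportion to that score
    # (a REINFORCE-style update).
    probs = torch.softmax(lm(state), dim=-1)
    action = torch.multinomial(probs, 1).squeeze()
    reward = 1.0 if action.item() == 5 else 0.0   # verifiable "reward"
    loss = -reward * torch.log(probs[action])
    opt.zero_grad(); loss.backward(); opt.step()

The on-policy update only ever strengthens tokens the model itself sampled, which is the sense in which it stays closer to the model's existing distribution.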