Stupid question: I was under the impression that these models were trained on PB of data. Surely the amount of questions/response they can extract from querying a bigger model (Claude) is fairly modest. How is it not a drop vs the training dataset?

It's not about how big your dataset is - it's about how you use it.

I jest, but I'm also completely serious. 1T tokens from Claude can teach a model something 1T tokens scraped from the open web can't. Things like "how an LLM can problem solve effectively", or "how an LLM should use tools", or "how to construct reasoning chains", or "when to double check", or "what innate capabilities an LLM can or can't rely on".

Those are valuable things that Anthropic's own team spent a lot of effort post-training into Claude. Distillation allows them to be extracted and transferred to an otherwise unremarkable base model.

Unremarkable base model will remain an unremarkable fine-tuned model that memorised a couple thousand of input-output pairings.

Ha ha, as if.

Base models have a lot of capabilities - arranged in all the wrong ways for high performance reasoning and problem-solving. The power of fine tuning on "a couple thousand of input-output pairings" is that it can fix some of that. If your pairings are very well chosen, that is.

If that were the case, Anthropic wouldn't be throwing a fit over distillation "attacks".

Why? They often don't make sense. They send DMCA takedowns over materials they can't even copyright, for example. They fessed up to creating shadow libraries that they didn't even use in their training corpus, resulting in the largest copyright settlement ever. Your reasoning is flawed.

Yes, neural networks are famously poor at generalising.

They are poor at generalising from a small number of examples; this is why the real generalisation power is achieved in pre-training.

Can you back up this with hard data and evidence?

Most research converges to the idea that RL on synthetic data makes models worse, not better.

If what you claim was anywhere near that relevant, than we would've long achieved singularity by simply feeding increasingly better output to the training of the next model in a loop. Yet this doesn't work.

25 million turns on Claude output is a small amount, yet an expensive one (we talking hundreds of $ millions) that is better spent on compute.

There's no evidence such a process works, but I'd like to know more if I'm wrong.

> Most research converges to the idea that RL on synthetic data makes models worse, not better.

You are missing a mountain of nuance by generalizing the existence of a hole there.

Back up what? That distilling from a more capable model into a less capable model pulls the student model's capabilities up? What. Why the fuck is this even a question.

Look up literally any distillation works. Because this is just distillation but on one-hot token chains instead of richer logit KL proxies.

And no, I'm not claiming than you can "close the loop" and get RSI on the cheap just by distilling forever. I'm claiming that distillation is a very cheap way to bring the performance of a less capable model closer to that of a more capable model. It doesn't give you "a more capable model" out of thin air.

Which is why Chinese labs rely on Anthropic to provide that "more capable model" to them. They take the capabilities Anthropic trained for the hard way, and train for them the easy way.

It's a "fast follower"/"improved capability density" trick, not a "singularity tomorrow" trick. There are a few "distillation pump" tricks that get closer to what you have in mind, but they're still more about "extract more training signal out of the same set of data" than about "unbounded RSI".

so the way llms work in the first place. training on original research that was acquired the hard way.

Okay, you have no data nor evidence nor a paper backing this claim, it's just speculation.

You want to sell me the idea they are spending hundreds of millions to get unchecked Q/As with reasoning redacted and without checks on the output quality to do what exactly?

Have a shallow pointless bunch of expensive data to get slightly better RL? It's expensive and pointless.

Data has shown again and again that synthetic input/output does not benefit models in RL, it may even make the output worse.

Also, you have a giant bias.

The chinese are the only ones releasing models and research papers in the open from which American labs benefit 24/7 (DeepSeek has been copied by all US providers).

And you want to sell me this ridiculous idea of the giant return of spending hundreds of millions on unredacted pointless QAs?

What the fuck. Are you a literal, honest to god distillation denier? Straight up "wake up sheeple, model distillation isn't real"?

I've seen plenty of things in the dumpsters of AI discourse, but this got to be among the most baffling.

Yes, there are "giant returns" on distilling from a more capable model into a less capable model. And even more so when the more capable model was trained for something you want and lack. Like: better coding performance.

Someone like OpenAI had to RLVR for it the hard way (and if you think "distillation is expensive", wait till you hear how many bits per rollout hardcore RLVR gets you), but you get to peek into the results of their work and copy them for yourself.

Also, Anthropic didn't redact model reasoning until Mythos. OpenAI started with o1, but Claude had reasoning chains accessible for a long time. Which is why Anthropic was more targeted than OpenAI.

So we're meant to believe that only US companies have the intelligence and/or access to manpower to generate their own reasoning data? Does China have a population deficit? Maybe China has too high wages to pay people to generate reasoning data?

The US companies bootstrapped themselves from one model generation to the next, partly by using the previous generation to generate synthetic data, etc, and partly by paying people to hand generate training data for them. Why do you apparently assume that the Chinese can't do the exact same thing?!

Surely "coding performance" is by far the easiest thing to generate your own RLVF data for, since it has trivial verifiable rewards - does the code compile and do what you want.

RLVR is the poster child for model distillation. Because: have you considered just how many tokens does a model have to generate before you can check "does the code compile and do what you want"?

You generate 90000 tokens worth of rollout and get a verifiable reward once. RLVR is fucking expensive! It's worth it, because it often unlocks capability advances that other things don't. But it's still fucking expensive. RLVR eats compute like nothing else.

So, if someone used a lot of RLVR to improve a capability? Just distill from that "someone" and get a similar improvement for a fraction of the price! Then you can do your own RLVR from THAT cheap starting point, if you want to.

"Human domain experts" is a similar niche. Let's say hypothetical "EconomicsAI" hired some $200 per hour human economists to make training data for their "EconGPT" AI. What's cheaper - hiring your own $200 per hour economists, or using a bunch of "$10 per 1M tokens" outputs of EconGPT to bring your own model in line with what EconGPT can do?

Even synthetics can be expensive, because while synthetic tokens themselves are relatively cheap, the applied AI knowledge one needs to make high quality synthetics that improve task performance and don't backfire on you isn't. Again: distillation bypasses a lot of that - by cribbing from the outputs of a model someone has already done that for. Allowing you to get more oomph for cheaper, and spend your R&D effort elsewhere.

Your training cost argument makes no sense. It doesn't matter whether you are using human written code or someone else's LLM generated code to train on - you are going to be RL training on it, so your RL training cost is the same.

There is a data cost argument, especially if you are paying for human generated data, although I'm not sure how applicable that is to coding.

If your claim is so solid, you'll have no problem pointing out data or evidence.

DeepSeek R1 was a famous case - not only it briefly beat then-SOTA on the cheap, it was also released with distilled versions that preserved bulk of the improvements but could be run on higher-end consumer hardware.

And of course Gemma models are said to be distillations of Gemini.

The distillation you're talking about is about cutting the number of weights, it has nothing to do with extracting QAs from another model.

There are multiple stages of training, and the data/compute mix at each are quite different and produce different "layers" of intelligence.

The pretraining stage is the first stage which consists of "next token prediction" on the entire internet, PB of tokens, etc. This is what most people think of when they think of training LLMs, however it produces a "base model" which is not really "intelligent", but rather much like a blurry JPEG of all human language and knowledge. You cannot really talk to such a model; it will simply complete your prompt by producing both sides of the conversation. Note however at some level the training has encoded enough structure through compression that it is able to simulate all sorts of phenomena, from human conversations to code. The great R&D difficulty here is to scale pretraining so that it can proceed smoothly in vast distributed datacenters in a fault-tolerant manner.

The next few stages are collectively called post-training, and typically consist of supervised fine-tuning, then reinforcement learning.

In supervised fine-tuning, the model is further trained to predict the next token, but on a much more focused data set of natural language conversations where the "assistant" and "user" turns are explicitly delineated with special tokens. The output of this stage is a model which is capable of carrying on proper conversations, but typically with no ability to creatively problem-solve, and less of a personality. The data and compute are many orders of magnitude smaller than in pretraining.

The reinforcement learning stage used to be a small part of model training, but ever since AI-assisted coding took off, it has become larger and larger chunk of training. In recent models, the compute spend on RL has allegedly come to rival or even exceed that of pretraining [1], which is a bit scary because RL is classically what lead to sci-fi like AIs which are extremely good at accomplishing goals to the detriment of everything else.

The way that RL works is that you put an instance of your model in some environment (such as a VM containing a git repository) and give it a task (such as fix the linked github issue). The model will then generate a bunch of attempts to solve the task which we call "trajectories", in most cases there is either an objective measure of the task success (such as passing the tests), or a fuzzy measure (such as having another LLM look at the results and provide a score). This is called the reward, and the model will learn slowly by producing trajectories that receive reward. It can actually be quite hard to prevent "reward hacking" from the model here and the rewards must be shaped very carefully, much R&D labor goes into here, as well as similar challenges to distributed pretraining.

A significant challenge is that coding/knowledge work tasks these days are getting extremely difficult, we are far beyond 2024 days where models could barely solve the easiest problems in SWE-bench. Tasks at the frontier now look more like mini projects that would take humans multiple hours or even days to finish (or in some cases, research-style tasks that would be beyond reach for even top human experts, such as the Erdős unit distance problem which was posed in 1946 but wasn't solved until recently, by GPT-5.5). Huge amounts of trajectories must be produced, and huge amounts of them produce zero reward and therefore are useless for learning. Getting a cold start requires running tens of thousands of instances of your model in VMs in parallel for multiple days to produce trajectories, to say nothing of the GPU costs.

So what do you do when you only have a model which is capable of basic conversations but cannot even begin to tackle basic coding tasks, use tools, etc? The approach that companies behind the frontier have decided on is to bootstrap their learning process by having an already extremely intelligent model such as Claude produce hundreds of thousands of seed trajectories for them. Then they can use this data to get a warm start and begin learning immediately. And if you use Claude for your reward model too, you get to skip the nastiness of reward shaping.

Therefore, even if in number of raw tokens the data are much smaller than internet-scale pretraining data, the value that each token provides is far far greater.

[1] For example, Grok 4 compute spend on RL was ~100% of that of pretraining: https://www.interconnects.ai/p/grok-4-an-o3-look-alike-in-se...

props for a great write-up

Actually it's a hit piece.

[deleted]

A description that highlights the importance of RL is a hit-piece?

Training isn’t a single homogeneous step. It starts with pretraining which requires bulk PB of data but you have less quality concerns here. You cover the whole data distribution. Later stages require less and less but increasingly higher quality and complex datasets. The late stage ones are highly curated and might even be sourced from world subject experts. This is where frontier labs with big pockets have the advantage.

Actually nowadays LLMs are only trained with TBs rather than PBs of data, and it's not too hard to find GBs of agent traces online.

This might be like an observational study vs a study with a control?

From what I understand, at this point, the main value of stronger model outputs is simply to bootstrap reasoning behavior during the RL post-training phase. It gets you past the “cold start” problem with RL, after which the outputs aren’t needed anymore. From then on, it’s hill climbing and that requires environments for the model to interact with get rewards from.