A question I've been wondering about, and I'm not sure whether anyone else has been thinking about it: what actually made Fable so effective?
From what I could tell in the very little time I had to interact with it, its instruction following seemed more consistent.
The other thing that comes to mind is that a lot of people commented on how driven it was, so I'm wondering whether figuring out how to keep existing models looping on task might actually be quite a big shift in capability.
Probably just a bigger version of Opus if I had to wager, and Opus is just a bigger version of Sonnet. Maybe some small architectural differences baking in an additional few months of ablation studies/research. But the fundamental driver is new pretrain with larger size. Probably corresponding to when some new generation of GPUs/new datacenter came online rather than any major qualitative breakthrough.
Hints: They created a new label instead of version bumping Opus, they didn't deprecate Opus, and it costs more per token.
Fable had mostly the same pre-training data as Opus, and it's likely they're distilled from the same source. The difference is that it's a larger model with more post training on "dangerous" stuff they didn't want in the core model, and "long" task RL.
> it's likely they're distilled from the same source
Any credible references for this? The implication that Anthropic has an even bigger and better model that they haven't released is hard to believe.
Lab folks keep cards close to their chests here, but it's likely Mythos was an earlier teacher model for Opus that got additional cybersec post-training. Whether they have a bigger tier than that is hard to say; labs have been cautiously scaling parameters since the failure of GPT4.1. They 100% have a bigger/better model they haven't released, but that's probably more down to it not being done cooking yet. Once it's done, the single larger model lets them drop new Opus and Mythos iterations in rapid succession.
Googlers have hinted that Gemini 3 came in at 10T, which seems hard to operationalize. Google's flash and pro releases are staggered in a way that doesn't make sense if flash is a pro distill, and there are enough cases where Gemini flash outperforms pro on the same task that I think it's unlikely it's just being distilled from an "in progress" version of pro.
Appreciate the long answer. Why is it more likely that Gemini 3 Pro/Flash/Lite are distillations of the same parent model than that they’re different training runs on the same dataset, with minor version bumps being different post-training setups?
The biggest tell is the fact that labs are staggering smaller model releases so much with big models. If the small models (flash, sonnet/haiku) were being distilled from pro models, you'd consistently see them be released fairly soon after new pro releases to maximize their competitiveness (and this was the case early on for Anthropic). Instead it seems like releases are timed to build/maintain hype.
A thing to keep in mind is that if they release a smaller model halfway between well spaced big model releases, why wait so long on the next big model release if it's sufficiently ready to distill to a smaller model? The ability to demonstrate AI superiority is worth a ton, there's no reason to hold back.
The big AI labs are also accumulating huge datasets of expert work in a wide range of fields, which are very expensive to re-create. It seems pretty plausible that this gives them a big advantage that is compounded by their larger training runs and larger models.
This is definitely a differentiator. However, I'm honestly not sure whether it materially improves intelligence as opposed to one-shot capability.