Why does Microsoft keep releasing models trained on synthetic data? Is it possible their contract with OpenAI won't let them do anything else?
I would think Microsoft, of all companies, would want to be working on their own LLM behind the scenes, even if they're relying on OpenAI for the bulk of their work.
Meta seems to be the only US company releasing big 'open source' models, while Chinese companies continue to release many completely open source LLMs.
I don’t think their contract strictly forbids it. I think they’re just trying not to “waste” resources building yet another expensive foundation model. That said, a lot of the big flagship models are also heavily trained (or post-trained) on synthetic data, and Microsoft has done a lot of application-specific fine-tuning research.
It makes sense for this model in particular to be synthetic, though: it’s explicitly trained to control a computer, and I doubt there’s enough public training data for that use case.
I suspect that Chinese models are largely forced to open source as a trust-building step because of general China-phobia in the west. There are tons of stellar LLMs available from major US companies if you’re just using an API, so open weights are also a convenient marketing and differentiation opportunity. Some of the companies behind the bigger “agentic” models have started to offer a cheap subscription alternative to US companies. If they build up a big enough business, I wouldn’t be surprised if they stop open sourcing altogether.
> I suspect that Chinese models are largely forced to open source as a trust building step because of general China-phobia in the west.
The obvious bias of the models, when it comes to Chinese politics and history, certainly does not help here.
> I suspect that Chinese models are largely forced to open source as a trust building step because of general China-phobia in the west.
They're late to the game, so they're pressuring Western competitors on price, taking advantage of their lower costs while they catch up. Now they're well positioned to lead on the next front: robotics.
The attorneys said so. This is why progress happens in startups and gets bought by the big boys. They’re constitutionally incapable of innovation.
Yeah, well in this case it would be a feature rather than a bug to be squeamish about outright theft and repackaging of copyrighted material for profit. If only that squeamishness extended to their acquisitions too...
Depends on how you define big, but Gemma, Phi, OLMo, Mistral, and GPT-OSS are all competitive and can run on commodity hardware.
It is just much more efficient to train on synthetic data. When you train on real data, the only signal is the identity of the next token. With synthetic data generated by a teacher model, you can train against the full probability distribution over the next token; this results in a multiplier effect, and sometimes the effect is dramatic.
[1] https://arxiv.org/pdf/2504.14772v1
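A toy sketch of the difference the comment above describes, in plain Python (the numbers and the 4-token vocabulary are made up for illustration): with real text the training target is a one-hot "hard" label for the single observed next token, while distillation from a teacher model supplies a "soft" target distribution, which carries information about every token's relative likelihood in one example.

```python
import math

def cross_entropy(target_probs, predicted_probs):
    """H(p, q) = -sum_i p_i * log q_i, the standard training loss."""
    return -sum(p * math.log(q)
                for p, q in zip(target_probs, predicted_probs)
                if p > 0)

# Toy 4-token vocabulary; the student model's current prediction:
predicted = [0.70, 0.15, 0.10, 0.05]

# Hard label from real text: only the observed token counts.
hard_target = [1.0, 0.0, 0.0, 0.0]

# Soft label from a teacher model: the full next-token distribution.
soft_target = [0.60, 0.25, 0.10, 0.05]

# The hard loss only "sees" the predicted probability of token 0;
# the soft loss grades the prediction on every token at once.
hard_loss = cross_entropy(hard_target, predicted)
soft_loss = cross_entropy(soft_target, predicted)
```

The hard-label loss reduces to `-log(0.70)` regardless of how the remaining mass is spread, whereas the soft-label loss also penalizes misallocating probability among the other three tokens — that extra per-example signal is the "multiplier" the comment refers to.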
My guess is that it is safer for them to use synthetic data only: they have less to worry about people using the models for erotic roleplay and the like.
It's a cost- and time-saving measure. Human labeling is hard to scale and slow. With synthetic data they can train faster and cheaper, speeding up the pace at which they produce new models and run experiments with new types of models. Grok is doing similar things. It's smart.
> Why does Microsoft keep releasing models trained on synthetic data?
Why not? That's the way to go. In some domains the only way to go.
Perhaps they want to be able to run them on mobile hardware they release?
I can definitely see them wanting models that can run locally on Windows computers or Surface tablets, although their focus seems to be sticking Copilot into absolutely anything and everything possible. But why synthetic-data models? Other companies have made small-parameter models, but they don't seem to keep them up to date (correct me if I'm wrong).
They're not very skilled