> If anything these models should be compelled to be public since they have been trained off public data
I'm starting to come around to this idea TBH. For a while my position was: "these companies have invested billions into training these models, therefore they should be able to control them and profit off them" but looking deeper at where they got their training data, my view is starting to shift.
IMHO we need new laws around AI, specifically training data. Something like: "you can train an AI model and ignore copyright laws, BUT you must then make the model open weight". A company could still develop closed-weight models, but then it must acquire permission to use the training data.
But it gets murky because if something like that was on the books then AI labs would just train open weight models and then distill them into their closed weight models.
Labs each invest billions of dollars a year in private data, and that number is growing. Internet training data is not where frontier capabilities come from; this view is outdated.
This is a misleading statement. The "private data" is still largely publicly produced data that has been curated through private agreements instead of scraping, such as Reddit posts/comments (these are the "third-party data agreements" that companies like OpenAI mention). And yes, there is still a lot of processing done on this data, which is the norm for preparing training data.
This is doubly misleading. A lot of private data is sourced through providers like Mercor, who pay experts to answer questions and write out their reasoning (e.g. paying a software engineer to write a project from scratch and recording every keystroke, paying a Chem PhD to answer hard Chem questions, etc.). A second source of private data comes from custom RL environments with fine-grained intermediate rewards for e.g. software engineering, financial modeling, etc. Also, imagine the amount of usage data recorded by Claude Code, etc. Pretraining is mostly curated public data; post-training is increasingly private expert data and tests.
Source: Work at a lab, common knowledge.
Well, since you work at a lab you should know that most capabilities arise in pretraining, not post-training or mid-training, and the latter two mostly function to bring out the hidden intelligence in these models more than anything else.
Source: also work at a lab.
No, it isn't. The private data is largely private data, created by highly specialized, highly paid contracted teams of experts for domains like finance, SWE, consulting, etc.
Reddit data is just not that interesting, that deal is worth like $60m/year. Labs spend 10x as much on computer-use RL environments.
Sorry but your argument doesn't seem coherent: How is the cost of RL relevant here?
It would also help if you could substantiate your initial claim (i.e. "internet training data is not where frontier capabilities come from").
The RL environment (instruction, stateful container, reward function) is the training data product being bought.
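For anyone unfamiliar with the term, here is a rough toy sketch of that triple in Python. Everything in it is hypothetical and made up for illustration (`Container`, `RLEnvironment`, `reward_hello`, the in-memory "filesystem"); real environments presumably use actual sandboxes and far richer reward logic, and this is not any lab's or vendor's format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Container:
    """Toy stand-in for a stateful sandbox: just an in-memory 'filesystem'."""
    files: Dict[str, str] = field(default_factory=dict)

    def write(self, path: str, content: str) -> None:
        # State persists across actions, which is what makes it "stateful".
        self.files[path] = content


@dataclass
class RLEnvironment:
    """Hypothetical bundle: instruction + stateful container + reward function."""
    instruction: str
    container: Container
    reward_fn: Callable[[Container], float]

    def score(self) -> float:
        # The reward is computed from the container's current state.
        return self.reward_fn(self.container)


def reward_hello(c: Container) -> float:
    """Return 1.0 if hello.txt holds exactly 'hello' (ignoring whitespace), else 0.0."""
    return 1.0 if c.files.get("hello.txt", "").strip() == "hello" else 0.0


env = RLEnvironment(
    instruction="Create hello.txt containing the word 'hello'.",
    container=Container(),
    reward_fn=reward_hello,
)

before = env.score()                         # nothing written yet
env.container.write("hello.txt", "hello\n")  # stand-in for a model's action
after = env.score()                          # state now satisfies the task
```

The point of the sketch is only that the reward is computed from the state the model leaves behind, not from matching a reference text, which is why the environment itself is the thing being sold.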
Why are the leading models capable of regurgitating full copyrighted works such as "Harry Potter" and "On the Road"? Did they hire someone to type those out for them?
https://arxiv.org/abs/2601.02671
When did they start doing so? We all know that they DID train on all the available public information, so at what point did they stop? Is the public information still in the training set? If so, they should STILL release ALL the data as public, as they are including training data that was acquired without permission.
They haven't stopped. I honestly don't understand how they ever could.
> internet training data is not where frontier capabilities come from
In that case, it should be no problem for the labs to train their new models without using public data, right?
Then it should be simple for one of the frontier labs to produce a model trained only on private data. We haven't seen that.
Didn't the famous "Textbooks are all you need" paper already prove that point three years ago?
Sure, we ask a lot more of modern models, but private training data also got a lot better. You would lose out on a lot of long-tail knowledge, but that can be fixed with web search tools. You'd limit the styles, dialects and colloquial phrases the model understands and can use, but for many use cases that would be fine.
But why would any frontier lab do that? Throwing in more training data still leads to better results in pretraining. And showing that they don't need to hoover up the internet and Anna's Archive only empowers regulators to prevent them from doing that.
Maybe I am missing your point, but "Textbooks are all you need" distilled from GPT-3.5.
> internet training data is not where frontier capabilities come from
We 100% would not be at the current progress without it, though. And it's not like they only train on this once. They keep training on all the internet data PLUS the private data. Private data only (probably) wouldn't work, as learning the base regularities of language takes a lot of weights.
Define "come from". Could they have gotten those frontier capabilities, or any capabilities, without internet training data? It seems to me that without the private data, you might get a slightly less competitive model, but without the CommonCrawl-style data piles used in "pretraining", you get no model at all.
Even accepting the copying-as-theft framing, if I go to a village, steal some vegetables from everyone's gardens and ham from their sheds, and then add some prohibitively expensive spices I bought myself to make soup, do I get to claim it as mine and punish the villagers for trying to take it?
Great way to launder illegally obtained data too.
Does this private data come from places like Reddit, Twitter, etc., where it’s contributed by users? I think it is unethical for these companies to accept payment for user-contributed data.
Okay, that's fine, then make the law say they must provide publicly owned models built off of publicly obtained data. To think that such a baseline of critical information isn't the literal foundation of everything they will do, both now and in the future, is just exposing what their end game is: control.
There's no reason not to, other than the poor little billion-dollar corporations not wanting to provide a public utility they stole from the public.
Anything that removes control from American big tech is a good thing for American citizens and the world at large.
No, you're talking about fine tuning and most of it is coming from your customers or someone else's. Get off ya high horse.
Copyright needs abolishing.
Companies can't be trusted with society's need for open progress.
The frontier labs are not "fine-tuning", they're doing massive-scale RL post-training.
I'm not taking sides here but this situation is not so black and white and it has always been the darker side of capitalism.
The concept of intellectual property exists not because it's fair but because it creates an incentive to make said "intellectual property" exist. If intellectual property can be instantly copied by a competitor... why would I spend a dime to even create such a thing? I want to profit off of what I make because I'm a capitalist and money is what drives me (as a capitalist).
Anthropic's models wouldn't exist if the company couldn't keep an unholy grip on them. Same with OpenAI. Same with many life-saving drugs.
Of course everyone here is talking about the obvious stuff, like how it's morally wrong to withhold life-saving drugs or to have AI literally take over the world and be under the control of one company, and all of this is true. But it is also true that greed is the engine that drives our economy, and if you want our economy to produce "intellectual property" you must allow people to "capitalize" on that greed.
There are two controversial issues here. What is moral/fair? And what is realistically practical in optimizing the economy if said economy is based on money?
Distillation, in my mind, is a win for practicality because competition also drives our economic engine. You don't want a monopoly, but you also don't want these models to be so damn open that there's zero incentive to make them.
That intellectual property argument goes both ways. The model might not exist without protection, but it also would not exist without the data.
This perfectly explains why current LLMs should be illegal in an actual capitalist market.
Why should anyone publish anything if it can be stolen with impunity? Is the value of these LLMs even remotely close to the amount of value they stole and the amount of value they will detract from the economy because people will be more hesitant to publish anything now?