No sign of what source material it was trained on though right? So open weight rather than reproducible from source.
I remember there's a project "Open R1" that last I checked was working on gathering their own list of training material, looks active but not sure how far along they've gotten:
Isn't it basically impossible for the input data set to be listed? It's an open secret that all these labs are using immense amounts of copyrighted material.
There are a few efforts at fully open data / open weights / open code models, but none of them have reached leading-edge performance.
My brain was largely trained using immense amounts of copyrighted material as well. Some of it I can even regurgitate almost exactly. I could list the names of many of the copyrighted works I have read/watched/listened to. I suppose my brain isn't open source, although I don't think it would currently be illegal, if the technology existed, to take a snapshot of my brain, publish it, and open-source that. Granted, this would only be "reproducible" from source if you define the "source" as "my brain" rather than all of the material I consumed to make that snapshot.
:-) I like the symmetry of this. If I want to keep my creations outside the hands of others, I can keep them private. I don’t have to publish these words or broadcast them to the world. I could write this on my laptop, save it in a file, and keep it to myself. Fine.
However, once these words are broadcast—once they’re read, and the ideas expressed here enter someone else’s mind—I believe it’s only fair that the person on the receiving end has the right to use, replicate, or create something from them. After all, they lent me their brain—ideas that originated in my mind now live in theirs.
This uses up their mental "meat space," their blood sugar, and their oxygen—resources they provide. So, they have rights too: the right to do as they please with those ideas, including creating any and all data derived from them. Denying them that right feels churlish, as if it isn’t the most natural thing in the world.
(Before people jump on me: yes, creators need to be compensated—they deserve to make a living from their work. But this doesn't extend to their grandchildren. Copyright laws should incentivize creation, not provide luxury for the descendants of the original creator a century later.)
> Some of it I can even regurgitate almost exactly
If you (or any human) violate copyright law, legal redress can be sought. The amount of damage you can do is limited because there's only one of you vs the marginal cost of duplicating AI instances.
There are many other differences between humans and AI, in terms of capabilities and motivations, that matter to the legal persons making decisions.
You may be right about the damage (will not dispute it even if I personally doubt it) - but what about the amount of good that it can do too? When deciding "what is to be done now" under uncertainty, we typically look at both sides of the ledger, the upsides in addition to the downsides.
Assume for a moment that current AI is teaching us that compute transforming data → information → knowledge → intelligence → agency → ... → AGI → ASI is all there is to Intelligence-on-Tap. And imagine an AI path opens to AGI now and ASI later, where previously we didn't see any. Seems a bad deal to me, to frustrate, slow down, or even forego the 2050s Intelligence Revolution that may multiply total human wealth by a factor of 10 to 20 in value, the way the Industrial Revolution did in the 1800s. And we are to forego this for what - so that we provide UBI to Disney shareholders? Every one of us is richer, better off now, than any king of old. Not too long ago, even the most powerful person in the land could not prevent 17 miscarriages/stillbirths/child deaths from leaving them without an heir to ascend the throne (a top priority, for sure, for a king and queen). So in our imagined utopia, even the Disney shareholders are better off than they would be otherwise.
> Seems a bad deal to me, to frustrate, slow down, or even forego the 2050s Intelligence Revolution that may multiply total human wealth by a factor of 10 to 20 in value...
Why do you assume the emergence of a super intelligence would result in human wealth increasing instead of decreasing? Looking at how humans with superior technology used it to exploit fellow humans throughout history should give you pause. Humans don't care about the aggregate "dog wealth" - let alone that of ants.
I'm assuming the Intelligence Revolution, multiplying Human Intelligence with machines, will have the same effect as the Industrial Revolution had on multiplying human physical strength. That multiplied GDP by a factor of ~20, hockey-stick like, in a fairly short time, a century or two.
The Industrial Revolution was powered by natural resources that it helped unlock. What value reserve will AI tap into to create hockey-stick growth?
It will recombine the existing resources in new ways. Neanderthals had access to exactly the same natural resources as we have now. Obviously we do much more with what we both got than they ever did. Obviously it's not only the availability of some atoms or molecules, but what one does with them, how one recombines them in novel ways. For that one needs knowledge and energy. And the latter, it turns out, can mostly be derived from the former too.
Obviously it's what we do with them; the biotech manufacturing and nuclear power production revolutions happened pre-AI. The reason they haven't replaced petroleum is economic and social.
> The amount of damage you can do is limited because there's only one of you vs the marginal cost of duplicating AI instances
But enough about whether it should be legal to own a Xerox machine. It's what you do with the machine that matters.
> It's what you do with the machine that matters.
The capabilities of a machine matter a lot under law. See current US gun legislation[1], or laws banning export of dual-use technology for examples of laws that have inherent capabilities - not just the use of the thing- as core considerations.
1. It's illegal to possess a new automatic weapon, with grandfathering for those registered prior to 1986.
While true, computers in general already had the ability to perfectly replicate data, hence the blank media tax: https://en.wikipedia.org/wiki/Private_copying_levy
I think the reason for all the current confusion is that we previously had two very distinct groups of "mind" and "mindless"*, and that led to a lot of freedom for everyone to learn a completely different separation hyperplane between the categories, and AI is now far enough into the middle that for some of us it's on one side and for others of us it's on the other.
* and various other pairs that are no longer synonyms but they used to be; so also "person" vs. "thing", though currently only very few actually think of AI as person-like
Yes, but gun control and dual-use export regulations are both stupid. We need fewer tool-blaming laws, not more.
(Standing by for the inevitable even-goofier analogy comparing AI with privately-owned nuclear arsenals...)
The only way this would work is with "leaks". But even then, as we saw with everything on the internet, it just added another guardrail on content. Now I can't watch YouTube videos without logging in, and on nearly every website I need to solve some weird captchas. It's becoming easier to interact with these chatbots rather than search for a solution online. And I wonder, with Veo 4 copycats, whether it might even be easier to prompt for a video rather than search for one.
That doesn't mean it isn't possible.
“Not possible” = “a business-destroying level of honesty”?
Even if training on the copyrighted material is OK, just providing a data dump of it almost certainly is not.
No need for a data dump, just list all URLs or whatever else of their training data sources. Afaik that's how the LAION training dataset was published.
Providing a large list of bitrotted URLs and titles of books which the user should OCR themselves before attempting to reproduce the model doesn't seem very useful.
Aren't the datasets mostly shared in torrents? They probably won't bitrot for some time.
...no? They also use web crawlers.
The datasets are collected using web crawlers, but that doesn’t tell us anything about how they are stored and re-distributed, right?
Why would you store the data after training?
Are you saying that you know they don’t store the data after training?
I’d just assume they did because—why scrape again if you want to train a new model? But if you know otherwise, I’m not tied to this idea.
I'm also assuming. But I would ask the opposite question: why store all that data if you'll have to scrape again anyway?
You will have to scrape again because you want the next AI to get trained on updated data. And, even at the scale needed to train an LLM, storing all of the text on the entire known internet is a very non-trivial task!
If you try to reproduce various open datasets like fineweb by scraping the pages again, you can't, because a lot of the pages no longer exist. That's why you would prefer to store them instead of losing the content forever.
It's not "all of the text", it's like less than 100 trillion tokens, which means less than 400TB assuming you don't bother to run the token streams through a general purpose compression algorithm before storing them.
There is a "keep doing what you're doing, as we would want one of our companies to be on top of the AI race" signal from the governments. It could've been stopped, maybe, 5 years ago. But now we're way past it, so nobody cares about these sort of arguments.
> No sign of what source material it was trained on though right?
out of curiosity, does anyone do anything "useful" with that knowledge? it's not like people can just randomly train models..
When you're truly open source, you can make things like this:
Today we introduce OLMoTrace, a one-of-a-kind feature in the Ai2 Playground that lets you trace the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace is a manifestation of Ai2’s commitment to an open ecosystem – open models, open data, and beyond.
https://allenai.org/blog/olmotrace
You can do the same, except you would need to be a pirate website. It would even be better - except illegal. But it would be better.
That is why the others can't provide stuff like this. RAG/Hallucination check. I just wish Allen.AI models had bigger context, 4k is too small nowadays.
Would be useful for answering "is this novel or was it in the training data", but that's not typically what the point of open source is
If labs provided the corpus and source code for training their tokenizers, it would be a lot easier to produce results about tokenizers. As it is, they provide neither, so it is impossible to compare different algorithms running on the same data if you also want to include the vocabs that are commonly used.
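For what it's worth, these comparisons become straightforward once a corpus is published. A rough sketch using the Hugging Face tokenizers library; corpus.txt and heldout.txt are hypothetical stand-ins for a released training corpus and an evaluation sample:

    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    def train_bpe(path: str, vocab_size: int) -> Tokenizer:
        # Train a byte-level BPE tokenizer on the published corpus.
        tok = Tokenizer(models.BPE(unk_token="[UNK]"))
        tok.pre_tokenizer = pre_tokenizers.ByteLevel()
        trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
        tok.train([path], trainer)
        return tok

    sample = open("heldout.txt").read()
    for size in (32_000, 64_000):
        tok = train_bpe("corpus.txt", size)
        n_tokens = len(tok.encode(sample).ids)
        # Characters per token on held-out text: a crude compression comparison.
        print(size, len(sample) / n_tokens)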
Many are speculating it was trained on o1/o3 outputs for some of the initial reasoning.
Are there any widely used models that publish this? If not, then no I guess.
Depending on how you use "randomly", they absolutely can..?
Based on the commit history, Open R1 is still active and they're still making progress. Long may it continue; it's an ambitious project.
This was simply a mad scramble to prove/disprove the claims OpenAI was peddling that the model wasn't actually performing as well as advertised and that they were lying about the training/compute resources. Open-R1 has since applied the training to a similar 7B model and got similar results. At the end of the day, no one really cares what data it was trained on, and most AI providers don't always share this either when releasing open-source models, and it's certainly not available for closed-source models.
I don't think people make the distinction like that. The open-source vs non-open-source distinction usually boils down to whether you can use it for commercial purposes.
What you're saying is just that it's non-reproducible, which is a completely valid but separate issue.
There are already established terms and licenses for non-commercial use, like "open weights".
Open source has the word "source" in it for a reason, and those models ain't open source and have nothing to do with it.
Took me until this thread to remember that in the 90s we had "freeware".
But where's the source? I just see a binary blob, what makes it open source?
The weights are the source. It isn't as though something was compiled into weights; they're trained directly. But I know what you mean, it would be more open to have the training pipeline and source dataset available.
The weights seem much more like a binary to me, the training pipeline the compiler, and the training dataset the source.
Come here to write this - perfect analogy!
It's a very imperfect analogy though; these things can't be rebuilt "from scratch" like a program, and the training process doesn't seem to be replicable anyway. Nonetheless, full data disclosure is necessary, according to the result of the years-long consultation led by the Open Source Initiative: https://opensource.org/ai
> the training process doesn't seem to be replicable anyway
The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.
If you're speaking about the computational cost, it used to be that way for compilers too. Give it 20 years and you'll be able to train one of today's models on your phone.
> The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.
No it is not. The training process is non-deterministic: given exactly the same data, the same code and the same seeds, you'll get different weights. Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using (e.g. you'll get different results on CPU, on a GPU from vendor #1 and on a GPU from vendor #2, probably on different GPUs from the same vendor, and on different CUDA versions, etc.). You'll also get different results depending on the dimensions of the matrices (e.g. if you fuse the QKV weights from modern transformers into a single matrix and do a single multiplication instead of multiplying each separately, you'll get different results). And some algorithms (e.g. the backwards pass of Flash Attention) are explicitly non-deterministic to be faster.
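A tiny self-contained illustration of that order dependence (plain Python floats; at scale, different hardware, kernel fusion and parallel reduction order change the summation order in exactly this way):

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0 -- the 1.0 is lost below the precision of 1e16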
> Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using
That has everything to do with implementation, and nothing to do with algorithm. There is an important difference.
Math is deterministic. The way [random chip] implements floating point operations may not be.
Lots of scientific software has the ability to use IEEE-754 floats for speed or to flip a switch for arbitrary precision calculations. The calculation being performed remains the same.
> Math is deterministic.
The point is none of these models are trained with pure "math". It doesn't matter that you can describe a theoretical training process using a set of deterministic equations, because in practice it doesn't work that way. Your claim that "the training process is fully deterministic" is objectively wrong in this case because none of the non-toy models use (nor can they practically use) such a deterministic process. There is a training process which is deterministic, but no one uses it (for good reasons).
If you had infinite budget, exactly the same code, the same training data, and even the same hardware you would not be able to reproduce the weights of Deepseek R1, because it wasn't trained using a deterministic process.
A lot of quibbling here, wasn't sure where to reply. If you've built any models in PyTorch, then you know. Conceptually it is deterministic: a model trained using deterministic implementations of low-level algorithms will produce deterministic results. And when you are optimizing the pipeline, it is common to do just that:
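Something along these lines - a hedged sketch of what "deterministic mode" usually means in PyTorch, not necessarily the exact settings the commenter had in mind; the flags vary by version and hardware:

    import os, random
    import numpy as np
    import torch

    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some deterministic CUDA kernels
    random.seed(0)
    np.random.seed(0)
    torch.manual_seed(0)
    torch.use_deterministic_algorithms(True)   # raise an error if a kernel has no deterministic variant
    torch.backends.cudnn.benchmark = False     # don't let autotuning pick different kernels per run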
But in practice that is too slow; we use nondeterministic implementations that run fast and loose with memory management and don't necessarily care about the order in which parallel operations return.

I'm pretty sure the initial weights are randomized, meaning no two models will train in the same way twice. The order in which you feed in training data to the model would also add an element of randomness. Model training is closer to growing a plant than running a compiler.
That's still a deterministic algorithm. The random data and the order of feeding training data into it are part of the data which determines the output. Again, if you do it twice the same way, you'll get the same output.
If they saved the initial randomized model and released it and there was no random bit flipping during copying, then possibly but it would still be difficult when you factor in the RLHF that comes about through random humans interacting with the model to tweak its workings. If you preserved that data as well, and got all of the initial training correct... maybe. But I'd bet against it.
So long as the data provided was identical, and sources of error like floating point errors due to hardware implementation details are accounted for, I see no reason output wouldn't be identical.
Where would other non-determinism come from?
I'm open to there being another source. I'd just like to know what it would be. I haven't found one yet.
> if you do it twice the same way, you'll get the same output
Point at the science that says that, please. Current scientific knowledge doesn't agree with you.
> Current scientific knowledge doesn't agree with you.
I'd love a citation. So far you haven't even suggested a possible source for this non-determinism you claim exists.
What makes models non-deterministic isn't the training algorithm, but the initial weights being random.
Training is reproducible only if, besides the pipeline and data, you also start from the same random weights.
That would fall under "Feed the same data in and you'll get the same weights out." Lots of deterministic algorithms use a random seed.
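A minimal sketch of that point, assuming PyTorch; the helper name is just for illustration:

    import torch

    def init_weights(seed: int) -> torch.Tensor:
        # "Random" initialization is just a PRNG draw: the seed fully determines it.
        gen = torch.Generator().manual_seed(seed)
        return torch.randn(4, 4, generator=gen)

    print(torch.equal(init_weights(42), init_weights(42)))   # True: same seed, same weights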
So is there no "introduce randomness" step afterwards? If not, I would guess these models would get stuck in a local maximum.
> If not, I would guess these models would be getting stuck in a local maxima
It sounds like you're referring to something like simulated annealing. Using that as an example, the fundamental requirement is to introduce arbitrary, uncorrelated steps -- there's no requirement that the steps be random, and the only potential advantage of using a random source is that it provides independence (lack of correlation) inherently; but in exchange, it makes testing and reproduction much harder. Basically every use of simulated annealing or similar I've run into uses pseudorandom numbers for this reason.
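A toy sketch of that point: simulated annealing driven by a seeded PRNG, so the "random" exploration is fully reproducible run to run (the objective function here is made up):

    import math, random

    def f(x: float) -> float:
        return x * x + 3 * math.sin(5 * x)   # bumpy objective with many local minima

    def anneal(seed: int, steps: int = 10_000) -> float:
        rng = random.Random(seed)            # pseudorandom: arbitrary but reproducible steps
        x = best = rng.uniform(-10, 10)
        temp = 1.0
        for _ in range(steps):
            cand = x + rng.gauss(0, temp)
            # Accept improvements always, and worse moves with a temperature-dependent probability.
            if f(cand) < f(x) or rng.random() < math.exp((f(x) - f(cand)) / temp):
                x = cand
            best = min(best, x, key=f)
            temp *= 0.999
        return best

    print(anneal(0) == anneal(0))   # True: same seed, same trajectory, same result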
Can you point at the research that says that the training process of a LLM at least the size of OLMo or Pythia is deterministic?
Can you point to something that says it's not? The only source of non-determinism I've read of affecting LLM training is floating point error which is well understood and worked around easily enough.
Search more, there is a lot of literature discussing how hard the problem of reproducibility of GenAI/LLMs/Deep Learning is, how far we are from solving it for trivial/small models (let alone for beasts the size of the most powerful ones) and even how pointless the whole exercise is.
If there's a lot, then it should be easy for you to link an example right? One that points toward something other than floating point error.
There simply aren't that many sources of non-determinism in a modern computer.
Though I'll grant that if you've engineered your codebase for speed and not for determinism, error can creep in via floating point error, sloppy ordering of operations, etc. These are not unavoidable implementation details, however. CAD kernels and other scientific software do it every day.
When you boil down what's actually happening during training, it's just a bunch of matrix math. And math is highly repeatable. Size of the matrix has nothing to do with it.
I have little doubt that some implementations aren't deterministic, due to software engineering choices as discussed above. But the algorithms absolutely are. Claiming otherwise seems equivalent to claiming that 2 + 2 can sometimes equal 5.
> I have little doubt that some implementations aren't deterministic
Not some of them; ALL OF THEM. Engineering training pipelines for absolute determinism would be, quite frankly, extremely dumb, so no one does it. When you need millions of dollars worth of compute to train a non-toy model, are you going to double or triple your cost just so that the process is deterministic, without actually making the end result perform any better?
Depends on how much you value repeatability in testing, and how much compute you have. It's a choice which has been made often in the history of computer science.
The cost of adaptive precision floats can be negligible depending on application. One example I'm familiar with from geometry processing: https://www.cs.cmu.edu/~quake/robust.html
Integer math often carries no performance penalty compared to floating point.
I guess my takeaway from this conversation is that there's a market for fast high-precision math techniques in the AI field.
You can fine-tune their weights and release your own take.
E.g. see all the specialized third-party models out there based on Qwen.
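A hedged sketch of what that looks like in practice: attaching LoRA adapters to an open-weights base model with the Hugging Face transformers and peft libraries. The model name, target modules, and output path are placeholders, not recommendations:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "Qwen/Qwen2.5-7B"                       # placeholder open-weights base model
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)            # only the small adapter matrices will train

    model.print_trainable_parameters()             # typically a fraction of a percent of the weights
    # ...run your own training loop on your data, then publish just the adapter:
    # model.save_pretrained("my-specialized-take")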
"Open-source" is the wrong word here, what they mean is "you can modify and redistribute these weights".
You can also reverse engineer and modify closed-source programs (see mods for games). Weights are like a compiled version of the source data.
Finetuning isn't reverse engineering. Finetuning is a standard supported workflow for these models.
Also, the "redistribute" part is key here.
> Finetuning isn't reverse engineering
Fully agree, it isn't. Reverse engineering isn't necessary for modifying compiled program behaviour, so comparing it to finetuning is not applicable. Finetuning applied to the program domain would be more like adding plugins or patching in some compiled routines. Reverse engineering applied to models would be like extracting source documents from weights.
> Finetuning is a standard supported workflow for these models.
Yes, so is adding mods for some games: just put your files in a designated folder and the game automatically picks them up and does the required modifications.
> Also, the "redistribute" part is key here.
It is not. Redistributability and being open source are orthogonal. You can have the source for a program and not be able to redistribute the source or the program, or you can redistribute a compiled program but not have its source (freeware).
Not legally. That's the difference.
Sure you can. It's often legally protected activity. You're just limited to distributing your modifications without the original work.
For some games maybe, but software often has a clause forbidding reverse engineering
ChatGPT says that such clauses are typically void in the EU, though they may apply in some cases in the US. Even in the US, the triennial DMCA rule-making has granted broader exemptions for good-faith security research every cycle since 2016.
https://chatgpt.com/share/6838c070-705c-8005-9a88-83c9a5550a...
There is work to try to reproduce (the original) R1: https://huggingface.co/open-r1
I wouldn't call it a "binary blob". Safetensors is just a simple format for storing tensors safely: https://huggingface.co/docs/safetensors/index