Came here to write this - perfect analogy!

It's a very imperfect analogy though: these things can't be rebuilt "from scratch" like a program, and the training process doesn't seem to be replicable anyway. Nonetheless, full data disclosure is necessary, according to the result of the years-long consultation led by the Open Source Initiative https://opensource.org/ai

> the training process doesn't seem to be replicable anyway

The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.

If you're speaking about the computational cost, it used to be that way for compilers too. Give it 20 years and you'll be able to train one of today's models on your phone.

> The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.

No, it is not. The training process is non-deterministic: given exactly the same data, the same code and the same seeds, you'll get different weights. Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using (e.g. you'll get different results on CPU, on GPU from vendor #1 and on GPU from vendor #2, probably on different GPUs from the same vendor, and on different CUDA versions, etc.). You'll also get different results depending on the dimensions of the matrices (e.g. if you fuse the QKV weights of a modern transformer into a single matrix and do a single multiplication instead of multiplying each separately, you'll get different results). And some algorithms (e.g. the backward pass of Flash Attention) are explicitly non-deterministic to be faster.
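
A minimal sketch of the underlying mechanism (plain NumPy, nothing specific to any training stack): accumulate the same dot product in a different order, as different hardware and different kernels do, and the float32 result shifts slightly:

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal(4096).astype(np.float32)
    b = rng.standard_normal(4096).astype(np.float32)

    # Accumulate the same products front-to-back and back-to-front.
    forward = np.float32(0.0)
    backward = np.float32(0.0)
    for i in range(len(a)):
        forward += a[i] * b[i]
    for i in reversed(range(len(a))):
        backward += a[i] * b[i]

    print(forward == backward)       # typically False
    print(abs(forward - backward))   # tiny, but it compounds over a training run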

> Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using

That has everything to do with implementation, and nothing to do with algorithm. There is an important difference.

Math is deterministic. The way [random chip] implements floating point operations may not be.

Lots of scientific software has the ability to use IEEE-754 floats for speed or to flip a switch for arbitrary precision calculations. The calculation being performed remains the same.
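
A minimal illustration of that switch, using Python's built-in Fraction as a stand-in for arbitrary precision (my example, not from any particular package): the calculation is the same, but the order-dependence disappears once the arithmetic is exact:

    from fractions import Fraction

    # IEEE-754 doubles: addition is not associative, so order matters.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False

    # Exact rational arithmetic: the same calculation, order-independent.
    a, b, c = Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)
    print((a + b) + c == a + (b + c))               # True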

> Math is deterministic.

The point is that none of these models are trained with pure "math". It doesn't matter that you can describe a theoretical training process using a set of deterministic equations, because in practice it doesn't work that way. Your claim that "the training process is fully deterministic" is objectively wrong in this case, because none of the non-toy models use (nor can they practically use) such a deterministic process. There is a training process which is deterministic, but no one uses it (for good reasons).

If you had infinite budget, exactly the same code, the same training data, and even the same hardware you would not be able to reproduce the weights of Deepseek R1, because it wasn't trained using a deterministic process.

A lot of quibbling here; wasn't sure where to reply. If you've built any models in PyTorch, then you know. Conceptually it is deterministic: a model trained using deterministic implementations of the low-level algorithms will produce deterministic results. And when you are optimizing the pipeline, it is common to do just that:

    import random
    import numpy as np
    import torch

    # Seed every RNG in play and opt into deterministic kernels.
    torch.manual_seed(0)
    random.seed(0)
    np.random.seed(0)
    torch.use_deterministic_algorithms(True)
But in practice that is too slow, so we use nondeterministic implementations that run fast and loose with memory management and don't necessarily care about the order in which parallel operations return.

I’m pretty sure the initial weights are randomized, meaning no two models will train in the same way twice. The order in which you feed training data into the model would also add an element of randomness. Model training is closer to growing a plant than running a compiler.

That's still a deterministic algorithm. The random data and the order of feeding training data into it are part of the data which determines the output. Again, if you do it twice the same way, you'll get the same output.
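
A minimal sketch of what I mean (assuming the same machine and the same PyTorch build): seed the RNG and the "random" initialization itself is reproducible:

    import torch

    # Pseudorandom init: the seed is just another input to the algorithm.
    torch.manual_seed(1234)
    w1 = torch.nn.Linear(16, 16).weight.detach().clone()

    torch.manual_seed(1234)
    w2 = torch.nn.Linear(16, 16).weight.detach().clone()

    print(torch.equal(w1, w2))   # True: same seed, same "random" weights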

If they saved the initial randomized model and released it, and there was no random bit flipping during copying, then possibly; but it would still be difficult when you factor in the RLHF that comes about through random humans interacting with the model to tweak its workings. If you preserved that data as well, and got all of the initial training correct... maybe. But I'd bet against it.

So long as the data provided was identical, and sources of error like floating point errors due to hardware implementation details are accounted for, I see no reason output wouldn't be identical.

Where would other non-determinism come from?

I'm open to there being another source. I'd just like to know what it would be. I haven't found one yet.

> if you do it twice the same way, you'll get the same output

Point at the science that says that, please. Current scientific knowledge doesn't agree with you.

> Current scientific knowledge doesn't agree with you.

I'd love a citation. So far you haven't even suggested a possible source for this non-determinism you claim exists.

What makes models non-deterministic isn't the training algorithm, but the initial weights being random.

Training is reproducible only if, besides the pipeline and data, you also start from the same random weights.

That would fall under "Feed the same data in and you'll get the same weights out." Lots of deterministic algorithms use a random seed.

So is there no “introduce randomness” step somewhere afterwards? If not, I would guess these models would get stuck in a local maximum.

> If not, I would guess these models would be getting stuck in a local maxima

It sounds like you're referring to something like simulated annealing. Using that as an example, the fundamental requirement is to introduce arbitrary, uncorrelated steps -- there's no requirement that the steps be random, and the only potential advantage of using a random source is that it provides independence (lack of correlation) inherently; but in exchange, it makes testing and reproduction much harder. Basically every use of simulated annealing or similar I've run into uses pseudorandom numbers for this reason.
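
A toy sketch of that point (illustration only, not how any real trainer works): a seeded annealer takes arbitrary, uncorrelated steps, yet the whole run is reproducible from its seed:

    import random

    # Toy simulated annealing on f(x) = x**2, driven by a seeded PRNG.
    def anneal(seed, steps=10_000):
        rng = random.Random(seed)
        x = 10.0
        for i in range(steps):
            temperature = 1.0 / (i + 1)
            candidate = x + rng.uniform(-1.0, 1.0)
            # Accept improvements; occasionally accept worse moves too.
            if candidate ** 2 < x ** 2 or rng.random() < temperature:
                x = candidate
        return x

    print(anneal(0) == anneal(0))   # True: same seed, same trajectory
    print(anneal(0) == anneal(1))   # almost surely False: different seed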

Can you point at the research that says that the training process of a LLM at least the size of OLMo or Pythia is deterministic?

Can you point to something that says it's not? The only source of non-determinism I've read of affecting LLM training is floating point error, which is well understood and worked around easily enough.

Search more: there is a lot of literature discussing how hard the problem of reproducibility of GenAI/LLMs/deep learning is, how far we are from solving it for trivial/small models (let alone for beasts the size of the most powerful ones), and even how pointless the whole exercise is.

If there's a lot, then it should be easy for you to link an example right? One that points toward something other than floating point error.

There simply aren't that many sources of non-determinism in a modern computer.

Though I'll grant that if you've engineered your codebase for speed and not for determinism, error can creep in via floating point error, sloppy ordering of operations, etc. These are not unavoidable implementation details, however: CAD kernels and other scientific software engineer around them every day.
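
One sketch of what that engineering looks like, using nothing but the standard library (illustrative, not taken from any particular CAD kernel): an exactly rounded reduction gives the same answer no matter what order the terms arrive in, unlike naive accumulation:

    import math
    import random

    random.seed(0)
    values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
    shuffled = random.sample(values, len(values))

    # Naive accumulation depends on term order; math.fsum is exactly rounded,
    # so any ordering of the same terms gives the identical result.
    print(sum(values) == sum(shuffled))              # typically False
    print(math.fsum(values) == math.fsum(shuffled))  # True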

When you boil down what's actually happening during training, it's just a bunch of matrix math. And math is highly repeatable. Size of the matrix has nothing to do with it.

I have little doubt that some implementations aren't deterministic, due to software engineering choices as discussed above. But the algorithms absolutely are. Claiming otherwise seems equivalent to claiming that 2 + 2 can sometimes equal 5.

> I have little doubt that some implementations aren't deterministic

Not some of them; ALL OF THEM. Engineering training pipelines for absolute determinism would be, quite frankly, extremely dumb, so no one does it. When you need millions of dollars worth of compute to train a non-toy model, are you going to double or triple your cost just so that the process is deterministic, without actually making the end result perform any better?

Depends on how much you value repeatability in testing, and how much compute you have. It's a choice which has been made often in the history of computer science.

The cost of adaptive precision floats can be negligible depending on application. One example I'm familiar with from geometry processing: https://www.cs.cmu.edu/~quake/robust.html

Integer math often carries no performance penalty compared to floating point.

I guess my takeaway from this conversation is that there's a market for fast high-precision math techniques in the AI field.