> The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.
No, it is not. The training process is non-deterministic: given exactly the same data, the same code, and the same seeds, you'll get different weights. Even the simplest operations, like matrix multiplication, give slightly different results depending on the hardware you're using (you'll get different results on CPU, on a GPU from vendor #1 and on a GPU from vendor #2, probably on different GPUs from the same vendor, on different CUDA versions, etc.). You'll also get different results depending on the dimensions of the matrices (e.g. if you fuse the QKV weights of a modern transformer into a single matrix and do one multiplication instead of three separate ones, the results differ), and some algorithms (e.g. the backward pass of Flash Attention) are explicitly non-deterministic to be faster.
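A minimal illustration of the underlying mechanism (plain Python, no GPU needed): IEEE-754 addition is not associative, so any change in reduction order, which is exactly what different matmul kernels, tile sizes, or fused-vs-split QKV multiplications produce, can change the low-order bits:

```python
# IEEE-754 addition is not associative: grouping the same three
# numbers differently yields different doubles.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one summation order
right = a + (b + c)  # same numbers, different order

print(left)          # 0.6000000000000001
print(right)         # 0.6
print(left == right) # False
```

A matrix multiply is just many such sums; two kernels that accumulate in different orders disagree in the last bits, and those bits get amplified over millions of training steps.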
> Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using
That has everything to do with the implementation, and nothing to do with the algorithm. There is an important difference.
Math is deterministic. The way [random chip] implements floating point operations may not be.
Lots of scientific software can use IEEE-754 floats for speed, or flip a switch to use arbitrary-precision arithmetic instead. The calculation being performed remains the same.
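A hedged sketch of that "flip a switch" idea, using the standard library's exact rational type (`fractions.Fraction`) as the arbitrary-precision backend; the `dot` helper here is hypothetical, not from any particular package:

```python
from fractions import Fraction

def dot(xs, ys):
    # The same algorithm runs regardless of the numeric type passed in;
    # only the number representation changes.
    total = type(xs[0])(0)
    for x, y in zip(xs, ys):
        total += x * y
    return total

floats = [0.1, 0.2, 0.3]
exact = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)]

print(dot(floats, floats))  # IEEE-754 result, carries rounding error
print(dot(exact, exact))    # exact rational result: Fraction(7, 50)
```

The float path is fast but inexact; the `Fraction` path is exact but far slower, which is the speed/precision trade-off the comment describes.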
> Math is deterministic.
The point is that none of these models are trained with pure "math". It doesn't matter that you can describe a theoretical training process as a set of deterministic equations, because in practice it doesn't work that way. Your claim that "the training process is fully deterministic" is objectively wrong here because none of the non-toy models use (nor could they practically use) such a deterministic process. A fully deterministic training process does exist, but no one uses it (for good reasons).
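A hedged sketch of why this matters in practice: summing the *same* ten thousand "gradient contributions" in different orders, which is effectively what a parallel reduction does from run to run, typically yields several distinct floating-point results (the exact count depends on the data):

```python
import random

# Fixed pool of pseudo-gradient values; only the summation order varies.
random.seed(1234)
grads = [random.gauss(0.0, 1.0) for _ in range(10_000)]

results = set()
for trial in range(20):
    random.shuffle(grads)     # stand-in for nondeterministic reduction order
    results.add(sum(grads))

print(len(results))           # typically more than one distinct float
```

Each difference is tiny, but a training run applies billions of such reductions, and the divergences compound until the final weights differ outright.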
If you had an infinite budget, exactly the same code, the same training data, and even the same hardware, you would still not be able to reproduce the weights of DeepSeek R1, because it wasn't trained using a deterministic process.