> security audits

If you are unable to run the multimillion-dollar training run yourself, then any security audit of the training code is absolutely meaningless, because you have no way to verify that the released weights were actually produced by that code.

Also, the source code/binary analogy breaks down quickly, because the training process is non-deterministic: even if you are able to run the training, you end up with different weights than the ones the model developers released, and then... then what?

I probably shouldn't have led with that example because yeah, reproducible (and cheap) builds would be best for security audits. But I wouldn't say it's absolutely meaningless. At least it can guide your experimentation, and if results start differing radically from what you'd expect from the training data, that raises interesting questions.

If you're going through the effort to be open source, you can probably set up fixed batch sizes and deterministic batch ordering without too much more effort. At least I hope it's not super hard.
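
For what it's worth, a minimal sketch (assuming PyTorch; the dataset and seed value here are just placeholders) of what pinning down the data pipeline could look like: fixed batch size, shuffling driven by a single published seed, and single-process loading so batch composition is identical run to run.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

SEED = 42  # hypothetical published seed

# Stand-in dataset; in practice this would be the published training corpus.
dataset = TensorDataset(torch.arange(1000, dtype=torch.float32).unsqueeze(1))

g = torch.Generator()
g.manual_seed(SEED)

loader = DataLoader(
    dataset,
    batch_size=32,   # fixed batch size
    shuffle=True,    # shuffle order comes entirely from the seeded generator
    generator=g,
    num_workers=0,   # avoid worker-dependent ordering
    drop_last=True,  # identical batch shapes every epoch
)

# Any two runs with the same SEED iterate batches in the same order.
first_batch = next(iter(loader))[0]
print(first_batch[:5].squeeze())
```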

> considering that model training process is non-deterministic

Why would it have to be? Just use a PRNG with published seeds and then anyone can reproduce it.
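
Roughly what that would look like in a PyTorch training stack (an assumption; the seed value is arbitrary, the flags are PyTorch's own determinism knobs):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    random.seed(seed)        # Python's own PRNG
    np.random.seed(seed)     # NumPy PRNG (data augmentation, sampling, etc.)
    torch.manual_seed(seed)  # CPU and all CUDA devices
    # Ask the backends to pick reproducible kernels where they exist;
    # this raises an error if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    # Required by cuBLAS for deterministic matmuls on recent CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

seed_everything(1234)  # the published seed anyone could reuse
```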

I have zero actual experience training models, but in general, when parallelizing work, there can be fundamental nondeterminism (e.g., benign race conditions) that is tolerated because recording or reproducing it exactly would be prohibitively expensive.
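
A toy illustration of the underlying issue, in plain Python: floating-point addition isn't associative, so the order in which parallel workers happen to accumulate values changes the result in the last bits, and forcing a fixed order is exactly the kind of constraint that costs performance.

```python
import random

random.seed(0)
# Values spanning many orders of magnitude, like gradients often do.
values = [random.uniform(-1.0, 1.0) * 10 ** random.randint(-8, 8)
          for _ in range(100_000)]

forward = sum(values)             # one accumulation order
backward = sum(reversed(values))  # another order
shuffled_vals = values[:]
random.shuffle(shuffled_vals)
shuffled = sum(shuffled_vals)     # yet another order

# The three sums typically differ in the last bits; unordered reductions
# (e.g., atomic adds on a GPU) produce this kind of divergence every run.
print(f"{forward!r}\n{backward!r}\n{shuffled!r}")
```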