> considering that model training process is non-deterministic
Why would it have to be? Just use a PRNG with published seeds and then anyone can reproduce it.
I have zero actual experience training models, but in general, parallelized work can involve fundamental nondeterminism (e.g., benign race conditions) that is tolerated because recording or reproducing the exact execution order would be prohibitively expensive performance-wise.
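To make that concrete, here is a minimal sketch of my own, not taken from any real training stack. It assumes the "race" is simply which worker's partial result gets added to a shared accumulator first. Floating-point addition is not associative, so a fixed seed pins down the data but not the sum if the order of additions varies. The names `make_values` and `sum_in_order` are hypothetical, and the shuffled orders only stand in for thread scheduling.

```python
import random


def make_values(seed, n=10_000):
    """Generate reproducible pseudo-random floats spanning many magnitudes."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(n)]


def sum_in_order(values, order):
    """Add values one at a time, visiting the indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


if __name__ == "__main__":
    # The published seed makes the *inputs* exactly reproducible.
    values = make_values(seed=1234)
    assert values == make_values(seed=1234)

    # Simulate different arrival orders of the same partial results.
    totals = set()
    for order_seed in range(8):
        order = list(range(len(values)))
        random.Random(order_seed).shuffle(order)
        totals.add(sum_in_order(values, order))

    # Typically prints more than one distinct total for identical inputs.
    print(len(totals), "distinct totals from 8 orderings")
```

Forcing a single fixed order would make the result reproducible, but in a real parallel system that generally means extra synchronization, which is the performance cost mentioned above.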