Training, especially on large GPU clusters, is inherently non-deterministic, even if all random seeds are fixed.

This comes down to framework implementation details, timing-dependent behavior (for example, the order in which floating-point reductions complete across devices), and the extra runtime cost of requesting deterministic algorithms, which still comes without end-to-end guarantees.
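As a concrete illustration, here is a minimal sketch of what "fixing all seeds" and requesting determinism typically looks like, assuming PyTorch as the framework; the helper names `seed_everything` and `request_determinism` are hypothetical, but the calls they wrap are standard PyTorch APIs. Even with all of this in place, multi-GPU runs can still diverge, because kernel selection and reduction order are not fully under the user's control.

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Fix the common sources of randomness within a single process."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    os.environ["PYTHONHASHSEED"] = str(seed)


def request_determinism() -> None:
    """Ask the framework for deterministic kernels (slower, and still not a guarantee)."""
    # Required by cuBLAS for deterministic matmuls on CUDA 10.2+.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Raise an error whenever an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    # Disable cuDNN autotuning, which selects kernels based on runtime timing.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


seed_everything(42)
request_determinism()
```

Note the trade-off the sketch makes explicit: deterministic kernels are often slower, and `torch.use_deterministic_algorithms(True)` will simply error out for operations that have no deterministic variant at all.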