Ensuring the same floating-point algorithm workload behaves exactly the same on two distinct workstations is a heck of a lot of work that almost no one is willing to pay for.
Ensuring the same floating-point algorithm workload behaves exactly the same on two distinct workstations is a heck of a lot of work that almost no one is willing to pay for.
Not only that but heterogeneous clusters (inevitable at a large enough scale) will also have non-deterministic outputs. So it's great that they wrote kernels to make the forward pass deterministic but getting rid of it entirely at data center scale would mean that they'd also have to do this type of work across cluster nodes as well to maintain "cluster" invariance & not just batch invariance.