ELI5: how (specifically) do GPU and TPU optimisations affect determinism in LLMs? Or is this all a myth?

LLMs are generally deterministic. The token sampling step is usually randomized to some degree because it gives better results (creativity) and helps avoid loops, but you can turn that off (temperature zero, for simple samplers).
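To make "temperature zero" concrete, here is a minimal sketch of greedy vs. temperature sampling (NumPy; the function name and logits are made up for illustration):

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Pick the next token id from a vector of logits (illustrative sketch)."""
    if temperature == 0:
        # Temperature zero in a simple sampler is just argmax: no randomness left.
        return int(np.argmax(logits))
    z = (logits - logits.max()) / temperature   # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_token(logits, temperature=0, rng=rng))    # always 0
print(sample_token(logits, temperature=1.0, rng=rng))  # varies with the seed
```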

This is an oversimplification. When inference is distributed, the nondeterministic order of additions during reductions can produce nondeterministic results, because floating-point addition is not associative.
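A quick way to see that non-associativity in plain Python (the values are chosen just to make the rounding visible):

```python
a, b, c = 1e16, -1e16, 1.0

# Mathematically identical sums land on different floats,
# so the order a distributed reduction runs in changes the answer.
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0
```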

It’s nitpicking for sure, but it causes real challenges for reproducibility, especially during model training.

You can also just pin the seed instead, right?
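Pinning the seed only fixes the sampler's randomness, not the kernel-level reduction order discussed above. A rough PyTorch sketch of the two separate knobs (single-process, illustrative):

```python
import torch

torch.manual_seed(0)  # makes the *sampling* RNG reproducible on this machine

# Reduction order inside kernels is a separate knob. PyTorch can force
# deterministic algorithms where they exist, at a performance cost, and
# some ops simply have no deterministic implementation.
# (On CUDA, some ops additionally need CUBLAS_WORKSPACE_CONFIG set.)
torch.use_deterministic_algorithms(True)
```

Even with both set, you only get run-to-run determinism on the same hardware and shapes; it doesn't make results portable across GPU types, counts, or batch sizes.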

This belief (LLMs are deterministic except for samplers) is very wrong and will get you into hilariously large amounts of trouble for assuming it's true.

Also greedy sampling considered harmful: https://arxiv.org/abs/2506.09501

From the abstract:

"For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision—while critical for reproducibility—is often neglected in evaluation practices."

They don't affect determinism of the results, but different architectures have different determinism guarantees with respect to performance (how many cycles a run takes), as a result of scheduling and other things.

TPUs share a similar lineage with Groq's LPU accelerators (disclaimer: I work at Groq), which are fully deterministic: not only do you get deterministic output, you get it in a deterministic number of cycles.

There is a trade-off, though: making the hardware deterministic means giving up HW-level scheduling and the other on-chip mechanisms that introduce non-determinism. This makes the architecture highly dependent on a "sufficiently smart compiler". TPUs and processors like them are generally considered VLIW, and all of them depend on the compiler making the smart scheduling decisions upfront to ensure good compute/IO overlap, eliminate pipeline bubbles, etc.

GPUs, on the other hand, have very sophisticated scheduling systems on the chips themselves, along with features like kernel swapping, that make them much more flexible and less dependent on the compiler; it's generally easier to reach fairly high utilisation of the processor without too much work.

TLDR: TPUs MAY have deterministic cycle guarantees. GPUs (of the current generation/architectures) cannot because they use non-deterministic scheduling and memory access patterns. Both still produce deterministic output for deterministic programs.