LLMs are generally deterministic. The token sampling step is usually randomized to some degree because it tends to give better results (more creative output) and helps avoid repetition loops, but you can turn that off (temperature zero for simple samplers).
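Roughly, "temp zero" just means taking the argmax at every step instead of sampling. A minimal sketch of the idea (PyTorch used only for the tensor ops; the function name is illustrative, not any library's API):

    import torch

    def pick_next_token(logits: torch.Tensor, temperature: float) -> int:
        # Greedy / "temp zero": always take the highest-scoring token.
        if temperature == 0.0:
            return int(torch.argmax(logits))
        # Otherwise rescale the logits and sample from the softmax.
        probs = torch.softmax(logits / temperature, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))

    logits = torch.tensor([1.0, 2.5, 0.3, 2.4])
    print(pick_next_token(logits, temperature=0.0))  # always index 1
    print(pick_next_token(logits, temperature=0.8))  # varies run to run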
This is an oversimplification. When inference is distributed, the order of additions during reductions is nondeterministic, and because floating-point addition is not associative the results can differ from run to run.
It’s nitpicking for sure, but it causes real challenges for reproducibility, especially during model training.
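A toy illustration of the non-associativity, nothing LLM-specific (plain Python doubles):

    a, b, c = 1e16, -1e16, 1.0

    left_to_right = (a + b) + c   # = 1.0
    right_to_left = a + (b + c)   # = 0.0, the 1.0 is absorbed into -1e16
    print(left_to_right, right_to_left)  # 1.0 0.0

    # Parallel reductions (e.g. across GPUs) effectively pick an addition
    # order at runtime, so the same sum can come out slightly different.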
+ you can also just pin the seed instead, right?
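I.e. something along these lines (a sketch, assuming PyTorch/NumPy):

    import random
    import numpy as np
    import torch

    def pin_seeds(seed: int = 0) -> None:
        # Seed every RNG the sampling path might touch.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe no-op without CUDA

    pin_seeds(1234)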
This belief (LLMs are deterministic except for samplers) is very wrong and will get you into a hilarious amount of trouble if you assume it's true.
Also greedy sampling considered harmful: https://arxiv.org/abs/2506.09501
From the abstract:
"For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision—while critical for reproducibility—is often neglected in evaluation practices."