> These flags don't magically change LLM formalisms. You can read more about how floating point operations produce non-determinism here:

Basically what you're saying is: "for 99.9% of use cases, and for how people actually use them, they are non-deterministic, and you have to work around that non-determinism so carefully, down to GPU-specific workarounds, that it makes them even less usable"

> In this context, forcing single-threading bypasses FP-hardware's non-associativity issues that crop up with multi-threaded reduction.

Translation: yup, they are non-deterministic under normal conditions. Which the paper explicitly states:

--- start quote ---

existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs.

--- end quote ---
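The non-associativity the paper is talking about is trivial to reproduce on any machine; here's a minimal Python sketch (nothing LLM-specific is assumed, just plain IEEE 754 doubles):

```python
# Floating-point addition is not associative: (a + b) + c != a + (b + c).
# This is exactly why reduction order (threads, GPUs, batch layout)
# can change the result of an otherwise identical computation.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False

# The error compounds over long accumulations, like the ones inside a matmul:
print(sum([0.1] * 10))  # 0.9999999999999999, not 1.0
```

Run the sums in a different order (as a multi-threaded reduction does) and you get bit-different results from the same inputs.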

> If you still don't have bit-replicated outputs for the same input sequence, either something is seriously wrong with your computer or you should get in touch with a reputable metatheoretician because you've just discovered something very significant.

Basically what you're saying is: If you do all of the following, then the output will be deterministic:

- work around GPU parallelism by setting num_thread to 1

- set temperature to 0

- set top_k to 0

- set top_p to 0

- set the context window to 0 (or always do a single run from a new session)

Only then will the output be the same every time. Otherwise even "non-shitty corp runners" or whatever will keep giving different answers to the same question: https://gist.github.com/dmitriid/5eb0848c6b274bd8c5eb12e6633...
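For concreteness, here is roughly what that list of settings would look like as an Ollama request body (a sketch, assuming Ollama's standard `/api/generate` options; `num_ctx` is my stand-in for "context window", the model name is hypothetical, and the values simply mirror the list above):

```json
{
  "model": "llama3",
  "prompt": "the same question every time",
  "options": {
    "num_thread": 1,
    "temperature": 0,
    "top_k": 0,
    "top_p": 0,
    "num_ctx": 0
  }
}
```

That's the pile of configuration you need before "deterministic" stops being a caveat-laden claim.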

Edit: so what we should be saying is "LLM models, as they are normally used, are effectively non-deterministic".

> Perhaps in the future you can learn from this experience and start with a post like the first part of this

So why didn't you?