> commonly people try to set temp=0 to get "deterministic" or "most factual" output but we all know that is just Skinner pigeon pecking.

Hmm? Given the same runtime and the same weights, are you saying temp=0 isn't actually deterministic? Most FOSS/downloadable models tend to work as expected with temp=0 in my experience. Obviously that won't give you the "most factual" output, because that's something else entirely, but with most models it should give you deterministic output.

"What might be more surprising is that even when we adjust the temperature down to 0This means that the LLM always chooses the highest probability token, which is called greedy sampling. (thus making the sampling theoretically deterministic), LLM APIs are still not deterministic in practice (see past discussions here, here, or here)"

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
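
For concreteness, here's roughly what temperature sampling looks like in a simple engine, with the greedy special case at temp=0. This is a minimal numpy sketch, not any particular implementation's code:

    import numpy as np

    def sample(logits, temp, rng=np.random.default_rng()):
        # temp == 0: greedy / argmax -- deterministic given identical logits
        if temp == 0:
            return int(np.argmax(logits))
        # otherwise: scale by temperature and sample from the softmax
        z = logits / temp
        z = z - z.max()                  # subtract max for numerical stability
        p = np.exp(z) / np.exp(z).sum()
        return int(rng.choice(len(logits), p=p))

Run-to-run, the greedy branch only varies if the logits themselves vary, which is exactly what the article is about.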

Also from the article:

"Note that this is “run-to-run deterministic.” If you run the script multiple times, it will deterministically return the same result. However, when a non-batch-invariant kernel is used as part of a larger inference system, the system can become nondeterministic. When you make a query to an inference endpoint, the amount of load the server is under is effectively “nondeterministic” from the user’s perspective"

Which is a factor you can control when running your own local inference, and which in many simple inference engines simply doesn't arise. In those cases you do get deterministic output at temperature=0 (provided they got everything else mentioned in the article right).
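
To make the mechanism concrete: floating-point addition isn't associative, so a kernel that reduces in a different order (say, because the batch size changed with server load) can produce slightly different logits from identical inputs. A toy numpy demonstration, nothing engine-specific:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000).astype(np.float32)

    # Same numbers, two different reduction orders:
    a = x.sum()                                   # flat accumulation
    b = x.reshape(1000, 1000).sum(axis=0).sum()   # blockwise accumulation
    print(a == b, a, b)                           # often False in float32

In a served endpoint that last-bit noise only matters if it flips a close argmax call somewhere in the sequence; once it does, generation diverges from there.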

Having implemented LLM APIs, I can say that if you selected 0.0 as the temperature, my interface would drop the existing picking algorithm and select argmax(logits).

There's usually an if(temp == 0) to change sampling methods to "highest probability" -- if you remove that conditional but otherwise keep the same math, that's not deterministic either.

If you remove the conditional and keep the same math, you divide by zero and get NaNs. In the limit as temperature goes to zero, you do in fact get maximum-likelihood (argmax) sampling.
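
Easy to check numerically (plain numpy again, assuming the usual softmax(logits / temp)):

    import numpy as np

    def softmax_t(logits, temp):
        z = logits / temp    # temp == 0 gives inf/-inf here
        z = z - z.max()      # inf - inf -> nan
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([2.0, 1.0, -1.0])
    print(softmax_t(logits, 0.0))    # [nan nan nan] (plus runtime warnings)
    print(softmax_t(logits, 1e-6))   # ~[1. 0. 0.], i.e. argmax in the limit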

if (t==0) argmax(logits) else pick(logits)

I'd assume that's just an optimization? Why bother sorting the entire list if you're just going to pick the top token? Linear time versus whatever your sort time is.
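
And the fast path skips more than the sort. Compare a hypothetical top-k sampler (names made up, not from any real engine) with the greedy branch:

    import numpy as np

    def pick_topk(logits, temp, k, rng):
        # a full sort would be O(n log n); argpartition keeps this O(n),
        # but there's still the exp/normalize/RNG work afterwards
        top = np.argpartition(logits, -k)[-k:]
        z = logits[top] / temp
        z = z - z.max()
        p = np.exp(z) / np.exp(z).sum()
        return int(top[rng.choice(k, p=p)])

    def pick_greedy(logits):
        # single O(n) scan, no exp, no divide, no RNG
        return int(np.argmax(logits))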

Having said that, of course it's only as deterministic as the hardware itself is.

The likelihood that the top two logits are close enough for the result to be hardware-dependent is pretty low. IIUC it's more of an issue when you are using other picking methods.
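
One way to sanity-check that claim on your own model is to look at the margin between the top two logits -- the argmax can only flip if cross-run numerical noise exceeds that gap. A sketch with stand-in random logits (substitute real ones from your model):

    import numpy as np

    def top2_margin(logits):
        # gap between best and second-best logit
        top2 = np.partition(logits, -2)[-2:]
        return float(top2[1] - top2[0])

    rng = np.random.default_rng(0)
    print(top2_margin(rng.standard_normal(32_000)))  # stand-in logits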

In, for example, llama.cpp? Specific to the architecture or in general? Could you point out where this is happening? Not that I don't believe you, but I haven't seen that myself, and would appreciate learning in more depth how it works.