There's usually an if(temp == 0) to change sampling methods to "highest probability" -- if you remove that conditional but otherwise keep the same math, that's not deterministic either.

If you remove the conditional and keep the same math, you divide by zero and get NaNs. In the limit as temperature goes to zero, you do in fact get greedy, highest-probability sampling.
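To make both halves of that concrete, here is a minimal sketch in plain Python. The helper names `ieee_div` and `softmax_with_temperature` are made up for illustration and are not taken from any inference library. Python itself raises `ZeroDivisionError` on `x / 0.0`, so `ieee_div` imitates what C or NumPy floats would do under IEEE 754.

```python
import math

def ieee_div(x, t):
    # Hypothetical helper: mimic IEEE 754 float division, which yields
    # inf or NaN on division by zero instead of raising like Python does.
    if t == 0.0:
        return math.nan if x == 0.0 else math.copysign(math.inf, x)
    return x / t

def softmax_with_temperature(logits, temp):
    # Naive softmax(logits / temp) with no special case for temp == 0.
    scaled = [ieee_div(x, temp) for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # inf - inf is NaN when temp == 0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for temp in (1.0, 0.1, 0.001, 0.0):
    print(temp, softmax_with_temperature(logits, temp))
```

As the temperature shrinks, the probability mass piles onto the largest logit, and at exactly zero every entry comes out NaN, which is why real samplers branch instead of dividing.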

if (t==0) argmax(logits) else pick(logits)
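Spelled out as something runnable, that one-liner might look like the sketch below. `sample_token` is a hypothetical name, and the non-greedy branch is plain temperature sampling as a stand-in for whatever `pick` does in a real implementation (top-k, top-p, and so on are left out).

```python
import math
import random

def sample_token(logits, temp, rng=random):
    # Return a token index: greedy when temp == 0, else temperature sampling.
    if temp == 0:
        # Greedy branch: a single linear scan, no division and no randomness.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Subtract the max before exponentiating so exp() cannot overflow.
    m = max(logits)
    weights = [math.exp((x - m) / temp) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [0.1, 3.0, 1.2]
print(sample_token(logits, 0))                       # greedy pick
print(sample_token(logits, 0.8, random.Random(42)))  # seeded random pick
```

Passing a seeded `random.Random` makes the sampled branch repeatable on one machine, which is a separate question from whether the logits themselves are bit-identical across hardware.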

I'd assume that's just an optimization? Why bother sorting the entire list if you're just going to pick the top token? That's linear time versus whatever your sort time is.

Having said that, of course it's only as deterministic as the hardware itself is.

The likelihood that the top two logits are close enough for the choice to be hardware-dependent is pretty low. IIUC, it's more of an issue when you are using other picking methods.
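For a feel of how close "close enough" has to be, here is a toy illustration. It assumes the hardware difference boils down to floating-point additions being done in a different order, which is one common cause but not the only one. The numbers and the helper names `sum_in_order` and `argmax` are invented for the example.

```python
def sum_in_order(terms):
    # Accumulate left to right, the way a simple reduction loop would.
    total = 0.0
    for t in terms:
        total += t
    return total

def argmax(values):
    # Index of the largest value; on an exact tie the first index wins.
    return max(range(len(values)), key=lambda i: values[i])

terms = [0.1, 0.2, 0.3]           # partial products feeding one logit
forward = sum_in_order(terms)
backward = sum_in_order(reversed(terms))

rival = 0.6                       # a competing token's logit
print(forward == backward)        # the two orders disagree in the last bit
print(argmax([rival, forward]), argmax([rival, backward]))
```

The two summation orders differ only in the last bit, so the greedy pick flips only because the rival logit sits right on top of them. With a realistic gap between the top two tokens, the same rounding noise changes nothing.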

In llama.cpp, for example? Specific to the architecture or in general? Could you point out where this is happening? Not that I don't believe you, but I haven't seen that myself, and I'd appreciate learning in more depth how it works.