In theory, temperature 0 does make the LLM deterministic.

Well, in theory theory, temperature 0 doesn't really exist. Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0. In practice, temperature=0 is literally a separate branch of an if statement that just picks the most common sample (using the actual formula that works for non-zero values would cause a zero division).

However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.

>in theory theory, temperature 0 doesn't really exist.

It does exist very much, even if you go to pure math. Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easer to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients. But setting T to zero is in both, theory and practice, turning the usual probability function into greedy sampling.

> Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function.

In pure math, it does not always do that. It becomes a dirac-delta comb with equal weight on every maximum. There can be more than 1 maximum. Setting the temperature to zero turns into greedy sampling, but greedy sampling is not necessarily deterministic as you can have multiple equally optimal options.

That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal. That's why greedy sampling generally produces deterministic output for LLMs. The real gotchas are elsewhere (like with batch inference as we've seen with earlier GPTs). But unlike what the earlier comment says, this is a non-issue mathematically.

> That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal

Any two tokens ending up with the exact same logit is very unlikely, but not impossible; and as the number of output tokens grows, the odds that it will happen eventually gets higher and higher.

I suppose, to ensure determinism, rank by logit then token ID, so you still have a deterministic winner even if occasionally two tokens get precisely identical logits.

You aren't looking for a random set of tokens that have the exact same logit, you are looking for the largest n tokens to have the exact same probability.

This is exceedingly unlikely, as training will only push one of them up for any individual sample. There are likely some pathological situations that could end up with that situation, maybe, but it is pretty unlikely in a general case.

"Makes unlikely" is very different from "prevents."

If there's one counterexample, it's not really deterministic.

Exactly, consider the scenario where laws are at play and violating them could cost companies thousands. Recently my father received a 'request for address' letter addressed to me at his nursing home, the building has always been a nursing home, and he's also in his mid-70s. That's very obviously a violation of the Fair Debt Collection Practices Act. Imagine the implication of this if the law firm in questions used an AI-assisted data enriching product to find this information. That SaaS company is not only liable to that one law firm but every law firm who uses their software. Its potentially a federal class action lawsuit.

My point is, deterministic logic matters in certain circumstances 100% of the time. Forcing the LLM to make something unlikely is not good enough because a series of mistakes could very quickly bankrupt the company.

>My point is, deterministic logic matters in certain circumstances 100% of the time. Forcing the LLM to make something unlikely is not good enough because a series of mistakes could very quickly bankrupt the company.

If your argument is that the danger of equal values being selected inconsistently breaks determinism, that's a trivial problem to solve.

Any non-infinite precision numbering system by definition is at the limits of it's precision when equal values occur. If you need to order such values you can extend the precision and add on a deterministically unique tiny value (position, order encountered, etc.) . Your original value stays in the same precision range but they are now unique.

It's usually more likely that you want to sacrifice a little precision for determinism so you can quantise to allocate the range where you apply the unique ID

For example if you had an array of 256 fp32 values but you required them to be unique, you can lop off 8 bits of mantissa and replace it with its index in the array, Every value is then unique.

Granted token dictionaries make for some fairly hefty indexes now, but the principle applies in general, it's easily solvable if you are prepared to spend some precision or do some extra calculation.

> for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal.

In one thinking trace of 10k tokens, with fp16 or bf16 logits, I don't reckon a collision is rare? There are only 65k floating point numbers with that accuracy. And an agent can quickly rake up 100k tokens, so while not every token will have such a collision of equiprobable logits, it is not rare.

> It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easer to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients.

I don't understand the distinction you're drawing. A Dirac delta function is a "simple if check".

The point is that the case T=0 doesn't just "exist" as a special code branch - it is still well defined mathematically without any change to the output function. What the above comment refers to with the extra "if" check is just a limitation of computers not liking to divide anything by zero, even if the actual function exists and is well behaved at zero. It is not some weird or special theoretical construction.

Floating point defines n/0 the same as math. It's infinity as long as n isn't zero.

In almost all forms of math, the value n/0 is undefined. It's definitely not infinity, for two reasons - depending on the value of n, it can be negative; and neither info nor -inf are numbers, so they can't be the result of an equation (unless you look at transfinite equations).

What you can do in math is talk about the limit of a series of fractions as the denominator approaches 0, and that's where you get some relation to infinity or -infinity. But the limit can also be any other number, if the numerator also gets closer to 0; or it can not exist, if the function oscillates.

I explicitly didn't say "infinity or negative infinity" because I didn't think that level of pedantry would be needed here on HN. I guess I was wrong.

It's not positive or negative infinity. It is simply undefined. Math has many conventions, and you can define your own convention that it does equal some flavor of infinity, but that is only a convention, and not a universal one.

All discussions of mathematics assume maximal possible pedantry.

That's not the problem, and this is not just pedantry. It's just not correct to say that n/0 = inf, nor even to say that positive_n / 0 = inf, in any normal math context.

For example, if you accepted that n/0 = inf just like n/1 = n, then you'd conclude that n/0 + 3 = inf + 3 = inf, so n/0 + 3 = n/0, so 3 = 0. Or you'd want to do weird things like asking what is sin(inf).

> as long as n isn't zero

Which is the case with softmax function, as for T=0 you end up with a fraction that either becomes 0/0 or inf/inf [0]. So you do need branching as floating point arithmetic is not gonna get you there.

[0] except for weights that are exactly 0

edit: thinking more about it, one could always express the softmax formula in ways that this could work with floating point arithmetic but it would be very inefficient and sort of pointless

Even if it's deterministic that doesn't mean it isn't arbitrary. I can achieve determinism at any temperature by saving the seed. But that wouldn't make rejects feel much better knowing that if a bit was flipped in an arbitrary seed they would be scored differently.

I did large scale tests temp 0 and there was still randomness with the same prompt inputs coming in.

I did this with several model apis.

GPU processing is not going to be the same from what I read but also the AI backend is doing a lot of fancy batching resulting in another layer of randomness.

> Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0.

That's not how limits work. As the temperature goes to 0, the rest goes to 0. That's it. The "almost-but-not-quite" is part of the "goes to".

Let's say f(x) = 3x+1. It's a continuous function. If we let x go to 10, f(x) goes to 31. Not "almost-but-not-quite 31". No, to 31. (If you don't have a continuous function then it's the same argument, but less intuitive to illustrate.)

It is not deterministic because the order of computations in a typical multithreaded system is not deterministic and also because when combined with the devil that is IEEE754, it gets even less deterministic.

> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.

Exactly. While I’m assuming this won’t be news for most here, for those that are still new and/or curious about some more explanation on e.g. the floating-point imprecisions, see this nice article: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

As I understood it, the "randomness" affecting what is selected at any temperature still comes from a PRNG or CSPRNG (or whatever RNG you want, maybe a hardware one), and if you where to swap out that with something deterministic you'd get the same results every time (barring non-determinism in other parts of the OS/drivers/maybe even hardware).

But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent.

They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.

It's probably due to the fact that it's a cloud service. You have no guarantee that your next request will go to the same machine. So even with an identical seed, and temp 0 you might get different hardware and hence different accuracy/noise in the floating point operations.

How can there be noise in floating point operations? I could buy like completion order for parallized batches i.e. adding a+b+c is different from a+c+b etc.

Batching order, as you mentioned, matters a lot, and for any heavily optimized kernels it will change from one machine to the next. You also have the choice of backend numerical library from, e.g., different OS versions. There are floating-point bugs from time to time, especially in GPUs. Many operations (like transcendentals) are usually given a couple bits of wiggle room in the result. Another program executing could have changed the floating-point rounding mode on one device. More aggressive ML optimizers might automatically apply various forms of reduced precision to the requested high-level operation. If you have enough optimizations enabled, you might non-deterministically get compiled instructions like fmadd so that any one build of your library is deterministic (excluding other ideas mentioned above) but different machines with different builds (because of a staged rollout, different architectures, engineering mistakes, etc) can have different outputs. And so on.

IEEE-754 doesn't mandate exact results for functions like exp(x). It mandates things like "within 2 ULP of the true answer." Hardware vendors are free to implement these functions in any way that meets the error tolerance.

[deleted]

While the IEEE 754 standard ensures that individual basic operations are deterministic and strictly bounded, it does not guarantee that an entire program will yield bit-identical results on all CPUs.

CPUs and their execution environments introduce subtle hardware variations, architecture choices, and compiler optimizations that break bit-level consistency.

(same for GPU/TPU, ...)

Parent is correct - the math is very deterministic if you can guarantee it’s running repeatedly on the same machine and you’re not processing “random” requests in parallel. The compiler is irrelevant because once the code is generated it’s not getting recompiled and thus isn’t a source of non determinism (and generally if you don’t touch the math the compiler will consistently emit the same underlying machine code).

This sub-thread was about cloud environments, where different requests may be served by different hardware. And it's in fact very likely that there will be a mix of different hardware from different vendors, in any particular LLM cloud for now.

It is, after all, a fundamentally voltage-based process, and the logical “no-man’s land” is chosen to limit the likelihood of a weak component producing faulty logic, but it’s impractical to run through the set of all possible starting states and to verify that after an unbounded number of clock steps the machine reaches a predictable end state on all of the devices being manufactured.

Stable seeding is not enough. A lot of modern, fast compute kernels are nondeterministic. Floating point multiplication/addition is not strictly associative and e.g. reductions can combine results from different threads in different orders (e.g. through atomic ops). You can write kernels to be deterministic, but it is generally less efficient.

They are only non-deterministic when you’re doing batching and a kernel ends up running across a “random” set of token streams. If you’re only processing one user’s request, they’re very much deterministic.

that's incorrect in the presence of batching. it's tough work making it truly deterministic:

https://x.com/FireworksAI_HQ/status/2069873437217276015

It's not that hard. What is hard is making it truly deterministic and retain high throughput.

PRNG is deterministic.

If you make an exact integer implementation and run with temp=0 it's deterministic.

You don't even need temperature 0, just make a random seed for the sampler part of the input and then its deterministic as a function of the input.

But running autoregressive models at temp=0 tends to expose pathological behavior, because the training process produces a function with a lot of gain so its prone to feedback on its own noise.

> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run

The implementation does not often differ run by run.

> The implementation does not often differ run by run.

If you use a cluster, or even multiple clusters, and they have non-identical hardware, then two consecutive runs could end up being routed to nodes having different GPU models with slightly different floating point behaviour, or even software differences (e.g. newer GPU offers some feature usable to speed up calculations which older model lacked; same code can use the feature when it is available, fall back to slower alternative if it isn’t). The larger your scale, the greater the odds it will happen