Why do you care about determinism in a probabilistic system? What difference does it make to the end user if the input "How do I X?" always produces the same output, when the semantically equivalent inputs "how do i x?", "how do I x", and "how do I X??" are bound to produce different answers that often won't even be semantically equivalent to each other?

What LLMs need is the ability to guarantee semantically equivalent outputs for all semantically equivalent inputs, but that's very different from "determinism" as we understand it from other algorithms.

Not all LLM-based applications are user-facing free-form chat.

If you take an LLM that makes 10 tool calls in a row for an evaluation, any reduction in unpredictable drift is welcome. The same applies to running your prompt through a DSPy optimizer [0]. There are countless other examples: basically any situation where you control the prompt, i.e. the token-level input to the LLM, so there's no fuzziness on your end.

In that case, once you've eliminated token-level fuzziness and can guarantee you're not introducing any from your own end, you can map out a much more reliable tree or graph of your system's behavior (sketched below).

[0]: https://dspy.ai/#2-optimizers-tune-the-prompts-and-weights-o...
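A minimal sketch of what that buys you, assuming deterministic decoding is available; `call_llm` and `execute_tool` are hypothetical stand-ins for whatever client and tool runner you actually use:

```python
# With temperature 0 and a fixed seed, repeated runs should produce the same
# tool-call trace, so each prompt maps to exactly one path in the behavior graph.
import hashlib
import json

def run_agent_once(call_llm, execute_tool, prompt: str, max_steps: int = 10) -> list[dict]:
    messages = [{"role": "user", "content": prompt}]
    trace = []
    for _ in range(max_steps):                   # the "10 tool calls in a row" case
        reply = call_llm(messages, temperature=0.0, seed=1234)
        call = reply.get("tool_call")            # e.g. {"name": "search", "args": {...}}
        if call is None:
            break
        trace.append(call)
        messages.append({"role": "tool", "content": execute_tool(call)})
    return trace

def trace_fingerprint(trace: list[dict]) -> str:
    # Equal fingerprints across runs => identical path through the tool-call tree.
    return hashlib.sha256(json.dumps(trace, sort_keys=True).encode()).hexdigest()
```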

> If you take an LLM that makes 10 tool calls in a row for an evaluation, any reduction in unpredictable drift is welcome

Why use an ambiguous natural language for a specific technical task? I get that it's a cool trick, but surely they can come up with another input method by now?

You aren't wrong, but that doesn't mean this level of determinism isn't useful. If you don't even have the level of determinism where the exact same input tokens produce the exact same output tokens, then it's very hard to share reproducible results with peers, which can be useful if you are, say, red-teaming an LLM to produce a very rare / unreliable output.

I'm actually working on something similar to this where you can encode information into the outputs of LLMs via steganography: https://github.com/sutt/innocuous

Since I'm really only sampling from the top ~10 tokens, and I mostly test on CPU-based inference of 8B models, I'm probably not at much risk of the top tokens being reordered by hardware-specific implementation details, but I'm still going to look at it eventually and build in guard conditions against any choice that could be flipped by an epsilon of precision loss.
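A rough sketch of that guard idea (not the project's actual code): only let a top-k position carry hidden bits when the probability gap between adjacent candidates exceeds some epsilon, so a tiny precision difference can't reorder the candidates and corrupt the decode.

```python
import numpy as np

def safe_topk_choice(probs: np.ndarray, k: int = 10, payload_index: int = 0,
                     eps: float = 1e-4) -> int:
    order = np.argsort(probs)[::-1][:k]          # top-k token ids, most probable first
    gaps = probs[order[:-1]] - probs[order[1:]]  # gaps between adjacent candidates
    if np.any(gaps < eps):
        return int(order[0])                     # too close to call: emit argmax, encode nothing
    return int(order[payload_index])             # safe to let the payload pick the rank
```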

It would be very useful for AI platform customers. You could run prompts at temperature 0 and check whether the results are the same, making sure the AI provider isn't quietly switching the PRO model for a cheap one in the background and ripping you off.
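Sketch of the idea, assuming truly deterministic inference is available; `complete()` is a hypothetical wrapper around whatever provider API you use:

```python
# Record a fingerprint of the model's temperature-0 answers to a fixed probe set,
# then periodically re-run the probes. If the provider quietly swaps the PRO model
# for a cheaper one, the fingerprint changes even though the model name does not.
import hashlib

PROBES = ["Summarize the plot of Hamlet in one sentence.",
          "What is 17 * 23?"]

def model_fingerprint(complete) -> str:
    outputs = [complete(p, temperature=0.0, seed=0) for p in PROBES]
    return hashlib.sha256("\n".join(outputs).encode()).hexdigest()
```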

For "bug" reproduction purposes. It is easier to debug a model if the same string always produces the same incorrect or strange LLM output, not every 100th time you run it.

If there is a bug (a behavior defined by whatever criteria you like), it is just a single path through a very complex system with high connectivity.

This nonlinear, chaotic behavior, regardless of the implementation details of the black box, makes an LLM seem nondeterministic. But an LLM is just a pseudorandom number generator attached to a probability distribution (toy illustration below).

(As I am writing this on my iPhone with text completion, I can see this nondeterministic behavior)
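A toy illustration of that view: given identical logits and an identically seeded RNG, even temperature > 0 sampling replays the exact same token sequence. The 4-token "vocabulary" here is made up for the example.

```python
import numpy as np

def sample_next(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()                              # softmax over the toy vocabulary
    return int(rng.choice(len(p), p=p))

logits = np.array([2.0, 1.0, 0.5, -1.0])

def replay(seed: int) -> list[int]:
    rng = np.random.default_rng(seed)
    return [sample_next(logits, temperature=0.8, rng=rng) for _ in range(5)]

assert replay(42) == replay(42)   # same seed, same distribution -> same "random" tokens
```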

Deterministic output is needed when LLMs are used for validation. This can be anything from input validation at runtime to a CI check leveraging LLMs. It can be argued this is not an acceptable use of AI, but it will become increasingly common and it will need to be tweaked/tested. You cannot tweak/test a response you don't know you're going to get.
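A sketch of what such a CI check might look like, assuming deterministic temperature-0 inference; `llm_judge` is a hypothetical wrapper around your provider, and the prompt/criteria are illustrative:

```python
def validate_changelog(entry: str, llm_judge) -> bool:
    verdict = llm_judge(
        "Does this changelog entry describe a breaking change? "
        f"Answer YES or NO only.\n\n{entry}",
        temperature=0.0,
        seed=0,
    )
    return verdict.strip().upper() == "NO"

def test_changelog_is_non_breaking(llm_judge):
    # Without determinism this test flakes; with it, a failure is a real regression.
    assert validate_changelog("Add --verbose flag to the CLI.", llm_judge)
```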

Yeah, indeed. Regression testing for chatbots that use RAG would involve making sure the correct response comes from the retrieved context.

Today we have an extremely hacky workaround that at least ensures the desired chunk is selected by the retriever, but it's far from ideal and our code is not well written (a temporary POC written by AI that has been sitting there for quite a few months now ...).
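Roughly the shape of that workaround (names are illustrative, not our actual code): assert that the chunk a question is supposed to be answered from actually shows up in the retriever's top-k results.

```python
REGRESSION_CASES = [
    {"question": "How do I reset my password?", "expected_chunk_id": "kb-auth-0042"},
    {"question": "What is the refund window?",  "expected_chunk_id": "kb-billing-0007"},
]

def test_expected_chunks_are_retrieved(retriever):
    for case in REGRESSION_CASES:
        top_ids = [chunk.id for chunk in retriever.search(case["question"], k=5)]
        assert case["expected_chunk_id"] in top_ids, case["question"]
```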

In MCP-style applications, an LLM is more like regex on steroids, and since you expect your regex to return the same matches for the same input, determinism is a very desirable attribute for LLMs as well. I would say it is more than desirable; it is necessary.

If I want to convert "how do I x" to `api.howTo("x")`, it is very important that I get the exact same result every time.
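A tiny illustration of that "regex on steroids" expectation: the same utterance must always map to the same structured call. `nl_to_call` is a hypothetical LLM-backed parser returning something like `("howTo", ("x",))`.

```python
def test_intent_mapping_is_stable(nl_to_call):
    first = nl_to_call("how do I x", temperature=0.0, seed=0)
    for _ in range(10):
        # Any drift here breaks the contract between the utterance and the API call.
        assert nl_to_call("how do I x", temperature=0.0, seed=0) == first
    assert first == ("howTo", ("x",))   # expected mapping to api.howTo("x")
```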

That was my thinking exactly. But semantic equivalence is also only relevant when the output needs to be factual, not necessarily for ALL outputs (if we're aiming for LLMs to present as "human", or for interactions with LLMs to feel like natural conversation...). This excludes the world where LLMs act as agents, where you would of course always want the LLM to be factual and thus deterministic.

I agree that we need stochasticity in a probabilistic system, but I also think it would be good to control it. For example, we want the stochasticity introduced by sampling at higher temperatures, since it is inherent to the model, but we don't need the stochasticity in the matrix computations, as it is not required for the modeling at all.
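The distinction in miniature: sampling noise can be made reproducible by seeding it (as in the earlier replay example), while floating-point reduction-order noise is stochasticity nobody asked for. Summing the same numbers in a different order can shift a value by an epsilon, which is enough to flip an argmax near a tie.

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
forward  = x.sum()
backward = x[::-1].sum()
print(forward == backward)        # often False: float addition is not associative
print(abs(forward - backward))    # a tiny epsilon, but enough to reorder near-ties
```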

I don't think the claim is that this is particularly helpful for consumer-facing applications. But from a research perspective, this is invaluable for allowing reproducibility.

It's also simply easier to debug deterministic inference.