You mean "corporate inference infrastructure", not LLMs. The reason for different outputs at t=0 is mostly batching optimization. LLMs themselves are indifferent to that; you can run them deterministically any time if you don't care about optimal batching and the lowest possible inference cost. And even then, e.g. Gemini Flash is deterministic in practice even with batching, although DeepMind doesn't strictly guarantee it.
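A minimal sketch of the underlying mechanism (plain Python, not any vendor's actual kernel): floating-point addition is non-associative, so the reduction order inside a batched kernel, which can vary with batch size and kernel selection, can change logits by a few ULPs, which is enough to flip a near-tie at t=0.

```python
# Float addition is non-associative: the same three numbers summed in
# two different orders give bitwise-different results.
xs = [0.1, 0.2, 0.3]

left_to_right = (xs[0] + xs[1]) + xs[2]  # one reduction order
right_to_left = xs[0] + (xs[1] + xs[2])  # another reduction order

print(left_to_right == right_to_left)   # False: same math, different bits
```

Pinning the reduction order (e.g. batch-invariant kernels) removes this source of nondeterminism, at some cost in throughput.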
This is all currently irrelevant; making it work well is a much bigger problem. As soon as there's paying demand for reproducibility, solutions will appear. This is a matter of business need, not a technical issue.