Another possible explanation, especially if quality degrades at all (i.e. on OpenAI), is aggressive quantization.
Another possible explanation is speculative decoding, where you trade unused GPU memory for speed (via a small draft model).
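To make the mechanism concrete, here's a toy sketch of speculative decoding with greedy decoding (the model functions are made-up stand-ins, not a real inference stack): a cheap draft model proposes k tokens, the expensive target model verifies them in one batch, and only the agreeing prefix is kept, so the final output is identical to running the target model alone.

```python
# Toy sketch of greedy speculative decoding. `draft_next` and `target_next`
# are hypothetical stand-in "models" over a 10-token vocabulary; in a real
# stack the target verifies all k proposed positions in a single batched pass.

def draft_next(ctx):
    # Small, fast model (toy rule: last token + 1, mod 10).
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Big, slow model (toy rule: mostly agrees with the draft,
    # disagrees whenever the context length is a multiple of 4).
    return (ctx[-1] + 1) % 10 if len(ctx) % 4 else (ctx[-1] + 2) % 10

def speculative_decode(ctx, n_tokens, k=4):
    out = list(ctx)
    while len(out) - len(ctx) < n_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, cur = [], list(out)
        for _ in range(k):
            t = draft_next(cur)
            proposal.append(t)
            cur.append(t)
        # 2. Target model verifies each proposed position; keep the
        #    agreeing prefix, substitute its own token at the first
        #    disagreement, and discard the rest of the proposal.
        accepted, cur = [], list(out)
        for t in proposal:
            want = target_next(cur)
            if want == t:
                accepted.append(t)
                cur.append(t)
            else:
                accepted.append(want)
                break
        out.extend(accepted)
    return out[len(ctx):][:n_tokens]

def plain_decode(ctx, n_tokens):
    # Reference: pure target-model greedy decoding.
    out = list(ctx)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(ctx):]

print(speculative_decode((5,), 8) == plain_decode((5,), 8))  # True
```

The point of the trick is that verification is a single forward pass over k positions, while the draft model's k passes are much cheaper than the target's, so you get several target-quality tokens for roughly the cost of one target step when the draft usually agrees.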
But my money is on the exact two mechanisms the OP proposes.
> especially if quality degrades at all
It is worth noting that consumers are completely incapable of detecting quality degradation with any accuracy. That is almost a given, since model outputs are effectively random, but there is also a strong tendency to hallucinate degradations. Having done frontend work for an AI startup, I can say that complaints about the model degrading were by far the most common, even though our model never changed, and users could easily verify that it didn't change because we exposed seeds. A significant portion of complainers continued to complain about model degradation even after being shown that they could regenerate from the same seed and input and get the exact same output. Humans, at scale, are essentially incapable of comprehending the concept of randomness.
You can jiggle sampling settings around without changing the seed. That's effectively the same thing in practice but even sneakier. (Though it wouldn't speed up inference unless they were doing something expensive like beam search and turned that off!)
Yeah, they can't tell, but there's also plenty of incentive for major LLM providers to deny doing something that would massively cut their inference costs if they did it.
Wait, sorry, how did you use and expose seeds? That's the most interesting part of your post.
We were not a ChatGPT wrapper; we used a finetuned open-source model running on our own hardware, so we naturally had full control of the input parameters. I apologize if my language was ambiguous, but by "expose seeds" I simply meant that users could see the seed used for each prompt and input their own in the UI, rather than "exposing secrets" of the frontier LLM APIs, if that's what you took it to mean.
I just wanted deterministic outputs and was curious how you were doing it. Sounds like probably temp = 0, which major providers no longer offer. Thanks for your response.
No, seed and temperature are separate parameters accepted by the inference engine. You can still get deterministic outputs with high temp if you're using the same seed, provided the inference engine itself operates in a deterministic manner, and the hardware is deterministic (in testing, we did observe small non-deterministic variations when running the same prompt on the same stack but a different model of GPU).