But you want to be able to interject “hold on…” and have it immediately stop talking when it goes off the rails.
And GP is correctly pointing out that the only downside here (the silence-wait timeout maybe being too short) is tunable separately from the network-latency number.
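To make that separation concrete, here is a minimal sketch of end-of-turn detection — not any vendor's actual API, just an illustration that "how long to wait in silence before responding" is its own knob, with network latency nowhere in the formula:

```python
def end_of_turn(frames, silence_ms, frame_ms=20):
    """Return the index of the frame at which the turn is considered
    finished, i.e. after `silence_ms` of uninterrupted silence, or None
    if the speaker is still (possibly) mid-utterance.

    `frames` is a sequence of booleans: True = speech, False = silence.
    """
    needed = silence_ms // frame_ms  # consecutive silent frames required
    quiet = 0
    for i, is_speech in enumerate(frames):
        quiet = 0 if is_speech else quiet + 1
        if quiet >= needed:
            return i
    return None

# Five frames of speech, then silence. A short 100 ms window closes the
# turn quickly; a long 400 ms window makes the assistant wait longer.
# Network round-trip time never enters into it.
stream = [True] * 5 + [False] * 30
print(end_of_turn(stream, silence_ms=100))  # 9
print(end_of_turn(stream, silence_ms=400))  # 24
```

Tune `silence_ms` down and the assistant jumps in faster (at the risk of cutting you off); tune it up and it waits politely — in either case it's a local policy decision, independent of how fast the network is.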
I want to be able to click the "Stop" button on my earphones' remote. I want to be able to interject "woah" or "stop!" or "wait!", or have it detect that I've inhaled a breath or that my eyes have glazed over. I want the LLM to figure out that every speed setting for its voice output is in "auctioneer" territory rather than "lecturing university professor with tenure and a pension" pacing.
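The interruption part of that wishlist is classic barge-in. A hypothetical sketch of the core loop — the RMS threshold, frame size, and `min_frames` debounce are all made-up illustrative numbers, not any real product's values:

```python
import math

def rms(frame):
    """Root-mean-square energy of one audio frame (a list of samples)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def frames_until_interrupt(mic_frames, threshold=0.1, min_frames=2):
    """While TTS playback runs, watch the mic stream; return the index of
    the playback frame at which the user barges in (min_frames consecutive
    loud frames), or None if they never do."""
    loud = 0
    for i, frame in enumerate(mic_frames):
        loud = loud + 1 if rms(frame) > threshold else 0
        if loud >= min_frames:
            return i  # cut the TTS stream here
    return None

quiet = [0.01] * 160  # 160-sample frame of near-silence
shout = [0.5] * 160   # frame where the user starts talking
print(frames_until_interrupt([quiet, quiet, shout, shout, shout]))  # 3
```

A real implementation has to subtract the assistant's own voice from the mic signal (echo cancellation) before thresholding, or it will "interrupt" itself — but the control flow is no more exotic than the above.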
But we won't get any of that, because the prime directive of LLMs is to burn tokens like there's no tomorrow. Burn tokens on a naïve answer without asking clarifying questions. Burn tokens on writing, debugging, and running a Python script, or on accessing and parsing 10 websites, without asking for consent. Burn tokens on half-baked images with misspellings and 31 fingers. Burn tokens arguing about "how many 'r's in strawberry?". Burn tokens asking a follow-up question at the end of every single answer, begging the user to re-engage and burn more tokens.
There is a little red "Stop" control when text output is being produced, at least, but does "Stop" halt everything and throw away the context? Or re-prompt from the beginning?
The "maximize tokens burnt" prime directive is not to be found in any system prompt or user documentation. It is seemingly a common feature of the training for any consumer model.
Currently, if I'm using voice with an LLM, I use the keyboard's voice-dictation feature, because then the response comes back as text. There is no way to prevent it from "responding in kind" if I query the thing with audio. Or in Swahili.
Newer models tend to use fewer thinking tokens to solve the same problems, which is a strong counterexample to your entire comment.