I think he’s saying they’re adding an insane level of complexity to shave ~100ms off response times in a scenario where that isn’t important and might even be a negative

When GP mentioned reducing conversational latency as a negative, that made sense (and should probably be done IMO); it just wasn't the same category of latency the article talks about reducing. I.e. increasing "network latency" just makes the conversation feel more and more out of sync; it doesn't change the rate at which the AI will interrupt ("turn latency"), because the latter is based on the duration of the pause in the audio stream, not on how long it took to deliver that audio stream.
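To make that concrete, here's a rough endpointing sketch (made-up names, not any real voice-stack API): the end-of-turn decision only ever looks at capture-time timestamps carried inside the stream, so delivery delay can't change the decision, only when the receiver learns of it.

    # Illustrative only: "turn latency" is decided from timestamps stamped
    # at capture time, inside the stream; network arrival time never appears.
    SILENCE_TURN_THRESHOLD_S = 0.7  # the separately tunable turn-taking knob

    def end_of_turn(frames, threshold_s=SILENCE_TURN_THRESHOLD_S):
        # frames: iterable of (capture_timestamp_s, is_speech) pairs
        silence_start = None
        for capture_ts, is_speech in frames:
            if is_speech:
                silence_start = None
            elif silence_start is None:
                silence_start = capture_ts
            elif capture_ts - silence_start >= threshold_s:
                return True  # the pause in the stream crossed the threshold
        return False

Doubling the network latency shifts when you find out about the pause, not whether it counts as one.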

If you meant there's a case where reducing the network latency, at the same delivery reliability for a given audio stream, is actually a negative, then I'd love to hear more about it, as I'm a network guy always in search of an excuse for latency :D

But you want to be able to interject “hold on…” and have it immediately stop talking when it goes off the rails.

And GP is correctly pointing out that the only negative here (the silence-waiting latency maybe being too low) is tunable separately from the network latency number.
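I.e. the two knobs live in different layers entirely; an illustrative (made-up) config:

    config = {
        "network":     {"jitter_buffer_ms": 60},    # delivery latency/reliability
        "turn_taking": {"end_of_silence_ms": 700},  # how long a pause ends a turn
    }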

I want to be able to click the "Stop" button on my earphones' remote. I want to be able to interject "woah" or "stop!" or "wait!", or have it detect that I've inhaled a breath, or that my eyes have glazed over. I want the LLM to figure out that every speed setting for its voice output is in "auctioneer" territory rather than "lecturing university professor with tenure and a pension" pacing.
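Barge-in isn't even hard; the client-side loop is roughly this (a sketch with hypothetical player/recognizer interfaces, not any real API):

    import queue

    STOP_WORDS = {"stop", "wait", "woah", "hold on"}

    def barge_in_watcher(mic_chunks: queue.Queue, player, recognizer):
        # mic_chunks yields raw audio captured while the TTS is speaking;
        # player and recognizer stand in for whatever the stack provides.
        while player.is_playing():
            chunk = mic_chunks.get()
            if recognizer.is_speech(chunk):   # VAD fired: the user spoke
                player.stop()                 # halt playback immediately
                return
            # stricter variant: only react to explicit stop words, e.g.
            # if recognizer.partial_text(chunk).lower().strip() in STOP_WORDS: ...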

But we won't get any of that, because the prime directive of LLMs is to burn tokens like there's no tomorrow. Burn tokens on a naïve answer without asking clarifying questions. Burn tokens on writing, debugging, and running a Python script or accessing and parsing 10 websites without asking for consent. Burn tokens on half-baked images with misspellings and 31 fingers. Burn tokens arguing "how many 'r's in strawberry?". Burn tokens asking a followup question at the end of every single answer, begging the user to re-engage and burn more tokens.

There is a little red "Stop" control when text output is being produced, at least, but does "Stop" halt everything and throw away the context? Does it re-prompt from the beginning?

The "maximize tokens burnt" prime directive is not to be found in any system prompt or user documentation. It is seemingly a common feature of the training for any consumer model.

Currently, if I'm using voice with an LLM, I use the keyboard's voice-dictation feature, because then the response is in text. There is no way to prevent it from "responding in kind" if I query the thing with audio. Or in Swahili.

Newer models tend to use fewer thinking tokens to solve the same problems, which is a strong counterexample to your entire comment
