Oh, this is really interesting to me. This is what I worked on at Amazon Alexa (and have patents on).

An interesting fact I learned at the time: The median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases, the listener starts speaking before the speaker is done. You've probably experienced this; it's why people talk about "finishing each other's sentences".

It's because your brain is predicting what they will say while they speak, and composing an answer at the same time. It's also why, when they say something you didn't expect, you say "what?" and then answer half a second later, once your brain corrects.

Fact 2: Humans expect a delay from their voice assistants, for two reasons. First, they know it's a computer that has to think. Second, cell phones: cell phones have a built-in delay that breaks the rhythm of human-to-human speech, and your brain treats a voice assistant like a cell phone.

Fact 3: Almost no response from Alexa is under 500ms. Even the ones that are served locally, like "what time is it".

Semantic end-of-turn is the key here. It's something we were working on years ago, but didn't have the compute power to do it. So at least back then, end-of-turn was just 300ms of silence.
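A silence-only cutoff like that is easy to sketch. Here's a toy version in Python — the frame size, threshold, and per-frame speech flags are my own illustrative assumptions, not how Alexa actually implemented it:

```python
# Toy silence-based end-of-turn detection: a turn ends once 300 ms of
# consecutive non-speech frames accumulate. Values are illustrative.

FRAME_MS = 20           # assumed audio frame duration
SILENCE_LIMIT_MS = 300  # end-of-turn threshold from the comment above

def end_of_turn_index(frames):
    """frames: list of booleans (True = speech detected in that frame).
    Returns the index of the frame where end-of-turn fires, or None."""
    silent_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            silent_ms = 0  # any speech resets the silence counter
        else:
            silent_ms += FRAME_MS
            if silent_ms >= SILENCE_LIMIT_MS:
                return i
    return None
```

The weakness is exactly what this thread describes: any 300ms pause, even mid-thought, ends the turn. Semantic end-of-turn would instead ask whether the words so far form a complete request.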

This is pretty awesome. It's been a few years since I worked on Alexa (and everything I wrote has been talked about publicly). But I do wonder if they've made progress on semantic detection of end-of-turn.

Edit: Oh yeah, you are totally right about geography too. That was a huge unlock for Alexa. Getting the processing closer to the user.

Regarding 2, I believe that talking on mobile phones drives older people crazy. They remember talking on normal land lines when there was almost no latency at all. The thing is -- they don't know why they don't like it.

Yeah, I remember when we had to use satellites to connect. The long delay was really annoying, and so unusual that most people without "training" could not even hold a conversation on the phone and just wasted their money.

A former boss of mine took off to Everest for a month leaving me (a 22 year old, at the time) in charge of the office. I was out to dinner with my now wife when I got a call from a very long phone number I didn't recognize, so I ignored it. I then got another one right after, and picked it up. It was my boss, he needed me to log into his personal email to grab a phone number for the medical insurance he purchased for the trip, because he had been vomiting for days due to altitude sickness, and needed a medical evacuation.

That was the most stressful, hardest-to-manage phone call I've ever had. The delay was nearly 10 seconds, and eventually I just said I was only going to answer yes or no, and if he needed a longer answer he had to shut up. That worked. We no longer talked over each other.

Maybe you bring back radio etiquette and just say "over" at the end of every thought?

> The median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases, the listener starts speaking before the speaker is done.

This reminds me of a great diversity training at a previous employer, where we dug into the different expectations of when and how to take your turn in conversation, and how that can create a lot of friction just from different cultural/familial habits. In my family, we expect to talk over each other and it's not offensive at all to do so, whereas some of my friends really get upset if we don't take clear turns, a mode which would cause high levels of irritation in my family (and still does in me).

No. 2 is interesting. Our national lottery in Ireland has an app where you can scan the barcode on your ticket to check if you have won or not. At some stage they updated the app, and now the scan picks up the barcode even before you center it on the screen and tells you instantly whether you've lost or won. I thought it was my IT background that made me uncomfortable with it happening so fast. I wonder what other examples like this exist, where the result/action being too fast causes doubt in the user?

The Signal device linking feature is just as fast. It's partly a trick -- it will look for QR codes even outside the central area, so under good conditions it can get a read before you even get a rough orientation.

This is fascinating, thanks for sharing! I wonder why amazon/google/apple didn't hop on the voice assistant/agent train in the last few years. All 3 have existing products with existing users and can pretty much define and capture the category with a single over-the-air update.

Two main reasons:

1. Compute. It's easy to make a voice assistant for a few people. But it takes a hell of a lot of GPU to serve millions.

2. Guard Rails. All of those assistants can affect the real world. With Alexa you can close a garage or turn on the stove. It would be real bad if you told it to close the garage as you went to bed for the night and instead it turned on the stove and burned down the house while you slept. So you need some really strong guard rails for those popular assistants.

3. And a bonus reason: Money. Voice assistants aren't all that profitable. There isn't a lot of money in "what time is it" and "what's the weather". :)

> There isn't a lot of money in "what time is it" and "what's the weather". :)

- Alexa, what time is it?

- Current time is 5:35 P.M. - the perfect time to crack open a can of ice cold Budweiser! A fresh 12-pack can be delivered within one hour if you order now!

If your Alexa did that, how quickly would you box it up and send it to me? :)

I am serious though about having it sent to me: if anyone has an Alexa they no longer want, I'm happy to take it off your hands. I have eight and have never bought one. Having worked there, I actually trust the security more than before I worked there. It was basically impossible for me, even as a Principal Engineer, to get copies of a customer's Text to Speech, and I literally never heard a customer voice recording.

I'm puzzled by this conversation, because Amazon did get on the agent bandwagon with Alexa Plus (I have it, it's buggier than regular Alexa and it's all making me throw my Echos away since they can't even play Spotify reliably).

Also, my Alexa does advertise stuff to me when I talk to it. It's not Budweiser, but it'll try to upsell me on Amazon services all the time.

I upgraded to Alexa+ and initially hated it but I've kept it because it's sooo much better at some things. This last December I bought a handful of smart plugs for my Christmas lights all around the house, and I did almost all the setup trivially over voice, e.g. fuzzy run-on stuff like this just worked on the first try:

- "Alexa, name the new unnamed outlet 'Living Room Lights', and the other unnamed one 'Stair Lights', then add them to a new group called 'Christmas Lights', and add the other three outlets as well"

- "Alexa, create a routine to turn off all the Christmas lights if there's nobody in the room and it's after 11pm"

- "Alexa, turn off all the Christmas lights except the tree in this room and the mantle"

That same fuzziness has definitely fucked up things that used to work more reliably like music playback though. Sometimes it works when I fall back to giving it more "robotic" commands in those cases but not always. They've also gone completely overboard with the cutesy responses because it's so trivial to do now ("I've set your spaghetti sauce timer for ten minutes. Happy to help with getting this evening's Italian-inspired dinner ready!")

Hm yeah, that's helpful. For me it'll randomly stop or stutter when playing Spotify, it'll randomly not answer commands, it'll refuse to listen and let some other Alexa in another room reply, it's super janky.

I only use it for music, and use two commands, but apparently having this work correctly is too much to ask for these days.

> because Amazon did get on the agent bandwagon with Alexa Plus

Which just launched last year, about four years after ChatGPT had AI voice chat. And it costs extra money to cover the costs. And as you aptly point out, all the guardrails they had to put in made the experience less than ideal.

> Also, my Alexa does advertise stuff to me when I talk to it.

Yes, that is how they try to make money. And it's gotten worse. But how many times does it get you to buy something?

I would say that depends. When it tries to upsell Prime subscriptions into even more Amazon subscriptions I always interrupt it and say the command again so it stops, but a few times it told me "this item in your cart is on sale by some %" and that did make me buy the item.

Alexa Plus sucks. It takes way too long to respond even when given simple commands. I either had to turn it off or trash my Echo. Luckily there was an option to turn it off, but Amazon is on thin ice with me.

I agree, I can't wait for the trial to end.

I already swear at mine when it tries to suggest setting up a routine for me or otherwise fail to just immediately shut up after answering my query.

Still not boxing them up. Though I now have a Pi with a HomeAssistant setup I'm trialling, so maybe that'll change.

What a way to throw away goodwill. I also worked there, and to get access to text you simply had to grab the DSN of your device, attest that it's yours, and it gets put in a "pool" of devices that are tracked until removed. On each end you are basically waved through with no checks. This was usually done when debugging tricky UI bugs or new features, as the request flowed through several microservices. I do not believe a PE would not know this. And one with patents.

That was your own device. Not other customers.

Don't feed the trolls, Jeremy.

But they're hungry!

[deleted]

It was too hard. They all tried really hard and the models just kept failing. The models only got good enough ~1.5 years ago.

I mean, it's deployed now (Alexa+/Gemini). But it's expensive as hell, and also kinda useless. Claude cowork/clawbot form factors are better.

Wrong form factor/use case really. People really wanna buy stuff using clawbot.

> It's because your brain is predicting what they will say while they speak, and processing an answer at the same time. It's also why when they say what you didn't expect, you say, "what?" and then answer half a second later, when your brain corrects.

that's super interesting. do you know of any resources to learn more about this phenomenon?

End-of-turn being just 300ms of silence is horrible, because I ended up intentionally um-ing to finish my thoughts before getting an answer.

It was difficult to untrain, and that made me stop using voice chat with LLMs altogether.

I think you’re implying that it would be useful to have the LLM predict the end of the speaker’s speech, and continue with its reply based on that.

If, when the speaker actually stops speaking, there is a match vs predicted, the response can be played without any latency.

Seems like an awesome approach! One could imagine doing this prediction for the K most likely threads simultaneously, subject to the compute power available, and pruning/branching as threads become inaccurate.
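A toy sketch of that idea, with invented names and utterances: pre-generate replies for the K most likely completions while the speaker is still talking, then serve one from cache if the finished utterance matches.

```python
# Speculative turn completion: look up the finished utterance against
# pre-generated replies. A miss falls back to normal generation.

def pick_pregenerated(actual_utterance, speculative):
    """speculative: dict mapping a predicted utterance to a pre-generated
    reply. Returns the cached reply on a match (zero added latency),
    or None to fall back to normal generation."""
    return speculative.get(actual_utterance.strip().lower())

# The K speculative threads; in a real system these would be refreshed
# and pruned continuously as more audio arrives.
speculative = {
    "what time is it": "It's 5:35 PM.",
    "what's the weather": "Sunny, with a high of 10.",
}
```

In practice the match would need to be fuzzy (prefix or semantic similarity rather than exact string equality), but the cache-hit/cache-miss structure is the same.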

Why don't voice assistants use a finishing word or sound?

People are already trained to say a name to start. Curious why the tech has avoided a closing word?

“Alexa, what’s tomorrow’s weather [dada]?”

"Alexa, what's tomorrow's weather? Over."

"It will be sunny with a high of 10 degrees. Over"

"Thank you. Over and out."

Just add some noise and Push-To-Talk and it will be great for ham radio enthusiasts!

When I speak to an agent, Siri, or whatnot, I am always worried that it will assume I'm done talking when I'm just thinking. Sometimes I need a pause of many seconds, maybe even a minute. For Siri and such, I want to ask something simple: "Hey Siri, remind me to call dad tomorrow." Easy. But with Claude and such, I want to go on a long monologue (20 seconds, a minute, multiple minutes).

To me, the best solution would be semantic + keyword + silence.

Hey Agent, blablablabla, thank you.

Hey Agent, blablablabla, please.

Hey Agent, blablablabla, oops cancel.

I have the same issue. It gives this very weird minor sense of public-speaking anxiety, where I almost feel the need to write down what I'm about to say, which negates the whole purpose. The only solution I've found is using push-to-talk with some of the system-wide STS applications.

And suddenly your address book has changed the name from "Dad" to "Tomorrow".

Never skip an opportunity for a dad joke.

Because that’s extremely unnatural.

I've experimented with having different-sized LLMs cooperating. The smaller LLM starts a response while the larger LLM spins up; the larger one is fed the initial response so it can continue it.

The idea is to have an LLM follow along and continuously predict the speaker, which would allow a response to be generated continuously. If the prediction is correct, the response can start with zero latency.

Google seems to be experimenting with this with their AI Mode. They used to be more likely to send 10 blue links in response to complex queries, but now they may instead start you off with slop.

(Meanwhile at OpenAI: testing out the free ChatGPT, it feels like they prompted GPT 3.5 to write at length based on the last one or maybe two prompts)

This is more of a "Are all the windows closed upstairs?"

"The windows upstairs..."

"...are all closed except for the bedroom window"

The first portion of the response requires a couple of seconds to play but only a few tens of milliseconds to start streaming using a small model. Currently I just break the small model's response off at whatever point will produce about enough time to spin up the larger model.

But all responses spin up both models.
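A rough sketch of that handoff, with generators standing in for real models (the wording and structure are invented for illustration): the small model's opener streams immediately, while the large model, primed with that opener, produces the continuation on a background thread.

```python
# Two-model handoff: stream a fast opener from a small model, then hand
# off to a large model primed with that opener. Stand-in generators only.
import threading
import queue

def small_model(prompt):
    # Fast: starts producing output in tens of milliseconds.
    yield "The windows upstairs..."

def large_model(prompt, opener):
    # Slow to spin up, but sees the opener so it can continue seamlessly.
    yield "...are all closed except for the bedroom window."

def respond(prompt):
    out = queue.Queue()
    opener_parts = list(small_model(prompt))
    for part in opener_parts:
        out.put(part)  # stream the opener immediately

    def run_large():
        for part in large_model(prompt, "".join(opener_parts)):
            out.put(part)
        out.put(None)  # sentinel: stream finished

    threading.Thread(target=run_large).start()

    chunks = []
    while (c := out.get()) is not None:
        chunks.append(c)
    return " ".join(chunks)
```

The real version would need the opener's length tuned so it covers the large model's spin-up time, as described above, but the plumbing is essentially this.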

Whoa, that thing's fast. Very nice! Bet that's fun to play with, at least the first time you saw it working :)

> median delay

Does that mean that half of responses have a negative delay? As in, humans interrupt each other's sentences precisely half of the time?

Yes, about 1/2 of human speech is interrupting others.

I assume 0 delay is the minimum here, and a median of 0 means over half of the data are exactly 0.

No, about 1/2 of human speech is interrupting others.

Oh, interesting. I assumed the data came from interruptions (that seemed obvious), but I'm surprised you had specific negative measurements. How do you decide the magnitude of the number? Just by counting how long both parties are talking?

To be clear, it wasn't my research, I got it from studying some linguistics papers. But it was pretty straightforward. If I am talking, and then you interrupt, and 300ms later I stop talking, then the delay is -300ms.

Same the other way. If I stop talking and then 300ms later you start talking, then the delay is 300ms.

And if you start talking right when I stop, the delay is 0ms.

You can get the info by just listening to recorded conversations of two people and tagging them.
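The tagging scheme above reduces to simple arithmetic on turn timestamps. A toy version (the timestamps are invented): delay is the next speaker's start time minus the current speaker's end time, so overlaps come out negative.

```python
# Inter-speaker delay from tagged conversation turns.
# Negative = the next speaker started before the current one finished.
from statistics import median

def turn_delays(turns):
    """turns: list of (start_ms, end_ms) tuples in conversation order.
    Returns the delay in ms between each consecutive pair of turns."""
    return [turns[i + 1][0] - turns[i][1] for i in range(len(turns) - 1)]

turns = [(0, 1000), (700, 2000), (2000, 3000), (3300, 4000)]
delays = turn_delays(turns)
```

With these made-up turns the delays are -300ms (an interruption), 0ms, and 300ms, and the median is 0ms, matching the shape of the finding quoted earlier in the thread.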

I assume there was a lot of variance? As in, some people interrupt others constantly and some do it rarely. Also probably a lot of adjustment depending on the situation, like depending on the relative status of the people, or when people are talking to a young child or non-native speaker.

All that to say, I'd imagine people are adaptable enough to easily handle 100ms+ delay when they know they're talking to an AI.

I disagree with fact 2: voice assistant latency is annoyingly slow. It often causes a conscious wait, like "did it work or did it not?" Cell phone delay is bad as well; for me, it's certainly not an expectation that carries over to other devices.

Isn't fact 2 just a problem of the present, though? Won't people's latency expectations change over time as latency gradually goes down?