I'm not sure current tech could reliably take that order, honestly. There's essentially 0 chance it would try to disambiguate the meaning of "one of them", and from there it's a tossup whether you'll get a double cheeseburger, a double box of fries, or double mayo.
Current tech is pretty dang close. I gave the order to ChatGPT and it parsed it almost perfectly [0], even handling the ambiguity about what happens if you add a combo to an order that already includes several fries à la carte. The only thing it missed is that I didn't actually order the combo (I merely wanted to know how much the upgrade costs), but I'm sure some fine-tuning could solve that. (Come to think of it, a fast food restaurant would consider this implicit upsell as a feature.)
The main challenge AI would face is people who come by at 3 AM drunk and stoned, indecisively slurring through their order, but I imagine there'd be a system to redirect these edge cases to an actual human.
[0] https://chatgpt.com/share/68ba2233-9f48-8011-905a-c69cc5e91b...
"Pretty dang close" isn't the same as accurate when time and money are changing hands. Voice->text with a noisy background is a particularly hard problem, especially on hardware not designed to limit background noise. Try it. Whisper is still the leading speech->text model in our tests, but once you also need noise reduction, echo handling, diarization, etc., it's a hard problem.
>Come to think of it, a fast food restaurant would consider this implicit upsell as a feature.
Yeah, just what every restaurant manager wants: to deal with customers who paid more for things they didn't order.
It can't. Not reliably. I think every major chain that was trying it has ripped it out.
It'll definitely be a thing within 5 years, max, but it's not mature enough for production yet.