Good TTS feels like something that should be built natively into every consumer device, so the user can decide whether to read or listen to the text at hand.
I'm surprised that phone manufacturers do not, for example, expose good TTS models through their browser APIs so that websites can build good audio interfaces.
I for one would love to build a text editor that the user can operate completely via audio. Text input might already be feasible via the "speak to type" feature that both Android and iOS offer.
But there seems to be no good way to output spoken text without doing round-trips to a server and generating the audio there.
The interface I have in mind would let you dictate text and then give commands like "Ok editor, read the last paragraph" or "Ok editor, delete the last sentence".
It could be cool to write this way while walking, with just a headset connected to a phone that sits in one's pocket.
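A rough sketch of how the "Ok editor, ..." commands could be told apart from dictated text once a speech-to-text transcript is available. Everything here is an assumption for illustration: the names `parse_command` and `apply_utterance`, the tiny command vocabulary (only the two examples from the comment), and the naive sentence and paragraph splitting. Actual speech recognition and audio output are out of scope.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple, List

# Hypothetical command grammar: only "read"/"delete" on the last paragraph/sentence.
COMMAND = re.compile(
    r"^\s*ok,?\s+editor[,.]?\s+(read|delete)\s+the\s+last\s+(paragraph|sentence)\W*$",
    re.IGNORECASE,
)


@dataclass
class Command:
    action: str  # "read" or "delete"
    unit: str    # "paragraph" or "sentence"


def parse_command(transcript: str) -> Optional[Command]:
    """Return a Command if the transcript is an editor command, else None."""
    match = COMMAND.match(transcript)
    if match is None:
        return None
    return Command(match.group(1).lower(), match.group(2).lower())


def split_units(text: str, unit: str) -> List[str]:
    """Naive splitting: blank lines separate paragraphs, .!? end sentences."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    else:
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p.strip()]


def apply_utterance(text: str, transcript: str) -> Tuple[str, Optional[str]]:
    """Apply one utterance; return (new_text, text_to_speak_or_None)."""
    command = parse_command(transcript)
    if command is None:
        # Not a command, so treat it as dictation and append it.
        separator = " " if text and not text.endswith("\n") else ""
        return text + separator + transcript.strip(), None
    units = split_units(text, command.unit)
    if not units:
        return text, None
    last = units[-1]
    if command.action == "read":
        return text, last.strip()
    # "delete": cut everything from the start of the last unit onwards.
    return text[: text.rfind(last)].rstrip(), None
```

In a real editor the second return value would be handed to whatever TTS engine is available on the device, and the wake phrase would need to be far more tolerant of recognition errors than a single regular expression.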
On macOS you can "speak" text in almost every app using a built-in voice (like the Siri voice or some older voices). It's all offline, and it even works from the terminal with "say".
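For anyone who wants to script the terminal route, here is a minimal sketch that wraps "say" from Python. The helper names `build_say_command` and `speak` are made up for this example, and "Samantha" is only a sample voice name; the voices actually installed differ per machine. The text is passed on standard input, which "say" reads when no message argument is given, so text that starts with a dash is not mistaken for an option.

```python
import shutil
import subprocess
from typing import List, Optional


def build_say_command(
    voice: Optional[str] = None,
    rate_wpm: Optional[int] = None,
    out_file: Optional[str] = None,
) -> List[str]:
    """Build the argument list for macOS `say` (text is supplied on stdin)."""
    command = ["say"]
    if voice:
        command += ["-v", voice]          # voice name, machine-dependent
    if rate_wpm:
        command += ["-r", str(rate_wpm)]  # speaking rate in words per minute
    if out_file:
        command += ["-o", out_file]       # write audio to a file instead of playing it
    return command


def speak(text: str, **options) -> None:
    """Speak `text` offline via `say`; raises if the tool is not available."""
    if shutil.which("say") is None:
        raise RuntimeError("`say` not found; this sketch only works on macOS")
    subprocess.run(build_say_command(**options), input=text, text=True, check=True)


if __name__ == "__main__":
    speak("Read the last paragraph.", voice="Samantha", rate_wpm=180)
```

Running `say -v '?'` in the terminal lists the voices installed on the machine, which is the safest way to find the exact name to pass as `voice`.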
I tried it a few months ago to narrate an epub in Apple Books and it was broken in a very weird way. It started out decent, but after a few pages it began slurring, skipping words, and trailing off without finishing sentences, and then it went silent.
(I've just tried it again without seeing that issue within a few pages)
> Siri voice or some older voices
You can choose "Enhanced" and "Premium" versions of voices, which are larger and sound nice and modern to me. The "Serena Premium" voice I was using is over 200 MB and far better than this Show HN. It's very natural, but it is kind of ruined by diabolical pronunciation of anything slightly non-standard, which sadly seems to cover everything I read, e.g. people and place names, technical or scientific terms, or any neologisms in sci-fi and fantasy.
It's so wildly incomprehensible for, e.g., Tibetan names in a mountaineering book that you have to check the text. If the word being butchered is repeated frequently, e.g. the main character's name, then it's just too painful to use.
Can't most people read faster than they can hear? Isn't this why phone menus are so awful?
> But there seems to be no good way to output spoken text without doing round-trips to a server
As people have been pointing out, we've had mediocre TTS since the 80s. If it were a real benefit, people would be using even the inadequate version.