Well, speech synthesizers are pretty much famous for mispronouncing all sorts of things. But what I find very concerning about LLM-based TTS is that some of them can't reliably speak numbers greater than 100. They try, but fail a lot. At least tts-1-hd was getting almost every 3- or 4-digit number wrong. It's especially noticeable when it's supposed to read a year.

Not entirely related, but humans have the same problem.

When scriptwriting for voice-overs, we always write everything out explicitly. So instead of 1 000 000 we would write "one million" or "a million". That's a trivial example, but if the number were 1 548 736 you would almost never be able to just read it off the page. "One million, five hundred and forty-eight thousand, seven hundred and thirty-six", however, can be read without any parsing.

Same with URLs: W W W dot Google dot com.
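
For what it's worth, that expansion step is easy to automate. Here's a rough Python sketch using the num2words library; the URL handling is just a hypothetical illustration of the "W W W dot" convention, not a complete rule set.

```python
import re
from num2words import num2words

def expand_numbers(text: str) -> str:
    # Replace every run of digits with its spoken form, e.g.
    # "1548736" -> "one million, five hundred and forty-eight
    # thousand, seven hundred and thirty-six"
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

def spell_out_url(url: str) -> str:
    # "www.google.com" -> "W W W dot google dot com"
    host = url.removeprefix("https://").removeprefix("http://")
    labels = host.split(".")
    if labels[0].lower() == "www":
        labels[0] = "W W W"
    return " dot ".join(labels)

print(expand_numbers("It cost 1548736 dollars."))
print(spell_out_url("www.google.com"))
```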

Regarding humans, yes and no. If a human constantly had problems with 3- and 4-digit numbers the way tts-1-hd does, I'd ask myself whether they were neurodivergent in some way.

And yes, I added instructions along the lines of what you describe to my prompt. It's just sad that we have to. After all, LLM TTS has solved a bunch of real problems, like switching languages mid-text, or foreign words. The pronunciation is better than anything we ever had. But it fails to read short numbers. I feel like that small issue could probably have been solved with some fine-tuning, but I don't really understand the tech well enough to say.
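
In case it helps anyone, the workaround I ended up with looks roughly like this: normalize the digits before the request so tts-1-hd never sees a raw numeral. This sketch assumes the official openai Python SDK and the expand_numbers() helper from the snippet above; num2words also has a to="year" mode ("nineteen eighty-four" instead of "one thousand, nine hundred and eighty-four"), but reliably detecting year context is left out here.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def speak(text: str, out_path: str = "speech.mp3") -> None:
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice="alloy",
        # Expand digits first, so "signed in 1848" becomes
        # "signed in one thousand, eight hundred and forty-eight".
        input=expand_numbers(text),
    )
    response.write_to_file(out_path)

speak("The treaty was signed in 1848.")
```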

From the web demo, this model is really good at numbers. It rushes through them and slurs them together a bit, but they are all correct, even 7-digit numbers (I didn't test further).

Looks like they are sidestepping these kinds of issues by generating the phonemes with the preprocessing stage of a traditional speech synthesizer and using the LLM only to turn those phonemes into natural-ish sounding speech. That limits how natural the model can become, but it should be able to correctly pronounce anything the preprocessing stage can pronounce.
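
If I understand the approach right, the pipeline would look something like the sketch below. The phonemizer package (a front end for espeak-ng) is real and expands digits deterministically before converting graphemes to phonemes; the synthesize_from_phonemes() call stands in for the neural back end and is purely hypothetical.

```python
from phonemizer import phonemize

text = "Call me back in 1548736 seconds."

# espeak-ng's front end normalizes the digits to words and converts
# graphemes to phonemes, so the neural model never "reads" raw numerals.
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)

# audio = synthesize_from_phonemes(phonemes)  # hypothetical neural back end
```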