I've been using Kokoro TTS with the CLI app, audiblez, mentioned in the "Similar Projects" section of the README. The model is fast and delivers impressive quality for its small size. Some issues I have faced, however, are: a) It doesn't distinguish periods at the end of sentences from the dots in abbreviations such as "Mr." or "Mrs." The result is an awkward pause between "Mr." and the name. b) It doesn't handle ellipses well. c) Words are pronounced the same way regardless of context.

I fixed that here: https://github.com/cpttripzz/audiblez. The main problem with Kokoro is how flat and lifeless it sounds, but it is very fast. I prefer Chatterbox TTS, but it is around 20 times slower and will not work without a GPU.

Look into SSML phoneme tags. Some TTS engines support them. That way you can use a powerful LLM to fix these issues ahead of TTS.
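
For anyone who hasn't seen the tag, here is a rough sketch of what the preprocessing output could look like. The `phoneme` element with `alphabet` and `ph` attributes comes from the SSML spec; the helper names (`phoneme`, `to_ssml`) and the token format are made up for illustration, and whether a given engine actually honors the tag is something you would have to check.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word, ipa):
    """Wrap one word in an SSML <phoneme> tag carrying an IPA pronunciation."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


def to_ssml(tokens):
    """Build an SSML document from a list of tokens.

    Each token is either a plain string or a (word, ipa) tuple. The tuples are
    where an upstream step (for example an LLM) has picked a pronunciation.
    """
    parts = [phoneme(*t) if isinstance(t, tuple) else escape(t) for t in tokens]
    return "<speak>" + "".join(parts) + "</speak>"


# Past-tense "read", disambiguated before the text reaches the TTS engine.
print(to_ssml(["I ", ("read", "rɛd"), " it yesterday."]))
```

The point is that the disambiguation happens in text, so the TTS model itself can stay small.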

The Mr. / Mrs. thing feels like it would be a pretty easy fix, at least to eliminate a lot of the more common cases.
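
Something like this is what I have in mind: a minimal sketch, not what audiblez or Kokoro actually does. The `TITLES` table and `expand_titles` are hypothetical names, and the table is deliberately tiny.

```python
import re

# Hypothetical, deliberately short table of titles and their spoken forms.
TITLES = {"Mr": "Mister", "Mrs": "Missus", "Dr": "Doctor", "Prof": "Professor"}

# Match a known title followed by a dot, but only when a capitalized word
# comes next, so a sentence-final period is left alone.
_TITLE_RE = re.compile(r"\b(" + "|".join(TITLES) + r")\.(?=\s+[A-Z])")


def expand_titles(text):
    """Replace 'Mr.' and friends with their spoken form before TTS."""
    return _TITLE_RE.sub(lambda m: TITLES[m.group(1)], text)


print(expand_titles("Mr. Smith met Mrs. Jones."))
```

It only covers the common cases: "Dr." before a capitalized word can also mean "Drive", and this sketch would get that wrong.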

^ A thought that everyone has had at one point when processing human text before learning the hard way (like end of sentence detection). :P

The difference is that even weak LLMs are good at magically doing this, so I wonder what the problem is for the TTS mentioned above.

Kokoro is small and fast because all the text -> phoneme conversion is done by “dumb code” and only the phoneme -> sound part is done using a neural net.
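
To illustrate why that design pronounces words the same way regardless of context, here is a toy lookup-based text -> phoneme step. This is not Kokoro's actual front end; `LEXICON` and `g2p` are invented for the example and the IPA strings are only illustrative.

```python
import re

# Toy pronunciation dictionary: one fixed entry per spelling.
LEXICON = {
    "i": "aɪ",
    "will": "wɪl",
    "read": "riːd",  # only one pronunciation can be stored per spelling
    "it": "ɪt",
    "yesterday": "ˈjɛstərdeɪ",
}


def g2p(text):
    """Convert text to a list of phoneme strings by per-word lookup.

    Each word is looked up on its own, so nothing about the neighboring
    words can influence the result. Unknown words map to "?".
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [LEXICON.get(w, "?") for w in words]


past = g2p("I read it yesterday")
future = g2p("I will read it")
# "read" gets identical phonemes in both sentences, although the first
# one should sound like "red".
print(past[1] == future[2])
```

Fixing this inside the lookup step means adding context (part-of-speech tags, tense cues), which is exactly the kind of thing an LLM pass ahead of the TTS could supply.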