> Hmm the quality is not so impressive. [...] And I don't mind throwing GPU power at it.
This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on DigitalOcean droplets. No GPU to speak of. It cost next to nothing to run.
Since then, the trend has been to scale up. We need more models to scale down.
In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.
Edge AI will be huge for robotics, toys and consumer products, and gaming (i.e. world models).
> This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
I know, but it was more of a general comment. A really good TTS just isn't around yet in the OSS sphere. I looked at some of the other suggestions here, but they have too many quirks. Dia sounds great, but messages must have certain lengths, etc., and it picks a random voice every time. I'd love to have something self-hosted that's as good as OpenAI.