Hmm the quality is not so impressive. I'm looking for a really natural-sounding model. I'm not very happy with Piper/Kokoro, and XTTS was a bit complex to set up.

For STT, Whisper is really amazing. But I miss a good TTS. And I don't mind throwing GPU power at it. But anyway, this isn't it either; it sounds worse than Kokoro.

> Hmm the quality is not so impressive. [...] And I don't mind throwing GPU power at it.

This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.

Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on Digital Ocean droplets. No GPU to speak of. It cost next to nothing to run.

Since then, the trend has been to scale up. We need more models to scale down.

In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.

Edge AI will be huge for robotics, toys and consumer products, and gaming (i.e. world models).

> This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.

I know, but it was more of a general comment. A really good TTS just isn't around yet in the OSS sphere. I looked at some of the other suggestions here but they have too many quirks. Dia sounds great, but messages must have certain lengths etc., and it picks a random voice every time. I'd love to have something self-hosted that's as good as OpenAI.

The best open one I've found so far is Dia - https://github.com/nari-labs/dia - it has some limitations, but I think it's really impressive and I can run it on my laptop.

Thanks I'll try! I like how it sounds, the quality is really good. But the limitations are really severe (anything shorter than 5 seconds or longer than 30 seconds is not OK, and it plays a random voice every time; those make it pretty much unusable for an assistant, to be honest).

But it might be worth setting it up and seeing if it improves over time.

You can get a consistent voice by providing a sample. And yeah, the timing stuff is what you have to work around; you basically have to chunk your inputs.
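
To make the chunking idea concrete, here is a rough sketch of how I'd split text on sentence boundaries so each piece lands inside the 5 to 30 second window mentioned above. It doesn't call Dia at all; `chunk_text`, `estimate_seconds` and the 2.5 words/second speaking rate are my own assumptions, not anything from the Dia repo, so tune them against real output.

```python
import re

# Assumed average speaking rate. This is a guess, not a documented Dia figure.
WORDS_PER_SECOND = 2.5
# Duration window taken from the limits mentioned in this thread.
MIN_SECONDS, MAX_SECONDS = 5, 30


def estimate_seconds(text: str) -> float:
    """Rough duration estimate based on word count alone."""
    return len(text.split()) / WORDS_PER_SECOND


def chunk_text(text: str, min_s: float = MIN_SECONDS, max_s: float = MAX_SECONDS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_s estimated seconds.

    Caveats: a single sentence longer than max_s is kept as one chunk, and a
    too-short final chunk is merged into the previous one, which can push that
    chunk slightly over max_s.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_s:
            # Adding this sentence would overshoot, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        if chunks and estimate_seconds(current) < min_s:
            # Avoid a tiny trailing chunk by folding it into the previous one.
            chunks[-1] = f"{chunks[-1]} {current}"
        else:
            chunks.append(current)
    return chunks
```

You'd then synthesize each chunk separately (with the same voice sample each time) and concatenate the audio. Splitting on sentence boundaries keeps the joins at natural pauses, which should make the seams less audible.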

IMHO Chatterbox is the current open-weight SOTA model in terms of quality: https://huggingface.co/ResembleAI/chatterbox

Thank you, I hadn't heard of it. Will have a look! The samples sound excellent indeed.

You should give https://pinokio.co/ a try.

Thanks I'll try!

Chatterbox is also worth a try.