Recently I've been playing with Chatterbox and the setup is a nightmare. It specifically wants Python 3.11. You have 3.12? TS. Try to pip install it and you'll get an error about pkg-config calling a function that no longer exists, or something like that.
God, I hate Python. Why is it so hard to not break code?
I experienced that recently - just curious, since you're digging into voice synth: which open-source voice synths (specifically text-to-speech) have been working for you? Recently I've tried PiperTTS (I found the voices very flat and accented) and Coqui (in the past - it wasn't great, and it no longer seems to be maintained). I spent a ton of time trying to get Chatterbox to work (on Debian Linux 13) and ultimately couldn't find the right mix of Python versions, libraries, etc. At the moment I'm using AWS Polly and ElevenLabs (and occasionally macOS `say`), but I would love to have an open-source TTS that feels quality and that I can psychologically invest in. Thanks for any perspective you can share.
>I spent a ton of time trying to get Chatterbox to work (on Debian Linux 13)
Exactly my case. I had to move back to Debian from Ubuntu, where I had installed Chatterbox without much difficulty, and it was hell. You pretty much need Anaconda. With it, it's a cinch.
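For anyone hitting the same wall, this is roughly the conda route that worked for me. A minimal sketch: the PyPI package name and the exact version pin are assumptions, so check the project's README for the real install line.

```shell
# Pin Python 3.11 in an isolated env so the system interpreter doesn't matter
conda create -n chatterbox python=3.11 -y
conda activate chatterbox

# Package name assumed to be chatterbox-tts; adjust per the project README
pip install chatterbox-tts
```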
>what are open-source voice synth which have been working for you.
I tried a few, although rather superficially. Keeping in mind that my 3090 is on my main (Windows) machine, I was constrained to what I could get running on it without too much hassle. Considering that:
* I tried Parler for a bit, although I became disillusioned when I learned all models have an output length limit, rather than doing something internally to split the input into chunks. What little I tried with it sounded pretty good if it stayed within the 30-second window, otherwise it became increasingly (and interestingly) garbled.
* Higgs was good. I gave it one of Senator Armstrong's lines and made it generate the "mother of all omelettes" one, and it was believable-ish; not as emphatic but pretty good. But it was rather too big and slow and required too much faffing around with the generation settings.
* Chatterbox is what I finally settled on for my application, which is making audiobooks for myself to listen to during my walks and bike rides. It fits in the 3070 I have on the Linux machine and runs pretty quickly, at ~2.7 seconds of generated audio per second of processing.
These are my notes after many hours of listening to Chatterbox:
* The breathing and pauses sound quite natural, and generally speaking, even with all the flaws I'm about to list, it's pleasing to listen to, provided you have a good sample speaker.
* If you go over the 40-second limit, it handles it somewhat more gracefully than Parler (IMO): instead of generating garbage, it just cuts off abruptly. In my experience, splitting text at 300-350 characters works fairly well, and keeping paragraphs intact where possible gives the best results.
* If the input isn't perfectly punctuated it will guess at the sentence structure to read it with the correct cadence and intonation, but some things can still trip it up. I have one particular text where the writer used commas in many places where a period should have gone, and it just cannot figure out the sentence structure like that.
* The model usually tries to guess emotion from the text content, but it mostly gets it wrong.
* It correctly reads quoted dialogue in the middle of narration by speaking slightly louder. If the text indicates a woman is speaking, the model tries to affect a high pitch, with varying degrees of appropriateness for the context. Honestly, it'd be better if it kept a consistent pitch. And, perplexingly, no matter how much the surrounding text talks about music, it reads "bass" to rhyme with "mass", never like "base".
* Quite often the model inserts weird noises at the beginning and end of a clip which will throw you off until you learn to ignore them. It's worse for short fragments, like chapter titles and the like. Very rarely it inserts what are basically cut-off screams, like imagine a professional voice actor is doing a recording and just before he hit stop someone was murdered inside the booth.
* It basically cannot handle numbers more than two digits long. Even simple stuff like "3:00 AM" it will read as complete nonsense like "threenhundred am".
* It also has problems with words in all caps. It's a toss-up whether it's going to spell the word out, yell it, or something in between. In my particular case, I tried all sorts of things to get it to say "A-unit" (as in a unit with the 'A' designation) properly, but sometimes it still manages to fuck it up and go "ah, ah, ah, ah, ah, ah unit".
* Sometimes it will try to guess the accent it should use based on the grammar. For example, I used a sample from a Lovecraft audiobook, with a British speaker, and the output will sometimes turn Scottish out of nowhere, quite jarringly, if the input uses "ya" for "you" and such.
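On the splitting point above, here's roughly what my chunking looks like. A minimal sketch: the ~330-character budget and the sentence regex are just what worked for me, tune to taste. It keeps whole paragraphs when they fit under the limit and otherwise packs whole sentences into chunks:

```python
import re

def chunk_text(text: str, max_len: int = 330) -> list[str]:
    """Split text into chunks under max_len characters, keeping
    paragraphs intact where possible, falling back to sentences."""
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_len:
            # Whole paragraph fits: keep it as one chunk
            chunks.append(para)
            continue
        # Paragraph too long: split on sentence boundaries and
        # pack as many whole sentences as fit into each chunk
        sentences = re.split(r"(?<=[.!?])\s+", para)
        current = ""
        for sent in sentences:
            if current and len(current) + 1 + len(sent) > max_len:
                chunks.append(current)
                current = sent
            else:
                current = f"{current} {sent}".strip()
        if current:
            chunks.append(current)
    return chunks
```

Feeding each chunk to the model separately keeps everything inside the 40-second window, at the cost of having to stitch the audio back together afterward.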
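For the numbers and all-caps problems, I've found it's easier to rewrite the text before synthesis than to fight the model. A minimal sketch of that idea: the substitution table and the clock-time pattern here are hypothetical examples you'd tune against your own model's misreadings, not an exhaustive fix.

```python
import re

# Hypothetical phonetic workarounds for specific misreadings
SUBSTITUTIONS = {
    r"\bA-unit\b": "ay unit",  # avoid the "ah, ah, ah" stutter on 'A'
}

HOUR_WORDS = ["twelve", "one", "two", "three", "four", "five", "six",
              "seven", "eight", "nine", "ten", "eleven"]

def normalize_for_tts(text: str) -> str:
    """Rewrite clock times like '3:00 AM' into words the model can read,
    then apply per-word phonetic substitutions."""
    def time_to_words(m: re.Match) -> str:
        hour = HOUR_WORDS[int(m.group(1)) % 12]
        return f"{hour} {m.group(2)}"
    # Only handles on-the-hour times; extend for minutes as needed
    text = re.sub(r"\b(\d{1,2}):00\s*(AM|PM)\b", time_to_words, text)
    for pattern, replacement in SUBSTITUTIONS.items():
        text = re.sub(pattern, replacement, text)
    return text
```

The same table is a reasonable place to pin down anything else the model consistently gets wrong, like forcing "bass" to "base" in a music-heavy text.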
Thank you - this is helpful. I didn't realize how much I was going to value consistency over voice quality, but when you've got to go back and listen to everything for quality control ... I guess that's the drawback of this phase of "generative" voice synth.
Yeah, in that way it's a lot like image generation. Maybe a single output is good in isolation, but if you want to generate a series maintaining some kind of consistent style, it's very much like a lottery. The models don't have dials to control emphasis, cadence, emotiveness, accent, etc., so they guess from the content. For example, imagine a serious scene that calls for a somber tone, but then one of the characters makes a dark or ironic joke. A human would maintain the same reading voice, but these models would instead switch to a much more chipper register for that one line.