Thank you - this is helpful. I didn't realize how important I was going to value consistency over quality voice, but then when you've got to go back and listen to everything for quality control ... I guess that is the drawback of this phase of "generative" voice synth.

Yeah, in that way it's a lot like image generation. Maybe a single output is good in isolation, but if you want to generate a series maintaining some kind of consistent style, it's very much like a lottery. The models don't have dials to control emphasis, cadence, emotiveness, accent, etc., so they guess from the content. For example, imagine a serious scene that calls for a somber tone, but then one of the characters makes a dark or ironic joke. A human would maintain the same reading voice, but these models would instead switch to a much more chipper register for that one line.