I'm curious why smallish TTS models have a metallic voice quality.
The pronunciation sounds about right. I thought that was the hard part, and the model does it well. But shouldn't voice timbre be simpler to fix? Like, might a simple FIR filter improve it?
The "metallic" quality is probably due to a lack of detail, so it cannot be fixed that easily.
We change our tone based on personal style, emotion, context, and other factors. An accurate generator might need to encode all of that information, which would make it larger than a model that doesn't.
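For what it's worth, here is a rough sketch of what the "simple FIR" idea would look like, in plain Python with no audio libraries. The helper names (`lowpass_fir`, `apply_fir`, `tone`, `rms`) and the 16 kHz sample rate and 4 kHz cutoff are made up for illustration and are not taken from any TTS model. A FIR filter is linear and time-invariant, so it can only rescale and phase-shift frequencies that are already in the signal. If the metallic sound really comes from missing detail rather than a tilted spectrum, a filter like this would not put that detail back.

```python
import math

def lowpass_fir(num_taps, cutoff_hz, sample_rate):
    """Windowed-sinc low-pass FIR (Hamming window), normalized to unity gain at DC."""
    fc = cutoff_hz / sample_rate          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2              # center of the impulse response
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (sinc), with the x == 0 limit handled.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def apply_fir(signal, taps):
    """Direct-form convolution; output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * signal[i - k]
        out.append(acc)
    return out

def tone(freq_hz, sample_rate, n):
    """Pure sine test tone, n samples long."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def rms(x):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

if __name__ == "__main__":
    sr = 16000                            # illustrative sample rate
    taps = lowpass_fir(101, 4000, sr)     # illustrative cutoff
    for f in (500, 7000):
        y = apply_fir(tone(f, sr, 1000), taps)
        # Skip the first len(taps) samples, where the filter is still filling up.
        print(f, "Hz in, RMS out:", rms(y[len(taps):]))
```

Running it on the two test tones shows the low tone passing through and the high one being attenuated: the filter changes the balance between existing frequencies, and that is all it does.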