Currently running some finetuning experiments on non-verbal sounds to teach TTS how to laugh. I have had some success to add the necessary tags and tokens to multiple systems, but assembling the necessary dataset with sufficient quality is hard.
Currently running some finetuning experiments on non-verbal sounds to teach TTS how to laugh. I have had some success to add the necessary tags and tokens to multiple systems, but assembling the necessary dataset with sufficient quality is hard.