The issue is even bigger: phonemizer uses espeak-ng, which isn't very good at turning graphemes into phonemes. In other phoneme-based TTS models (e.g. Zonos), this turned out to be one of the key causes of bad generations.
And it isn't something you can fix after the fact, because the model was trained on those bad phonemes (everyone transcribes the audio with Whisper and then phonemizes the text transcript).
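
For anyone who hasn't seen this data-prep step, here is a rough sketch of what it usually looks like, not any particular project's code. The helper names `espeak_g2p` and `build_training_pairs` are made up for illustration, the transcripts are assumed to come from Whisper already, and the espeak path needs both the `phonemizer` package and espeak-ng installed.

```python
from typing import Callable, Iterable, List, Tuple


def espeak_g2p(text: str) -> str:
    """Grapheme-to-phoneme via phonemizer's espeak backend.

    Imported lazily so the rest of the sketch runs without
    phonemizer/espeak-ng installed.
    """
    from phonemizer import phonemize

    return phonemize(text, language="en-us", backend="espeak", strip=True)


def build_training_pairs(
    transcripts: Iterable[Tuple[str, str]],
    g2p: Callable[[str], str] = espeak_g2p,
) -> List[Tuple[str, str]]:
    """Turn (audio_path, transcript) pairs into (audio_path, phonemes) pairs.

    Whatever the g2p function outputs becomes the training target,
    so any G2P mistake is baked into the dataset.
    """
    return [(audio, g2p(text)) for audio, text in transcripts]


if __name__ == "__main__":
    # Stand-in G2P so the example runs without espeak-ng;
    # swap in espeak_g2p to reproduce the real behaviour.
    fake_g2p = lambda text: text.lower()
    print(build_training_pairs([("clip_001.wav", "Hello World")], g2p=fake_g2p))
```

Passing the G2P step in as a function makes it easy to try a different phonemizer on the same transcripts, but that only helps if you retrain on the new output.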