Speech speed is always a tunable parameter and not something intrinsic to the model.

The comparison to make is expressiveness and correct intonation on long sentences vs something like espeak. It actually sounds amazing for its size. The closest thing is probably KokoroTTS at 82M params and ~300MB.

I think he meant the overacting typical of English dubs.

The voices sound artificial and a bit grating. The male voices in particular lack depth: only the last voice has any at all, while the others sound like teenagers who haven't finished puberty. None of the voices sounds quite human, and they're all very annoying, partly because they sound like they're acting.

I heard a little D.Va from Overwatch.