Kokoro is fine for TTS, but it lacks emotion. For a model of this size, though, that is kind of a given.

I've played with ebook generation a bunch and found that (at least for English text) around 1B parameters is needed to get something emotionally usable (Chatterbox is 0.5B, Orpheus is 3B).

Ironic given the name: kokoro is Japanese for heart or sentiment.