I tried it. Not bad for the size (of the model) and speed. Once you install all the massive number of libraries and things needed we are a far cry away from 25MB though. Cool project nonetheless.
That's a great point about the dependencies.
To make the setup easier and add a few features people are asking for here (like GPU support and long text handling), I built a self-hosted server for this model: https://github.com/devnen/Kitten-TTS-Server
The goal was a setup that "just works" using a standard Python virtual environment to avoid dependency conflicts.
The setup is just the standard git clone, pip install in a venv, and python server.py.
Oh wow, really impressive. How long did this take you to make?
It didn't take too long. I already have two similar projects for the Dia and Chatterbox TTS models, so I just needed to convert a few files.
It mentions ONNX, so I imagine an ONNX model is or will be available.
ONNX runtime is a single library, with C#'s package being ~115MB compressed.
Not tiny, but usually only a few lines to actually run and only a single dependency.
The repository already runs an ONNX model. But the ONNX model doesn't take English text as input; it takes tokenized phonemes. The preprocessing for that is where most of the dependencies come from.
Which is completely reasonable imho, but obviously comes with tradeoffs.
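To make that split concrete, here is a toy sketch of what that kind of preprocessing pipeline looks like in shape: text is phonemized, then phonemes are mapped to integer IDs that the ONNX session consumes. All the tables and names below are made up for illustration; the real project uses an actual phonemizer and the model's own vocabulary.

```python
# Hypothetical sketch of the preprocessing in front of the ONNX model:
# English text -> phonemes -> integer token IDs. The phoneme strings and
# ID table here are invented; a real pipeline uses a phonemizer library.

# Toy grapheme-to-phoneme table standing in for a real phonemizer.
G2P = {"hello": "h ə l oʊ", "world": "w ɜː l d"}

# Toy phoneme-to-ID vocabulary standing in for the model's tokenizer.
VOCAB = {p: i for i, p in enumerate(["h", "ə", "l", "oʊ", "w", "ɜː", "d"])}

def text_to_ids(text: str) -> list[int]:
    """Phonemize each word, then map every phoneme to its integer ID."""
    phonemes = " ".join(G2P[w] for w in text.lower().split())
    return [VOCAB[p] for p in phonemes.split(" ")]

ids = text_to_ids("hello world")
# These IDs (plus whatever padding/speaker inputs the model expects) are
# what would be fed to the ONNX session, not the raw English text.
```

The heavy dependencies live entirely in the first step (the phonemizer and its backend); the ONNX runtime itself only ever sees the integer IDs.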
For space sensitive applications like embedded systems, could you shift the preprocessing to compile time?
You would need to constrain the vocabulary to see any benefit, but that could be reasonable. For example, an enumeration of numbers, units, and metric names could handle dynamic time, temperature, and other dashboard items.
For something more complex like offline navigation, you already need to store a map. You could store street names as tokens instead of text. Add a few turn commands, and you have offline spoken directions without on-device pre-processing.
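One way to sketch that idea (all names and ID values below are hypothetical): run the full preprocessing pipeline at build time over the fixed vocabulary, ship only a table of precomputed token IDs, and have the device simply concatenate sequences at runtime.

```python
# Hypothetical sketch: token IDs precomputed at build time for a fixed
# vocabulary, so the device needs no phonemizer at runtime. The ID
# sequences are invented; a real build step would generate them with
# the same preprocessing pipeline the server uses.

# "Compile-time" table: phrase -> precomputed phoneme token IDs.
PRECOMPUTED = {
    "turn": [11, 4, 27],
    "left": [19, 8, 30, 2],
    "right": [14, 22, 2],
    "onto": [5, 9, 2, 16],
}

PAUSE = [0]  # token standing in for a short inter-word pause

def directions_to_ids(words: list[str]) -> list[int]:
    """Concatenate precomputed sequences; no on-device preprocessing."""
    ids: list[int] = []
    for w in words:
        ids.extend(PRECOMPUTED[w])
        ids.extend(PAUSE)
    return ids

# e.g. directions_to_ids(["turn", "left"]) yields the stored sequences
# for "turn" and "left" joined by pause tokens, ready for the model.
```

The tradeoff is exactly the one mentioned above: anything outside the precomputed table simply cannot be spoken, which is fine for dashboards and turn-by-turn prompts but not for open-ended text.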
We will try to get rid of dependencies.
Usually pulling in lots of libraries helps you develop and iterate faster. They can be removed later once the whole thing starts to take shape.
This case might be different, but ... usually that "later" never happens.