I ran some quick benchmarks.
Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX
Performance Results:
Initial Latency: ~315ms for short text
Audio Generation Speed (seconds of audio per second of processing):
- Short text (12 chars): 3.35x realtime
- Medium text (100 chars): 5.34x realtime
- Long text (225 chars): 5.46x realtime
- Very long text (306 chars): 5.50x realtime
Findings:
- Model loads in ~710ms
- Generates audio at ~5x realtime speed (excluding initial latency)
- Performance is consistent across different voices (4.63x - 5.28x realtime)
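For anyone who wants to reproduce these numbers, here is a rough sketch of how the "x realtime" figure can be computed: seconds of audio produced divided by seconds of processing. The `synthesize` callable and the 24 kHz sample rate are assumptions for illustration only, not the library's actual API. Swap in whatever call returns the generated samples on your setup.

```python
import time
from typing import Callable, Sequence


def realtime_factor(num_samples: int, sample_rate: int, elapsed_s: float) -> float:
    """Seconds of audio produced per second of processing."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / elapsed_s


def benchmark(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical: text -> audio samples
    text: str,
    sample_rate: int,  # assumed output rate of the model, e.g. 24000
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one synthesis call and return its realtime factor."""
    start = clock()
    samples = synthesize(text)
    elapsed = clock() - start
    return realtime_factor(len(samples), sample_rate, elapsed)
```

Note that a single timed call includes any first-call warm-up. To get a figure that excludes initial latency, run one untimed call first and then average several timed runs.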
Thanks for running the benchmarks. The models are not optimized yet. We will optimize loading and other steps when we release an SDK meant for production :)
On my Intel(R) Celeron(R) N4020 CPU @ 1.10GHz, it takes 6 seconds to import/load, and generation is roughly 1x realtime across various lengths of text.
Thanks for testing on the same hardware as mine before I got to it.