I ran some quick benchmarks.

Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX

  Performance Results:

  Initial Latency: ~315ms for short text

  Audio Generation Speed (seconds of audio per second of processing):
  - Short text (12 chars): 3.35x realtime
  - Medium text (100 chars): 5.34x realtime
  - Long text (225 chars): 5.46x realtime
  - Very Long text (306 chars): 5.50x realtime

  Findings:
  - Model loads in ~710ms
  - Generates audio at ~5x realtime speed (excluding initial latency)
  - Performance is consistent across different voices (4.63x - 5.28x realtime)
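
For anyone who wants to reproduce the "x realtime" numbers on their own hardware, here is a minimal sketch of how such a figure can be computed: seconds of audio produced divided by seconds of processing. The `synthesize` callable, the `fake_synthesize` stand-in, and the 24000 Hz sample rate are all assumptions for illustration, not the project's actual API; swap in the real model call and its real output sample rate.

```python
import time
from typing import Callable, Sequence, Tuple


def realtime_factor(num_samples: int, sample_rate: int, processing_seconds: float) -> float:
    """Seconds of audio produced per second of processing."""
    audio_seconds = num_samples / sample_rate
    return audio_seconds / processing_seconds


def benchmark(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> Tuple[float, float]:
    """Time one synthesis call; return (elapsed seconds, realtime factor)."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed, realtime_factor(len(samples), sample_rate, elapsed)


def fake_synthesize(text: str) -> Sequence[float]:
    """Hypothetical stand-in for a real TTS call: sleeps briefly, returns silence."""
    time.sleep(0.01)
    return [0.0] * (len(text) * 1000)


SAMPLE_RATE = 24000  # assumed value; use the model's actual output rate

for text in ["Hello there.", "A somewhat longer sentence to synthesize. " * 3]:
    elapsed, rtf = benchmark(fake_synthesize, text, SAMPLE_RATE)
    print(f"{len(text)} chars: {elapsed * 1000:.0f} ms, {rtf:.2f}x realtime")
```

Timing the first call separately from later ones helps keep one-off warm-up cost (the initial latency above) out of the steady-state figure.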

Thanks for running the benchmarks. The models are not optimized yet. We will optimize loading, etc. when we release an SDK meant for production :)

On my Intel(R) Celeron(R) N4020 CPU @ 1.10GHz, it takes 6 seconds to import/load, and generation runs at roughly 1x realtime across various text lengths.

Thanks for testing on the same hardware as mine before I got to it.