I don't mind the size in MB, the pure-CPU requirement, or the quality so much; what I do mind is the latency. I hope it's fast.

Aside: Are there any models for understanding voice to text, fully offline, without training?

I will be very impressed when we're able to have a conversation with an AI at a natural pace, rather than "prompt, pause, response".

Nvidia's Parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 appears to be state of the art for English: 10x faster than Whisper.

My mid-range AMD CPU is multiple times faster than realtime with parakeet.
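
For reference, the model card runs it through NVIDIA's NeMo toolkit; something like this sketch (the wav path is a placeholder):

    # pip install -U "nemo_toolkit[asr]"
    import nemo.collections.asr as nemo_asr

    # Pulls the checkpoint from Hugging Face on first run
    asr_model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v2"
    )

    # "audio.wav" is a placeholder; 16 kHz mono WAV works best
    output = asr_model.transcribe(["audio.wav"])
    print(output[0].text)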

>Aside: Are there any models for understanding voice to text, fully offline, without training?

OpenAI's whisper is a few years old and pretty solid.

https://github.com/openai/whisper
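
It runs fully offline once the weights are cached. Minimal usage, with the audio path as a placeholder:

    # pip install -U openai-whisper  (ffmpeg must be on PATH)
    import whisper

    model = whisper.load_model("base")      # "small"/"medium" trade speed for accuracy
    result = model.transcribe("audio.wav")  # placeholder path
    print(result["text"])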

Whisper tends to fill silence with random garbage from its training set. [0] [1] [2]

[0]: https://github.com/openai/whisper/discussions/679

[1]: https://github.com/openai/whisper/discussions/928

[2]: https://github.com/openai/whisper/discussions/2608
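
A common workaround is to filter on the per-segment confidence fields whisper already returns; a rough sketch, where the thresholds are arbitrary starting points rather than tuned values:

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("audio.wav")  # placeholder path

    # Drop segments whisper itself suspects are silence/noise
    kept = [
        seg["text"]
        for seg in result["segments"]
        if seg["no_speech_prob"] < 0.6 and seg["avg_logprob"] > -1.0
    ]
    print("".join(kept))

Running a VAD in front of whisper, as whisper.cpp and several wrappers do, is the heavier but more reliable fix.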

Fully offline voice-to-text can be done with whisper; a few apps build on it for dictation or transcription.

"The brown fox jumps over the lazy dog.."

Average duration per generation: 1.28 seconds

Characters processed per second: 30.35

--

"Um"

Average duration per generation: 0.22 seconds

Characters processed per second: 9.23

--

"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."

Average duration per generation: 2.25 seconds

Characters processed per second: 35.04

--

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 80
model name      : AMD Ryzen 7 5800H with Radeon Graphics
stepping        : 0
microcode       : 0xa50000c
cpu MHz         : 1397.397
cache size      : 512 KB
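
For context, numbers like these fall out of a timing loop like the one below; synthesize() is a stand-in for whatever generate call the model exposes, not a real API:

    import time

    def benchmark(synthesize, text, runs=10):
        # synthesize is a placeholder for the model's generate call
        start = time.perf_counter()
        for _ in range(runs):
            synthesize(text)
        avg = (time.perf_counter() - start) / runs
        print(f'"{text}"')
        print(f"Average duration per generation: {avg:.2f} seconds")
        print(f"Characters processed per second: {len(text) / avg:.2f}")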

Hmm, that actually seems extremely slow. Piper can crank out a sentence almost instantly on a Pi 4, which is a sloth compared to that Ryzen, and the speech quality seems about the same at first glance.

I suppose it would make sense if you want to stack it on top of an LLM that's already occupying most of a GPU, since this could run in the limited VRAM that's left.
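
For comparison, Piper is typically driven as a CLI that reads text on stdin; roughly this, with the voice file as a placeholder and flag names as they appear in Piper's README:

    import subprocess

    # Pipe a sentence into the piper binary and write a wav
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "out.wav"],
        input=b"The brown fox jumps over the lazy dog.",
        check=True,
    )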

Assuming most answers will run longer than a sentence, 2.25 seconds is already a long wait once you factor in the token generation in between... and imagine with reasoning! We're not there yet.

Any idea what factors play into latency in TTS models?

Mostly model size and input size. Some models that use attention scale as O(N^2) in the input length.
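
That quadratic term is why a common trick is to split long inputs at sentence boundaries and synthesize chunks independently: k chunks of length N/k cost roughly N^2/k instead of N^2, and playback can start after the first chunk. A sketch, with synthesize() again a placeholder:

    import re

    def stream_tts(synthesize, text):
        # Yield audio sentence by sentence so playback can start early
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if sentence:
                yield synthesize(sentence)  # placeholder model call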