I don't mind the size in MB, the fact that it's pure CPU, or the quality so much; what I do mind is the latency. I hope it's fast.
Aside: Are there any voice-to-text models that work fully offline, without training?
I will be very impressed when we can have a conversation with an AI at a natural rate, not "probe, space, response".
Nvidia's Parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 appears to be state of the art for English: 10x faster than Whisper.
My mid-range AMD CPU is multiple times faster than realtime with Parakeet.
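For reference, usage per the model card is just a couple of lines via the NeMo toolkit (assuming nemo_toolkit[asr] is installed; "sample.wav" is a placeholder for your own recording, and the return type of transcribe() varies across NeMo versions):

    import nemo.collections.asr as nemo_asr

    # Downloads the checkpoint from Hugging Face on first run, cached after that
    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v2"
    )

    # "sample.wav" stands in for a local 16 kHz mono recording
    outputs = model.transcribe(["sample.wav"])
    print(outputs[0])  # newer NeMo returns Hypothesis objects; use outputs[0].text there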
>Aside: Are there any voice-to-text models that work fully offline, without training?
OpenAI's Whisper is a few years old and pretty solid.
https://github.com/openai/whisper
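Basic offline usage follows the README; the weights download once and everything after that runs locally:

    import whisper

    # tiny/base/small/medium/large trade speed for accuracy
    model = whisper.load_model("base")
    result = model.transcribe("audio.mp3")
    print(result["text"])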
Whisper tends to fill silence with random garbage from its training set. [0] [1] [2]
[0]: https://github.com/openai/whisper/discussions/679
[1]: https://github.com/openai/whisper/discussions/928
[2]: https://github.com/openai/whisper/discussions/2608
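The usual mitigations from those threads are disabling cross-window conditioning and tightening the silence thresholds. These are real transcribe() parameters in openai-whisper, but the specific values below are guesses you'd tune per recording:

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe(
        "audio.wav",
        condition_on_previous_text=False,  # stops hallucinated text propagating between windows
        no_speech_threshold=0.4,           # default 0.6; lower skips "silent" windows more eagerly
        logprob_threshold=-0.5,            # default -1.0; raising it discards more low-confidence output
    )
    print(result["text"])

Running an external VAD first and only feeding Whisper the speech segments works even better, since the model never sees the silence at all.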
Voice-to-text fully offline can be done with Whisper. A few apps offer it for dictation or transcription.
"The brown fox jumps over the lazy dog.."
Average duration per generation: 1.28 seconds
Characters processed per second: 30.35
--
"Um"
Average duration per generation: 0.22 seconds
Characters processed per second: 9.23
--
"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."
Average duration per generation: 2.25 seconds
Characters processed per second: 35.04
--
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 80
model name : AMD Ryzen 7 5800H with Radeon Graphics
stepping : 0
microcode : 0xa50000c
cpu MHz : 1397.397
cache size : 512 KB
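(For context on how numbers like these are produced: a minimal timing harness, with synthesize as a hypothetical stand-in for the TTS call being measured:)

    import statistics
    import time

    def benchmark(text, synthesize, runs=5):
        # synthesize is a hypothetical stand-in for the model call under test
        durations = []
        for _ in range(runs):
            start = time.perf_counter()
            synthesize(text)
            durations.append(time.perf_counter() - start)
        avg = statistics.mean(durations)
        print(f'"{text}"')
        print(f"Average duration per generation: {avg:.2f} seconds")
        print(f"Characters processed per second: {len(text) / avg:.2f}")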
Hmm, that actually seems extremely slow. Piper can crank out a sentence almost instantly on a Pi 4, which is like a sloth compared to that Ryzen, and the speech quality seems about the same at first glance.
I suppose it would make sense if you want to layer it on top of an LLM that's already occupying most of a GPU, since this could run in the limited VRAM that's left.
Assuming most answers will be more than a sentence, 2.25 seconds is already too long once you factor in the token generation in between... and imagine with reasoning! We're not there yet.
Any idea what factors play into latency in TTS models?
Mostly model size and input size. Some models that use attention are O(N^2) in the input length.
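One practical way to blunt the O(N^2) term is to split the input into sentences and synthesize incrementally, so attention cost scales with sentence length and the first audio chunk arrives sooner. A rough sketch, with synthesize again a hypothetical per-chunk TTS call:

    import re

    def stream_tts(text, synthesize):
        # Naive sentence split; a production system would use a real tokenizer
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        for sentence in sentences:
            if sentence:
                # Attention now costs O(len(sentence)^2) per chunk
                # instead of O(len(text)^2) for the whole reply
                yield synthesize(sentence)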