Hi HN! Voicelab is an API for optimized inference of top open-source TTS models. CSM-1B and Orpheus are supported today, and we're adding Dia, Chatterbox, Kokoro, and more in the next couple of weeks.
While new ultra-realistic open-source voice models come out every month, most people still use one of a handful of closed-source providers. The reason is that these research previews often aren't production-ready: their inference stacks usually don't scale well (e.g. only one concurrent stream per GPU), and the public weights can generate speech of inconsistent quality.
We solved this in two ways: by building serving infrastructure optimized for audio transformers (making scalable inference faster and more cost-effective), and by post-training the public weights on voice actors, phone calls, and other privately collected audio to make generation quality more consistent.
Open-source voice is becoming exciting, and our hope is to provide a high-quality, scalable inference layer for all of the rich research these teams are putting out. Feedback is much appreciated :)
Docs: docs.vogent.ai
Playground: app.vogent.ai