Congrats on the results. The streaming aspect is what I find most exciting here.
I built a macOS dictation app (https://github.com/T0mSIlver/localvoxtral) on top of Voxtral Realtime, and the UX difference between streaming and offline STT is night and day. Words appearing while you're still talking completely changes the feedback loop. You catch errors in real time, you can adjust what you're saying mid-sentence, and the whole thing feels more natural. Going back to "record then wait" feels broken after that.
Curious how Moonshine's streaming latency compares in practice. Do you have numbers on time-to-first-token for the streaming mode? And on the serving side, do any of the integration options expose an OpenAI Realtime-compatible WebSocket endpoint?
My app uses the moonshine-voice Python package, so you can try it yourself here: https://rift-transcription.vercel.app/local-setup
I made Moonshine the default because it has the best accuracy/latency tradeoff (aside from the Web Speech API, which isn't fully local).
I plan to add objective benchmarks in the future, so multiple models can be compared against the same audio data...
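For what it's worth, the usual objective metric here is word error rate over a shared audio set. A stdlib-only sketch (the whitespace tokenization is naive; a real benchmark would also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table over hypothesis words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            # deletion, insertion, substitution/match
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (ref[i - 1] != hyp[j - 1]))
            prev = cur
    return d[-1] / max(len(ref), 1)

print(wer("hello world", "hello world"))  # 0.0
print(wer("a b c d", "a x c d"))          # 0.25
```

With that in place, comparing models is just averaging `wer(reference, model_output)` over the same clips for each model.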
---
I made a custom WebSocket server for my project. It defines its own API (modeled on the Sherpa-onnx API), but you could adapt it to emit OpenAI Realtime-compatible messages: https://github.com/Leftium/rift-local
(Note: rift-local is designed around a single connection; it isn't optimized to handle multiple concurrent WS connections.)
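Adapting it mostly means reshaping each partial and final transcript into the corresponding Realtime event JSON. A rough sketch of the message shapes (event names are my reading of OpenAI's Realtime transcription events; `item_id` handling is simplified, and the actual STT call is left out):

```python
import json

def delta_event(item_id: str, delta: str) -> str:
    """Partial-transcript event in OpenAI Realtime style."""
    return json.dumps({
        "type": "conversation.item.input_audio_transcription.delta",
        "item_id": item_id,
        "delta": delta,
    })

def completed_event(item_id: str, transcript: str) -> str:
    """Final-transcript event, sent once an utterance is finished."""
    return json.dumps({
        "type": "conversation.item.input_audio_transcription.completed",
        "item_id": item_id,
        "transcript": transcript,
    })

# Example: stream two partial results, then the final transcript.
print(delta_event("item_0", "hello "))
print(delta_event("item_0", "world"))
print(completed_event("item_0", "hello world"))
```

The server side would just call these as the model emits partials, so any Realtime-compatible client could consume the stream unchanged.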