My app uses the moonshine-voice Python package, so you can try it yourself here: https://rift-transcription.vercel.app/local-setup
I made Moonshine the default because it offers the best accuracy/latency trade-off (aside from the Web Speech API, which is not fully local).
I plan to add objective benchmarks in the future, so multiple models can be compared against the same audio data...
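For comparing models against the same audio data, word error rate (WER) is the usual metric. Here is a minimal sketch of how it could be computed (a standalone illustration, not code from any of the projects mentioned):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Running each model over a fixed set of clips and averaging `wer(reference, model_output)` gives a single comparable number per model; latency would be measured separately.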
---
I made a custom WebSocket server for my project. It defines its own API (modeled on the sherpa-onnx API), but you could adapt it to emit messages in the OpenAI Realtime API format: https://github.com/Leftium/rift-local
(Note: rift-local is designed for a single connection at a time; it is not optimized to handle multiple concurrent WebSocket connections.)
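Since the server defines its own API, here is a rough idea of what a JSON message protocol for streaming transcription could look like. This is purely illustrative; the message names and fields are invented, not the actual rift-local or sherpa-onnx wire format:

```python
import json

def encode_audio_chunk(samples: bytes, sample_rate: int = 16000) -> str:
    """Client -> server: one chunk of raw PCM audio, hex-encoded so it
    fits in a JSON text frame. (Hypothetical message shape.)"""
    return json.dumps({
        "type": "audio",
        "sample_rate": sample_rate,
        "pcm_hex": samples.hex(),
    })

def decode_result(message: str) -> tuple[str, bool]:
    """Server -> client: a partial or final transcript.
    Returns (text, is_final). (Hypothetical message shape.)"""
    msg = json.loads(message)
    return msg["text"], msg.get("is_final", False)
```

A client would then send `encode_audio_chunk(...)` frames as audio arrives and feed each incoming frame through `decode_result`, replacing the displayed text on partials and committing it when `is_final` is true.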