So I use a VAD ONNX model (Silero [1]) to automatically detect when someone is talking, and then send that audio to one of the speech recognition engines.
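To make the gating step concrete, here is a minimal sketch of turning per-frame speech probabilities into utterance segments. It assumes the VAD has already produced one probability per audio frame; the function name, the 0.5 threshold, and the 3-frame silence window are illustrative choices, not values taken from Silero or from the game.

```python
from typing import Iterable, List, Tuple


def speech_segments(
    probs: Iterable[float],
    threshold: float = 0.5,        # assumed cutoff, tune per model
    min_silence_frames: int = 3,   # assumed hang-over before closing a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, that contain speech."""
    segments: List[Tuple[int, int]] = []
    start = None   # index of the first frame of the open segment, if any
    silence = 0    # consecutive non-speech frames seen inside the open segment
    count = 0      # total frames consumed so far

    for i, p in enumerate(probs):
        count = i + 1
        if p >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # Close the segment at the last speech frame.
                segments.append((start, i - silence + 1))
                start = None
                silence = 0

    if start is not None:
        # Input ended mid-utterance: drop any trailing silence.
        segments.append((start, count - silence))
    return segments
```

Each returned range would then be sliced out of the audio buffer and handed to the recognizer, so short pauses inside "c takes d5" don't split the phrase.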
I originally tried to get away with just Whisper Tiny in the chess game [2], but it struggles with the kinds of short phrases (knight E4, c takes d5, etc.) used to dictate chess notation. Even with hotword-based phrasing and corrections, its accuracy on such brief inputs remained noticeably poor. So I switched over to a Sherpa [3] model trained on GigaSpeech. It’s significantly more accurate, but it also has a correspondingly larger memory footprint.
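For a sense of what "corrections" on these short phrases can look like, here is a rough sketch that maps a recognized phrase to a compact move string. The word tables and the `to_notation` helper are hypothetical and deliberately tiny; this is not the game's actual correction logic, and it does not validate that the move is legal.

```python
import re

# Hypothetical fix-ups for words a recognizer might plausibly confuse.
CORRECTIONS = {"night": "knight"}

# Spoken piece names to their standard letters.
PIECES = {"king": "K", "queen": "Q", "rook": "R", "bishop": "B", "knight": "N"}

# Spoken capture words.
CAPTURES = {"takes", "captures"}


def to_notation(phrase: str) -> str:
    """Convert a phrase like 'knight E4' or 'c takes d5' to 'Ne4' or 'cxd5'."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    out = []
    for token in tokens:
        token = CORRECTIONS.get(token, token)
        if token in PIECES:
            out.append(PIECES[token])
        elif token in CAPTURES:
            out.append("x")
        else:
            # Files, ranks, and squares pass through unchanged.
            out.append(token)
    return "".join(out)
```

A real version would check the result against the legal moves in the current position, which also helps reject misrecognized input.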
Ideally, I would have used just one engine, but I needed a fallback for iOS devices (especially older ones) which can easily OOM.
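The fallback idea can be sketched as picking the first engine, in order of preference, that fits the memory the device can spare. The `Engine` type, the `pick_engine` helper, and the memory figures in the usage note are all made up for illustration; they are not measurements of Sherpa or Whisper Tiny.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Engine:
    name: str
    approx_memory_mb: int  # illustrative estimate, not a measured value


def pick_engine(
    engines: Sequence[Engine],
    available_mb: int,
    headroom_mb: int = 0,
) -> Optional[Engine]:
    """Return the first engine (most preferred first) that fits in memory.

    headroom_mb is memory to leave free for the rest of the app.
    Returns None when nothing fits.
    """
    budget = available_mb - headroom_mb
    for engine in engines:
        if engine.approx_memory_mb <= budget:
            return engine
    return None
```

For example, with a preference list of `[Engine("accurate", 400), Engine("small", 100)]`, a device with 1000 MB to spare gets the accurate engine, one with 200 MB falls back to the small one, and one with 50 MB gets `None` and would have to disable voice input.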
[1] - https://github.com/snakers4/silero-vad