This is really impressive for running locally on an M3 Pro. The latency looks surprisingly good for real-time audio and video input.
Curious about one thing, though: how does it handle switching between languages? I work with both Greek and English daily, and local models usually struggle with that.
Great work, bookmarking this.
In my limited testing, it handles multiple languages in a single session better than I expected. Perhaps I just had low expectations, since I've mostly worked with English-only STT models.