Not to promote something, but Wispr Flow does that for me automatically if I trigger a setting for it..
While it's a commercial product with a subscription, I spent a long time on the free tier not even hitting their limits until I started using it so extensively that I wanted to pay for it.
And I've used Whisper in the past, mostly for tinkering. I tried it for a couple of use cases but haven't touched the base project in a while. But I do regularly use Faster-Whisper-XXL, an open source project based on Whisper, for subtitle generation.
Though, for subtitle generation, I decided to support the project and mainly use the non-public build of Faster-Whisper-XXL Pro built for donators to the open source project.
The extra features smooth out the subtitle editing process very substantially. Toss in "--roformer_overlap 0.125 --roformer_vram 16 --best_of 15 --ff_vocal_extract mb-roformer --vad_method pyannote_v3" to the cli parameters (and sometimes --realign) and you have much less work to do in SubtitleEdit or Tero Subtitler afterwards to clean it up.
Surprisingly, it's the whisper model itself that does that. I find that it's also good with false starts, often correcting something like: "uhm, we could...we can go there" to just "we can go there", if spoken rapidly enough.
Is love to hear more about subtitle generation. Specifically, can you label different speakers? I'd be using this for meeting transcription. Thank you.