Hot take: I think all these dictation tools are solving the wrong problem. They're optimizing for accurate transcription (and latency) when users actually need intelligent interpretation. For example, people don't speak in perfect emails; they speak in scattered thoughts and intentions that require contextual understanding.
I totally agree with this hot take. Whispering is not there yet, but I eventually want it to store as many of the transcripts as possible as plain-text Markdown, alongside your audio files, in a folder.
The idea is that as we add more local-first apps into the ecosystem (writing, etc.), they'll share this context. Transcription would benefit immensely if you also had a writing app that you could trust to store your data. To execute that vision, we needed a transcription app where we have control over how data is stored, and the best solution was to build our own.
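To make the plain-text idea concrete, here is a rough sketch in Python of what "Markdown next to the audio file" could look like. This is not how Whispering works today; the function names (`save_transcript`, `load_context`) and the one-`.md`-per-recording layout are assumptions made purely for illustration.

```python
from pathlib import Path


def save_transcript(audio_path: Path, text: str) -> Path:
    """Write the transcript as Markdown next to the audio file (hypothetical layout)."""
    # memo-001.wav -> memo-001.md, in the same folder
    md_path = audio_path.with_suffix(".md")
    md_path.write_text(f"# {audio_path.stem}\n\n{text.strip()}\n", encoding="utf-8")
    return md_path


def load_context(folder: Path) -> dict[str, str]:
    """Collect every transcript in the folder so another local app could read it."""
    return {p.stem: p.read_text(encoding="utf-8") for p in sorted(folder.glob("*.md"))}
```

Because everything is just files in a folder, a separate writing app could call something like `load_context` on the same directory and get the shared context without any sync service in between.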
Doesn't an accurate transcription make it easier to reach understanding?