I've been interested in dictation for a while, but I don't want to send any audio to a remote API; it all has to be local. Having tried just a couple of models (namely the one used by the FUTO Keyboard), I get the feeling we're not quite there yet.

My biggest gripe is perhaps not being able to get decent content out of a stream of thought: the models can't properly filter out the pauses and "uhms", much less handle on-the-fly corrections to what I've been saying, like going back and repeating something with a slight variation.

This is a challenging problem I'd love to see tackled well by open models I can run on my computer or phone. Are there newer models that are more capable of this? Or is it not just a model thing, and am I missing a good app too?

In the meantime, I'll keep typing, even though it can be quite a bit less convenient, especially for note-taking on the go.

Have you tried Whisper itself? It's open-weights.
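For reference, a minimal local setup with the open-source whisper package looks roughly like this (the model size and file name are just placeholders):

    # pip install openai-whisper  (also needs ffmpeg on your PATH)
    import whisper

    # "base" is a placeholder; "small"/"medium" are more accurate but slower.
    model = whisper.load_model("base")
    result = model.transcribe("note.m4a")  # placeholder audio file
    print(result["text"])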

One of the features of the project posted above is "transformations" you can run on transcripts: they feed the text into an LLM to clean it up. If you're willing to pay for the tokens, I think you could not only remove filler words but probably even get the semantically aware editing (corrections) you're talking about.
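Roughly speaking, a transformation like that is just a prompt wrapped around the transcript. Here's a sketch with the OpenAI client; the model name and prompt wording are my own guesses, not what the project actually ships:

    # pip install openai
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def clean_transcript(raw: str) -> str:
        # Illustrative prompt, not the project's actual "transformation".
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any chat model would do
            messages=[
                {"role": "system", "content": (
                    "Clean up this dictation transcript: remove filler words "
                    "and false starts, and apply any spoken self-corrections "
                    "(e.g. 'no wait, make that Tuesday'). Keep the speaker's "
                    "wording otherwise."
                )},
                {"role": "user", "content": raw},
            ],
        )
        return resp.choices[0].message.content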

^Yep, unfortunately, the best option right now seems to be piping the output into another LLM for cleanup, which is what we try to help you do in Whispering. Recent transcription models don't have very good built-in inference/cleanup; Whisper only has the very weak "prompt" parameter. This is probably by design, to keep the models lean, specialized, and performant at their task.
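For context, that parameter just feeds the decoder some preceding text to bias its style and vocabulary; in the open-source whisper package it's exposed as initial_prompt. It helps with spelling and punctuation hints, but it won't do real cleanup:

    import whisper

    model = whisper.load_model("base")
    # initial_prompt nudges vocabulary/punctuation; it does not strip
    # filler words or apply corrections the way a separate LLM pass can.
    result = model.transcribe(
        "note.m4a",  # placeholder path
        initial_prompt="Use proper punctuation. Vocabulary: FUTO, Whisper, LLM.",
    )
    print(result["text"])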

By "try to help", do you mean that it currently does this, or is that functionality on the way?
