If anyone's interested in a janky-but-works-great dictation setup on Linux, here's mine:
On key press, start recording microphone to /tmp/dictate.mp3:
# Save up to 10 mins. Minimize buffering. Save pid
ffmpeg -f pulse -i default -ar 16000 -ac 1 -t 600 -y -c:a libmp3lame -q:a 2 -flush_packets 1 -avioflags direct -loglevel quiet /tmp/dictate.mp3 &
echo $! > /tmp/dictate.pid
On key release, stop recording, transcribe with whisper.cpp, trim whitespace and print to stdout: # Stop recording
kill $(cat /tmp/dictate.pid)
# Transcribe
whisper-cli --language en --model $HOME/.local/share/whisper/ggml-large-v3-turbo-q8_0.bin --no-prints --no-timestamps /tmp/dictate.mp3 | tr -d '\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
I keep these in a dictate.sh script and bind to press/release on a single key. A programmable keyboard helps here. I use https://git.sr.ht/%7Egeb/dotool to turn the transcription into keystrokes. I've also tried ydotool and wtype, but they seem to swallow keystrokes. bindsym XF86Launch5 exec dictate.sh start
bindsym --release XF86Launch5 exec echo "type $(dictate.sh stop)" | dotoolc
This gives a very functional push-to-talk setup.I'm very impressed with https://github.com/ggml-org/whisper.cpp. Transcription quality with large-v3-turbo-q8_0 is excellent IMO and a Vulkan build is very fast on my 6600XT. It takes about 1s for an average sentence to appear after I release the hotkey.
I'm keeping an eye on the NVidia models, hopefully they work on ggml soon too. E.g. https://github.com/ggml-org/whisper.cpp/issues/3118.
before i even bother opening that github: does it work on windows? so far, all of the whisper "clones" run poorly, if at all, on windows. I do have a 3060 and a 1070ti i could use just for whisper on linux, but i have this 3090 on my windows desktop that works "fine" for whisper, TTS, SD, LLM.
Whisper on windows, the openai-whisper, doesn't have these q8_0 models, it has like 8 models, and i always get an error about triton cores (something about timeptamping i guess), which windows doesn't have. I've transcribed >1000 hours of audio with this setup, so i'm used to the workflow.
If you want to stay near the bleeding edge with this stuff, you probably want to be on some kind of linux (or lacking that, Mac). Windows is where stuff just trickles down to eventually.
you're not wrong; i really should try just running linux and seeing how good the steam gaming layer is these days. And if SDR# runs on linux under wine or whatever.
This is my favorite kind of software, to write and to use. I am reminded of Upton Sinclair's evergreen quote:
> It is difficult to get a man to understand something, when his salary depends upon his not understanding it!
in the special case where the thing to be understood is "your app doesn't need to be a Big Fucking Deal". Maybe it pleases some users to wrap this in layers of additional abstraction and chrome and clicky buttons and storefronts, but in the end the functionality is already there with a couple of FOSS projects glued together in a bash script.
I used to think the likes of Suckless were brutalist zealots, but more and more I think they (and the Unix patriarchs) were right and the path to enlightenment is expressed in plain text.
Thanks for sharing a great alternative! It seems that that setup can go a long way for Linux users.