Nice one! For Linux folks, I developed https://github.com/goodroot/hyprwhspr.

On Linux, there's access to the latest Cohere Transcribe model, and it works very, very well. It requires a GPU, though. Larger local models generally shouldn't require a second model for cleanup.

Have you compared WhisperKit to faster-whisper or similar? You might be able to run large-v3-turbo successfully and remove the need for cleanup.

Incidentally, waiting for Apple to blow this all up with native STT any day now. :)

How does it compare to the more well established https://github.com/cjpais/handy? Are there any stand out features (for either option)? What was the reason for writing your own rather than using or improving existing software?

Not sure I know what you mean by IR...

But in this case I built hyprwhspr for Linux (Arch at first).

The goal was (is) the absolute best performance, in both accuracy & speed.

Python, via CUDA, on an NVIDIA GPU, is where that exists.

For example:

The #1 model on the Hugging Face ASR (automatic speech recognition) leaderboard is Cohere Transcribe, and it is not yet two weeks old.

The ecosystem choices allowed me to hook it up in a night.

Other hardware types also work great on Linux due to its adaptability.

In short, the local STT peak is Linux/Wayland.

IR was a typo; I meant "it" (fixed). I blame the phone keyboard plus insufficient proofreading on my part.

If this needs NVIDIA GPU acceleration for good performance, it's not useful to me; I have Intel graphics, and handy works fine.

It works well with anything. :)

That said: If handy works, no need whatsoever to change.

I've been running Whisper large-v3 on an M2 Max through a self-hosted endpoint, and honestly the accuracy is good enough that I stopped bothering with cleanup models. The bigger annoyance for me was latency on longer chunks; anything over 30 seconds starts feeling sluggish even with Metal acceleration. Haven't tried WhisperKit specifically, but curious how it handles longer audio compared to the full model.

Ah yeah, longform is interesting.

Not sure how you're running it, via whichever "app thing", but...

On resource limited machines: "Continuous recording" mode outputs when silence is detected via a configurable threshold.

This outputs as you speak in more reasonable chunks; in aggregate "the same output" just chunked efficiently.
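To make the idea concrete, here's a minimal sketch of silence-threshold chunking in pure Python: RMS energy over fixed-size PCM frames, flushing a chunk once enough quiet frames pile up. The function names, frame size, and threshold are all illustrative, not hyprwhspr's actual implementation:

```python
import math

def rms(frame):
    """Root-mean-square energy of a frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def chunk_on_silence(samples, frame_size=160, threshold=500, min_silent_frames=5):
    """Split a sample stream into chunks wherever a run of quiet
    frames (RMS below threshold) grows long enough."""
    chunks, current, silent_run = [], [], 0
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        current.extend(frame)
        if rms(frame) < threshold:
            silent_run += 1
            if silent_run >= min_silent_frames:
                chunks.append(current)       # flush on sustained silence
                current, silent_run = [], 0
        else:
            silent_run = 0
    if current:
        chunks.append(current)               # trailing partial chunk
    return chunks
```

In a real pipeline each flushed chunk would go straight to the model, so transcription overlaps with speaking instead of waiting for the full recording.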

Maybe you can try hackin' that up?

Yeah, that makes sense; chunking on silence would sidestep the latency issue pretty cleanly. I've been running it through a basic FastAPI wrapper, so it just takes whatever audio blob gets thrown at it, no chunking logic on the server side. Might be worth adding a VAD pass before sending to Whisper, though; that would cut down on processing dead air too.
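A server-side dead-air filter could start as simple as this: an RMS energy gate that drops silent frames before the audio ever reaches the model. This is a stand-in for a real VAD (webrtcvad, Silero, etc.), and the frame size and threshold are made-up numbers you'd tune for your input:

```python
import math

def strip_dead_air(samples, frame_size=160, threshold=500):
    """Keep only frames whose RMS energy clears the threshold,
    dropping silent stretches before handing audio to the model."""
    kept = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        energy = math.sqrt(sum(s * s for s in frame) / len(frame))
        if energy >= threshold:
            kept.extend(frame)
    return kept
```

You'd run this on the decoded PCM inside the endpoint handler, before the transcribe call; an energy gate will clip very quiet speech, which is exactly why a proper VAD is the better long-term choice.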

Nice, I've been using Hyprwhspr on Omarchy daily for a while now, it's been awesome, thanks very much.

Thanks for sharing! I was literally getting ready to build, essentially, this. Now it looks like I don't have to!

Have you ever considered using a foot-pedal for PTT?

Apple incidentally already has native STT, but for some reason they just don't use a decent model yet.

They do, and they even have that nice microphone F5 key for it, and an ideal OS level API making the input experience >perfect<.

Apparently they do have a better model, they just haven't exposed it in their own OS yet!

https://developer.apple.com/documentation/speech/bringing-ad...

Wonder what's the hold up...

For footpedal:

Yes, conceptually it’s just another evdev-trigger source, assuming the pedal exposes usable key/button events.

Otherwise we’d bridge it into the existing external control interface. Either way, hooks are there. :)
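For a sense of how small that trigger source is: a pedal that shows up as an input device emits ordinary key events, so the core is just mapping (type, code, value) tuples to start/stop. The sketch below is a hypothetical pure dispatcher; in practice you'd feed it events from python-evdev's `read_loop()`, and your pedal's key code may differ (check with `evtest`):

```python
# Standard Linux input constants (from linux/input-event-codes.h);
# EV_KEY = 1 is the key-event type, KEY_F13 = 183 is a code many
# programmable pedals emit. Your pedal's code may differ.
EV_KEY = 1
KEY_F13 = 183

def ptt_action(ev_type, code, value, pedal_key=KEY_F13):
    """Map a raw (type, code, value) input event to a PTT action:
    press (value 1) starts recording, release (value 0) stops."""
    if ev_type != EV_KEY or code != pedal_key:
        return None
    if value == 1:
        return "start"
    if value == 0:
        return "stop"
    return None  # value 2 = key autorepeat; ignore it
```

Anything that evdev sees as a key works the same way, which is why a pedal is conceptually no different from a hotkey.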

The only issue with the Apple models is that they do not detect languages automatically, nor switch when you change languages between sentences.

Parakeet does both just fine.

sorry, PTT?

push-to-talk.

ty

[deleted]

Looks like there's a nearly identically named one for Hyprland.

Also, I wish it were on nixpkgs, where at least it would be almost guaranteed to build forever =)