Depends on the model!
If you run a smaller Distil-Whisper variant AND you optimize the decoder to run on the Apple Neural Engine, you can get latency down to ~300ms without any backend infra.
The issue is that the smaller models tend to suck, which is why fine-tuning is valuable.
My hypothesis is that you can distill a giant model like Gemini into a tiny Whisper-sized student model,
but it depends on the machine you are running, which is why local AI is a PITA.
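For what it's worth, the giant-teacher-into-tiny-student idea is just standard knowledge distillation: train the small model to match the big model's softened output distribution. A minimal sketch of the usual temperature-scaled distillation loss in plain Python — names are illustrative and this isn't tied to any specific Whisper or Gemini tooling:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-softened softmax: higher T flattens the distribution,
    # exposing more of the teacher's "dark knowledge" to the student.
    z = [x / T for x in logits]
    m = max(z)  # subtract max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in the standard distillation recipe.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * (math.log(pi) - math.log(qi))
                       for pi, qi in zip(p, q))
```

The loss is zero when the student already matches the teacher and grows as the two distributions diverge; in practice you'd blend it with the normal hard-label loss during fine-tuning.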