Native diarization, this looks exciting. edit: or not, no diarization in real-time.
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
~9GB model.
The diarization is on Voxtral Mini Transcribe V2, not Voxtral Mini 4B.
Do you have experience with that model for diarization? Does it feel accurate, and what's its realtime factor on a typical GPU? Diarization has been the biggest thorn in my side for a long time.
You can test it yourself for free on https://console.mistral.ai/build/audio/speech-to-text I tried it on an English-language podcast episode, and apart from identifying one host as two different speakers (only once, for a few sentences at the start), the rest was flawless from what I could see.
Amazing. Thank you.
> Do you have experience with that model
No, I just heard about it this morning.
Ahh, yeah, and it's explicitly not working for realtime streams. Good catch!