I'd love to find a tool which could recognise a few different speakers so that I could automatically transcribe 1:1 sessions. On top of that, I'd definitely want to feed the output to an LLM to clean up the notes (removing every "umm" and similar filler) and to do context-aware spell checking.

The LLM part should be very doable, but I'm not sure whether speaker recognition exists in a sufficiently working state?

Speaker "diarization" is what you're looking for, and currently the most popular solution is pyannote.audio.

I've been meaning to use it in conjunction with a fine-tuned Whisper model to produce transcriptions. Just haven't found the time yet.
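For what it's worth, the glue between the two is mostly timestamp alignment: pyannote gives you speaker turns, Whisper gives you timed text segments, and you assign each segment to the speaker whose turn overlaps it most. A rough sketch with hypothetical data (the tuple shapes here are my own assumption for illustration, not either library's actual output format):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, turns):
    """Attach a speaker label to each transcription segment.

    segments: (start, end, text) tuples, e.g. from Whisper's output
    turns:    (start, end, speaker) tuples, e.g. from pyannote diarization
    """
    labelled = []
    for seg_start, seg_end, text in segments:
        # Pick the diarization turn with the largest temporal overlap.
        best = max(
            turns,
            key=lambda t: overlap(seg_start, seg_end, t[0], t[1]),
            default=None,
        )
        if best and overlap(seg_start, seg_end, best[0], best[1]) > 0:
            speaker = best[2]
        else:
            speaker = "UNKNOWN"
        labelled.append((speaker, text))
    return labelled

# Hypothetical data for illustration:
turns = [(0.0, 4.2, "SPEAKER_00"), (4.2, 9.8, "SPEAKER_01")]
segments = [(0.5, 3.9, "How was your week?"), (4.5, 8.0, "Pretty good, thanks.")]
print(label_segments(segments, turns))
# → [('SPEAKER_00', 'How was your week?'), ('SPEAKER_01', 'Pretty good, thanks.')]
```

In practice you'd also want to handle segments that straddle a speaker change (split them at the turn boundary rather than assigning the whole segment to one speaker), but max-overlap is a reasonable first pass.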

Shameless plug -- check out speechischeap.com

I spent three months perfecting the speaker diarization pipeline and I think you'll be quite pleased with the results.

How well does it work with multiple languages?