"Speaker diarization" is the term you're looking for, and pyannote.audio is probably the most popular open-source tool for it right now.

I've been meaning to use it together with a fine-tuned Whisper model to produce speaker-labelled transcriptions, but I haven't found the time yet.
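
For what it's worth, here is a rough sketch of how the two outputs might be combined, assuming you already have speaker turns from a diarization step and timestamped text segments from a transcription step. The tuple formats and the names `overlap` and `assign_speakers` are made up for illustration; they are not the actual output types or API of pyannote.audio or Whisper, so you would need to convert the real outputs into this shape first. Each text segment simply gets the speaker whose turn overlaps it the most in time.

```python
from typing import List, Optional, Tuple

# Hypothetical formats (assumptions, not real library output types):
#   speaker turn       = (start_sec, end_sec, speaker_label)
#   transcript segment = (start_sec, end_sec, text)
Turn = Tuple[float, float, str]
Segment = Tuple[float, float, str]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Return the length in seconds of the overlap between two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    turns: List[Turn], segments: List[Segment]
) -> List[Tuple[Optional[str], str]]:
    """Label each transcript segment with the speaker that overlaps it most.

    A segment that overlaps no speaker turn is labelled None.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker = None
        best_overlap = 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = speaker
        labeled.append((best_speaker, text))
    return labeled


# Made-up example data, purely for illustration.
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
segments = [(0.5, 3.5, "Hello there."), (3.8, 8.0, "Hi, how are you?")]

for speaker, text in assign_speakers(turns, segments):
    print(f"{speaker}: {text}")
```

Picking the speaker with the largest overlap is only one simple option. It will mislabel segments where two people talk within the same chunk of text, so word-level timestamps would likely give cleaner results.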