Does it identify separate speakers (diarization)?

What's the stack, if I may ask? (I believe WhisperX handles diarization.)