You are looking for speaker diarization. No one is doing this well currently on device (in macOS land at least).
Or in the cloud tbh