I mean, it is a tough problem; you'd really have to voiceprint each speaker. But I'm sure this is technically possible, considering voice cloning is pretty commonplace now.
And yeah, the transcription quality also drops a lot, whereas humans are still quite capable of reading it. Sometimes when I read the transcript I'm quite surprised it manages to make any intelligible minutes out of it at all.
I just don't understand how Microsoft can position this feature as a minute-taking replacement when it's not ready for super common use cases.