> It's also not great at meeting summaries, especially those where many speakers are in a room sharing the same microphone.
It is astonishingly poor at this. My intuition was that it should be good at it (it is basically a translation problem, right? And LLMs are fundamentally translation systems), but the practical results are so poor. It's not just misidentifying speakers (frequently saying PersonX responded to PersonX) but arriving at conclusions that are the complete opposite of what was actually said.
I'm genuinely intrigued as to what approaches have been taken in this space and what the "hard problem" is that is stopping it from being good.
I mean, it is a tough problem: you'd really have to voiceprint each speaker. But I'm sure this is technically possible, considering voice cloning is pretty commonplace now.
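To make "voiceprinting" a bit more concrete, here is a rough sketch of how it could work, not a claim about what any product actually does. It assumes each chunk of audio has already been turned into an embedding vector by some speaker-embedding model (that part is hypothetical and not shown), and it greedily groups chunks by cosine similarity. The function names and the `threshold` value are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_speakers(embeddings, threshold=0.8):
    """Greedy online clustering: one integer speaker label per segment.

    `embeddings` is assumed to come from a speaker-embedding model
    (hypothetical here). `threshold` is an illustrative value, not a
    tuned one.
    """
    centroids = []  # running mean embedding per speaker seen so far
    counts = []     # number of segments averaged into each centroid
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No known voice is close enough: treat as a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Fold this segment into the matched speaker's running mean.
            n = counts[best]
            centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best], emb)
            ]
            counts[best] = n + 1
        labels.append(best)
    return labels


# Toy 2-D "embeddings": two clearly different voices, interleaved.
segments = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.95], [1, 0.05]]
print(assign_speakers(segments))
```

The clustering itself is the easy part. With a single room microphone, people talk over each other, so one chunk can contain two voices and its embedding lands between clusters, which may be part of why speaker attribution goes wrong.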
And yeah, the transcription quality also drops a lot, even where humans are still quite capable of following it. Sometimes when I read the transcript I'm quite surprised it manages to make any intelligible minutes out of it at all.
I just don't understand how Microsoft can position this feature as a minute-taking replacement when it's not ready for really common use cases.