I was skeptical upon hearing the figure, but various sources do indeed back it up, and [0] is a pretty interesting paper (old, but still relevant: human transcribers haven't changed in accuracy).
[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...
I think it's actually hard to verify how correct a transcription is, at scale. Curious where those error-rate numbers come from, because they should be measured on people actually doing their job.
It can depend a lot on different factors like:
- familiarity with the accent and/or speaker;
- speed and style/cadence of the speech;
- any background audio that muffles or distorts the speech;
- etc.
It can also take multiple passes to get a decent transcription.
You missed a giant factor: domain knowledge. Transcribing something outside of your knowledge realm is very hard. I posted above about transcribing the commentary of a motorbike race where the commentators only used the slang names of the riders.
Most of these errors will not be meaningful. Real speech is full of ambiguities. 3% is low.
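For context on where figures like "3%" come from: transcription accuracy is usually reported as word error rate (WER), the word-level edit distance between a reference transcript and the hypothesis, divided by the reference length. A minimal sketch, assuming simple whitespace tokenization (real evaluations also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, computed with a single rolling row.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # deletion
                       d[j - 1] + 1,        # insertion
                       prev + (r != h))     # substitution (or match)
            prev = cur
    return d[len(hyp)] / len(ref)

# One dropped word out of six -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER treats every error equally, which is the commenter's point: a metric of 3% says nothing about whether the errors change the meaning.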