I think the point is real-time use; this is meant for conversations rather than for transcribing audio files.

That quote was for the non-realtime model.