If your use case is meetings, https://github.com/fastrepl/hyprnote is for you. OWhisper is more like a headless version of it.

Can you describe how it picks out different voices? Does it need separate audio channels, or does it recognize different voices on the same audio input?

It separates mic and speaker audio into two channels, so you can reliably tell "what you said" from "what you heard".
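
To make that concrete, here's a minimal sketch (not OWhisper's actual code) of what two-channel attribution buys you, assuming a stereo recording where channel 0 is the mic and channel 1 is the system/loopback audio; the filenames are made up:

```python
# Split a two-channel recording: channel 0 = mic ("what you said"),
# channel 1 = loopback ("what you heard"). Assumes a stereo WAV.
import soundfile as sf

audio, sample_rate = sf.read("meeting.wav")  # shape: (frames, 2)
mic_track = audio[:, 0]       # your microphone
speaker_track = audio[:, 1]   # what came out of your speakers

# Each track can now be transcribed independently, so every line of
# the transcript is reliably attributed to "me" or "them".
sf.write("me.wav", mic_track, sample_rate)
sf.write("them.wav", speaker_track, sample_rate)
```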

For splitting speakers within a single channel, we need an AI model for speaker diarization. It isn't implemented yet, but I think we'll be in good shape sometime in September.
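
For anyone curious what that step generally looks like: one common approach (not what OWhisper will necessarily ship; the model and token here are illustrative) is an off-the-shelf diarization pipeline like pyannote.audio:

```python
# Illustrative only: split speakers within one channel using a
# pretrained diarization model.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # gated model: needs a Hugging Face token
)
diarization = pipeline("them.wav")

# Emit (start, end, speaker) turns that can be merged with the transcript.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```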

We also have a transcript editor where you can easily split segments and assign speakers.
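
As a rough sketch of what "split a segment, assign a speaker" means as a data operation (these names are hypothetical, not Hyprnote's internals):

```python
# A transcript is a list of timed, speaker-labeled segments; splitting
# one at a timestamp yields two segments the user can relabel.
from dataclasses import dataclass, replace

@dataclass
class Segment:
    start: float  # seconds
    end: float
    speaker: str
    text: str

def split_segment(seg: Segment, at: float, left_text: str, right_text: str):
    """Split one segment at a timestamp, keeping the speaker on both halves."""
    assert seg.start < at < seg.end
    return (
        replace(seg, end=at, text=left_text),
        replace(seg, start=at, text=right_text),
    )

seg = Segment(0.0, 8.0, "Speaker 1", "hi there yes I can hear you")
first, second = split_segment(seg, 3.2, "hi there", "yes I can hear you")
second = replace(second, speaker="Speaker 2")  # reassign after the split
```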