Will there be an expanded explanation of "Of course it is biased! There’s no way to train the network otherwise!" soon?

I'm still struggling to understand why that is the case. As far as I understand the training, in a bad case (probably mostly at the start) you could happen to learn the wrong gate early and then have to back out of it. Why doesn't the same thing happen without the bias toward pass-thru? I get why biasing toward pass-thru would make training faster, but not why leaving it out would prevent convergence.
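
For concreteness, here is a minimal sketch of the kind of pass-thru biasing I'm picturing (a hypothetical highway-style gated layer, not necessarily the exact setup in the post): the gate's bias is initialized so the sigmoid starts near 0 and the layer begins close to the identity.

```python
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    """Highway-style layer: output = g * transform(x) + (1 - g) * x."""

    def __init__(self, dim, gate_bias_init=-2.0):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        # Bias toward pass-thru: a negative gate bias makes sigmoid(gate)
        # start near 0, so the layer initially passes x through almost unchanged.
        nn.init.constant_(self.gate.bias, gate_bias_init)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1.0 - g) * x
```

With `gate_bias_init=0.0` instead, the gate starts around 0.5 and the network has to learn which paths to open or close from scratch, which is the case where I don't see why it would fail to converge rather than just converge more slowly.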