The queuing discipline used by default (pfifo_fast) is barely more than 3 FIFO queues bundled together. The 3 queues (bands) give you a bare-minimum form of traffic prioritisation, where queue 0 > 1 > 2, and you can set the IP TOS byte (or the socket priority) to steer your traffic into a particular queue. If there's anything in queue 0, it must be processed before anything in queue 1 gets touched, and so on.
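
As a rough sketch of that steering, here's how it looks from userspace in Python. The socket setup is illustrative; with pfifo_fast's default priomap, a TOS of 0x10 (minimise delay) ends up in band 0, plain best-effort traffic in band 1, and 0x08 (maximise throughput) in band 2.

```python
import socket

# Common TOS values; the kernel derives the skb priority (and hence
# the pfifo_fast band) from this byte via the qdisc's priomap.
IPTOS_LOWDELAY = 0x10     # -> band 0 with the default priomap
IPTOS_THROUGHPUT = 0x08   # -> band 2 with the default priomap

def socket_with_tos(tos: int) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Stamp the TOS byte on all packets this socket sends.
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return s

interactive = socket_with_tos(IPTOS_LOWDELAY)   # should land in band 0
bulk = socket_with_tos(IPTOS_THROUGHPUT)        # should land in band 2
```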

Those queues operate on a purely head-of-line basis. If whatever is at the head of queue 0 is blocked in any way, the entire queue behind it gets stuck, regardless of whether the traffic behind it is headed for the same destination or a completely different one. A toy model of the effect is sketched below.
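
This is a toy Python model of one FIFO band, not the kernel's actual dequeue path: packets are (destination, payload) pairs, and any destination in `stalled` can't accept traffic. Because the loop only ever inspects the head of the queue, one stuck packet freezes everything behind it, healthy destinations included.

```python
from collections import deque

def drain(fifo: deque, stalled: set) -> list:
    """Dequeue packets until the head is blocked."""
    delivered = []
    while fifo:
        dest, _payload = fifo[0]
        if dest in stalled:
            break  # head is stuck, so everything behind it waits too
        delivered.append(fifo.popleft())
    return delivered

q = deque([("flaky-host", "p1"),
           ("healthy-host", "p2"),
           ("healthy-host", "p3")])
print(drain(q, stalled={"flaky-host"}))  # [] -- nothing gets through
```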

I've seen situations where a glitching network card caused serious knock-on impacts across a whole cluster: the card would hang or drop packets, which would block the qdisc on a completely healthy host that happened to be mid-conversation with it, which in turn affected any other host talking to that healthy host. A tiny glitch caused much wider impact than you'd expect.

The same kind of effect would happen when a VM went through live migration: the tiny, brief pause at switchover would cause a latency spike all over the place.

There are smarter (classless) alternatives like fq_codel that can mitigate some of this, but you do pay a small processing overhead on every packet, because now you have a queuing discipline that actually has to track some state.
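
For reference, switching an interface over is a one-liner with tc. This is a minimal sketch assuming iproute2 and root privileges, and "eth0" is a placeholder interface name:

```python
import subprocess

def use_fq_codel(dev: str = "eth0") -> None:
    # Replace the root qdisc on the given interface with fq_codel.
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", dev, "root", "fq_codel"],
        check=True,
    )

if __name__ == "__main__":
    use_fq_codel()
    # To make it the default for newly created interfaces instead:
    #   sysctl -w net.core.default_qdisc=fq_codel
```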