What's conspicuously missing is the plot of performance when you do have a well-tuned queue in front of the service. Yes, having a queue becomes less important the more backend servers you have, but here, even with 10 servers, the plot shows your latency remains >25% worse than it would be with a queue. Also missing is a discussion of how the variance in processing times affects you when you rely on load balancing alone.

> What's conspicuously missing is the plot of performance when you do have a well tuned queue in front of the service.

As in between the service and the load balancer? There's already an infinite queue in the load balancer. You can try that out on https://stability-sim.systems/ to see the effect, but the short version is that (in this model) it makes things worse.

If you're saying that the queue in the load balancer should be limited in size to reduce tail latency, then I agree.
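For a concrete picture of a size-limited queue, here is a minimal sketch using Python's standard `queue.Queue`. The `try_enqueue` helper and the limit of 2 are made up for illustration; a real load balancer would expose this as a configuration setting rather than code. Requests that arrive while the queue is full are rejected immediately instead of waiting behind a long backlog.

```python
import queue

def try_enqueue(q, request):
    """Hypothetical helper: accept the request only if the bounded queue has room."""
    try:
        q.put_nowait(request)  # raises queue.Full when maxsize is reached
        return True
    except queue.Full:
        return False           # shed load instead of adding to tail latency

# Assumed limit of 2 pending requests, purely for illustration.
pending = queue.Queue(maxsize=2)
accepted = [try_enqueue(pending, name) for name in ("a", "b", "c")]
```

The third request is turned away because two are already waiting, so nothing ever waits behind more than one other queued request.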

No, I mean when you have a queue broker that the backends can pull work from when they become idle, rather than relying on load balancing which will send work to backends while they're still busy.

This scenario already works that way. The very first sentence says "servers, each of which can only handle a single concurrent request, and has no internal queuing". This implies that the load balancer waits for a server to finish a request then immediately sends the next one.

I don't believe it does. As I understand it, the load balancer has a queue in which it can buffer infinite requests, but it drains that queue by pushing work to the backend servers in what's probably a round-robin fashion. So there is secondary queueing at each server. Even the "least connections" strategies available through some load balancers do not usually behave as you might expect (by always sending the next request to a server that's idle). Pull-based load balancing via a queue has its own downsides but the big upside is to make latency essentially a constant low overhead regardless of the number of servers in the typical case.
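One rough way to compare the two is a small simulation. This is a toy sketch under my own assumptions, not the model behind the article's plots: Poisson arrivals, exponentially distributed service times with mean 1, strict round-robin for the push case, and an unbounded shared queue for the pull case. `simulate` and its parameters are hypothetical names.

```python
import heapq
import random

def simulate(num_servers, num_requests, load, strategy, seed=0):
    """Return mean latency (waiting + service), in units of mean service time."""
    rng = random.Random(seed)
    arrival_rate = load * num_servers   # mean service time is assumed to be 1.0
    free_at = [0.0] * num_servers       # when each server finishes its assigned work
    now = 0.0
    total_latency = 0.0
    for n in range(num_requests):
        now += rng.expovariate(arrival_rate)   # next arrival time
        service = rng.expovariate(1.0)         # this request's service time
        if strategy == "round_robin":
            # Push: request n goes to server n mod N, even if that server is busy.
            i = n % num_servers
            start = max(now, free_at[i])
            free_at[i] = start + service
        elif strategy == "pull":
            # Pull: the server that becomes free first takes the next request
            # from a shared FIFO queue. free_at is used as a min-heap here.
            earliest = heapq.heappop(free_at)
            start = max(now, earliest)
            heapq.heappush(free_at, start + service)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        total_latency += start + service - now
    return total_latency / num_requests

# Same seed, so both strategies see identical arrivals and service times.
rr_latency = simulate(10, 20000, 0.8, "round_robin")
pull_latency = simulate(10, 20000, 0.8, "pull")
```

Under these assumptions the pull variant should come out with the lower mean latency, because a request never waits behind a busy server while another server sits idle. How large the gap is depends on the load and on the service-time distribution you plug in.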

I think your imagination decided to rapidly overcomplicate what is literally (literally literally) a Queuing Theory 101 example.

If I were to guess, there weren't any "backend servers" at all. It was just an array of random increasing numbers (standing for request arrival times) and arrays of numbers with a minimum spacing (standing for the times at which each consumer took a request).

There are no connections to "least-ify" the strategy about. There's no difference between consumers, no matter how many requests each has processed.