I’ve been thinking about this problem for decades. Load feedback is a wonderful idea, but extremely difficult to put into practice. Every service has a unique architecture, and even within a single service, different requests can consume significantly different resources. This makes it difficult, if not impossible, to provide a single quantitative number in response to the question “what request rate can I send you?” It also requires tight coupling between the load balancer and the backends, which has problems of its own.
I haven’t seen anyone really solve this problem at scale, with the possible exception of YouTube’s mechanism (https://research.google/pubs/load-is-not-what-you-should-bal...), but that’s specific to them and doesn’t generalize to arbitrary workloads.
thanks for the YT paper!
my point is there's no need to try (and fail) to define some universal backpressure semantics between coupling points. this can be done locally, and even after the fact: every time there's an outage, or better yet a "near miss", the signal to listen to will show up.
and if not, then not, which means (as you said) that link likely doesn't have this kind of simple semantics. maybe because the integration isn't request-response, or isn't otherwise structured to provide that kind of legibility, even if it's causally important for what's downstream.
simply thinking about this during post-mortems, having the metrics available (which is a given anyway in these complex high-availability systems), and having the option in the SDK seems like the way forward
(yes, I know this is basically the circuit breaker and other Netflix-evangelized ideas with extra steps :))
The simplest and most effective strategy we know today for automatic recovery, one that gives the impacted service the ability to avoid entering a metastable state, is for clients to implement retries with exponential backoff. No circuit-breaker-type functionality is required. Unfortunately, it requires that clients be well behaved.
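For concreteness, here is a minimal sketch of what a "well-behaved" client retry loop might look like: capped exponential backoff, with full jitter added (jitter isn't mentioned above, but it's the usual companion so that clients don't retry in lockstep). The function names, defaults, and bare `Exception` handling are illustrative, not from any particular SDK.

```python
import random
import time

def call_with_backoff(do_request, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry a request with capped exponential backoff and full jitter.

    `do_request` is any zero-argument callable that raises on failure.
    The parameters are illustrative defaults, not recommendations.
    """
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter spreads retries out so clients don't synchronize.
            time.sleep(random.uniform(0, ceiling))
```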
Also, circuit breakers have issues of their own:
“Even with a single layer of retries, traffic still significantly increases when errors start. Circuit breakers, where calls to a downstream service are stopped entirely when an error threshold is exceeded, are widely promoted to solve this problem. Unfortunately, circuit breakers introduce modal behavior into systems that can be difficult to test, and can introduce significant additional time to recovery. We have found that we can mitigate this risk by limiting retries locally using a token bucket. This allows all calls to retry as long as there are tokens, and then retry at a fixed rate when the tokens are exhausted.” https://aws.amazon.com/builders-library/timeouts-retries-and...
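A rough sketch of the local retry token bucket that quote describes: bursts of retries are allowed while tokens remain, and once the bucket is empty, retries are capped at the refill rate. Capacity and refill rate below are placeholders, not the values AWS uses.

```python
import time

class RetryTokenBucket:
    """Local retry limiter: retries draw from a bucket that refills at a
    fixed rate, bounding retry amplification without a circuit breaker.
    Capacity and refill rate are made-up placeholders.
    """

    def __init__(self, capacity=100, refill_per_second=5):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_second = refill_per_second
        self.last_refill = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # no token: skip the retry and fail fast instead
```

A client would call try_acquire() before each retry (not before the first attempt), so steady-state traffic is unaffected and only the retry amplification is bounded.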
Consider a situation in which all the clients have circuit breakers. All of them enter the open state once the trigger condition is met, which drops request load on the service to zero. Your autoscaler reduces capacity to the minimum level in response. Then all the circuit breakers reset to the closed state, and your service experiences a sudden rush of normal- or above-normal traffic, causing it to immediately exhaust available capacity. It’s a special case of bimodal behavior, which we try to avoid as a matter of sound operational practice.
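To make that modal behavior concrete, here is a bare-bones breaker with the usual closed/open/half-open states. The thresholds and timeouts are arbitrary placeholders; the point is that the whole client population flips between "send everything" and "send nothing" around them, and if every client opens and resets on roughly the same schedule, the backend sees exactly the zero-then-everything pattern described above.

```python
import time

class CircuitBreaker:
    """Bare-bones closed/open/half-open circuit breaker.
    Thresholds and timeouts are arbitrary placeholders.
    """
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = self.CLOSED
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == self.OPEN:
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = self.HALF_OPEN  # probe with real traffic again
                return True
            return False  # load offered to the backend drops to zero here
        return True

    def record_success(self):
        self.failures = 0
        self.state = self.CLOSED

    def record_failure(self):
        self.failures += 1
        if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = self.OPEN
            self.opened_at = time.monotonic()
```

Jittering reset_timeout per client softens the synchronized reopen, but it doesn't remove the underlying bimodality.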
Thundering herd is a known failure mode when service is restored after an outage. You never let your load balancer and the API boxes behind it dip below some statically defined low-water mark; the wasted money is better than going down when the herd shows up. When it does show up, as noted, a token bucket rate limiter returns 429s while the herd slowly gets let back onto the system. Even if the app servers can take it, it's not a given that the queue or database systems behind them (especially if that's what went down) can absorb the herd as well.
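Server-side, the same token bucket idea shows up as admission control: admit what the slowest dependency can absorb and shed the rest with 429s. A hypothetical sketch, assuming a handler interface and limits that are entirely made up:

```python
import time

class AdmissionLimiter:
    """Server-side token bucket: admit up to `rate` requests/second with
    bursts up to `burst`; everything beyond that gets an immediate 429 so
    the queue or database behind the API is never offered more than it can
    absorb. The numbers are placeholders, not recommendations.
    """

    def __init__(self, rate=500.0, burst=1000.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def handle(self, request, process):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            # Shed load early and cheaply; well-behaved clients back off on 429.
            return 429, "Too Many Requests"
        self.tokens -= 1
        return process(request)
```

During recovery you can start `rate` well below normal and ramp it up as the backing systems warm up, which is one way to "slowly let the herd back onto the system" (that ramp-up detail is my addition, not from the comment above).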