I believe it is possible within Envoy to detect a bad backend and automatically remove it from the load balancing pool, so why can't the proxy determine that certain backend instances are unavailable and remove them from the pool? No coordination is needed, and it also handles other cases where the backend is bad, such as overload or deadlock.
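Something like this per-proxy logic is what I have in mind: each proxy ejects failing backends from its own pool, with no coordination at all. (A toy Go sketch of passive health checking, not Envoy's actual implementation; the thresholds and names are made up.)

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // Backend is one upstream instance in a proxy's purely local pool.
    type Backend struct {
        Addr            string
        consecutiveFail int
        ejectedUntil    time.Time
    }

    // Pool does passive health checking: this proxy, on its own, stops
    // sending traffic to backends that keep failing. No coordination.
    type Pool struct {
        mu                 sync.Mutex
        backends           []*Backend
        next               int
        maxConsecutiveFail int           // e.g. 5 failures in a row
        ejectionDuration   time.Duration // e.g. 30s out of rotation
    }

    // Pick returns the next backend that isn't currently ejected, or nil.
    func (p *Pool) Pick() *Backend {
        p.mu.Lock()
        defer p.mu.Unlock()
        now := time.Now()
        for i := 0; i < len(p.backends); i++ {
            b := p.backends[(p.next+i)%len(p.backends)]
            if now.After(b.ejectedUntil) {
                p.next = (p.next + i + 1) % len(p.backends)
                return b
            }
        }
        return nil
    }

    // ReportResult is called after every proxied request. Timeouts, connect
    // errors and 5xx-style responses all count as failures, so overload and
    // deadlock get a backend ejected the same way a crash does.
    func (p *Pool) ReportResult(b *Backend, failed bool) {
        p.mu.Lock()
        defer p.mu.Unlock()
        if !failed {
            b.consecutiveFail = 0
            return
        }
        b.consecutiveFail++
        if b.consecutiveFail >= p.maxConsecutiveFail {
            b.ejectedUntil = time.Now().Add(p.ejectionDuration)
            b.consecutiveFail = 0
        }
    }

    func main() {
        p := &Pool{
            backends:           []*Backend{{Addr: "10.0.0.1:8080"}, {Addr: "10.0.0.2:8080"}},
            maxConsecutiveFail: 5,
            ejectionDuration:   30 * time.Second,
        }
        for i := 0; i < 5; i++ {
            p.ReportResult(p.backends[1], true) // pretend 10.0.0.2 keeps timing out
        }
        fmt.Println(p.Pick().Addr, p.Pick().Addr) // only 10.0.0.1 gets traffic now
    }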
It also seems like part of your pain point is that there is an any-to-any relationship between proxy and backend, but that doesn't necessarily need to be the case: a cell-based architecture with shuffle sharding of backends between cells can help alleviate that fundamental pain. Part of the advantage is that config and code changes can then be rolled out cell by cell, which is much safer; if your code or config causes a fault in a cell, it only affects a subset of the infrastructure. And if you did the shuffle sharding correctly, a single cell going down should have a negligible effect.
Ok, again: this isn't a cluster of load balancers in front of a discrete collection of app servers in a data center. It's thousands of load balancers handling millions of applications scattered all over the world, with instances going up and down constantly.
The interesting part of this problem isn't noticing that an instance is down. Any load balancer can do that. The interesting problem is noticing that and then informing every proxy in the world.
I feel like a lot of what's happening in these threads is people using a mental model that they'd use for hosting one application globally, or, if not one, then a collection of applications they manage. These are customer applications. We can't assume anything about their request semantics.
> The interesting problem is noticing that and then informing every proxy in the world.
Yes, and that is why I suggested that your any-to-any relationship of proxy to application is a decision you have made, and that decision is part of the pain point that led you to this solution. The fact that any proxy box can proxy to any backend is a choice, and that choice created the structure and mental model you are working within. You could batch your proxies into, say, 1024 cells and then assign a customer app to, say, 4 of those 1024 cells using shuffle sharding. That decomposes the problem into maintaining state within a cell instead of globally.
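Concretely, something like this (a toy Go sketch with made-up numbers; a real system would use a stable hash and shuffle so assignments don't move when the standard library changes):

    package main

    import (
        "fmt"
        "hash/fnv"
        "math/rand"
    )

    const (
        numCells    = 1024 // proxy cells in a region (made-up number)
        cellsPerApp = 4    // each customer app is pinned to 4 of them
    )

    // cellsFor deterministically picks cellsPerApp distinct cells for an app:
    // hash the app name, seed a PRNG with it, and take the first few entries
    // of a shuffled list of cell IDs. Every machine computes the same answer
    // without any shared state.
    func cellsFor(app string) []int {
        h := fnv.New64a()
        h.Write([]byte(app))
        r := rand.New(rand.NewSource(int64(h.Sum64())))
        return r.Perm(numCells)[:cellsPerApp]
    }

    func main() {
        // Routing state for an app only has to live in its 4 cells. Losing
        // one cell costs any single app at most a quarter of its cells, and
        // two apps share all 4 cells with probability 1/C(1024,4), i.e.
        // effectively never.
        fmt.Println(cellsFor("customer-app-a"))
        fmt.Println(cellsFor("customer-app-b"))
    }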
I'm not saying what you did was wrong or dumb; I am saying you are working within a framework that maybe you are not even consciously aware of.
Again: it's the premise of the platform. If you're saying "you picked a hard problem to work on", I guess I agree.
We cannot in fact assign our customers' apps to 0.4% of our proxies! When you deploy an app in Chicago on Fly.io, it has to work from a Sydney edge. I mean, that's part of the DX; there are deeper reasons why it would have to work that way (due to BGP4), but we don't even get there before becoming a different platform.
I think the impedance mismatch here is that I am assuming we are talking about a hyperscaler cloud, where it would be reasonable to have, say, 1024 proxies per region. Each app would be assigned to 4 of those 1024 proxies in each region.
I have no idea how big fly.io's compute footprint is, and maybe because of that the design I am suggesting makes no sense for you.
The design you are suggesting makes no sense for us. That's OK! It's an interesting conversation. But no, you can't fix the problem we're trying to solve with shuffle shard.
Out of curiosity, what’s your upper bound latency SLO for propagating this state? (I assume this actually conforms to a percentile histogram and isn’t a single value.)