I'm thrilled to have people digging into this, because I think it's a super interesting problem, but: no, keeping quorum nodes close-enough-but-not-too-close doesn't solve our problem, because we support a unified customer namespace that runs from Tokyo to Sydney to São Paulo to Northern Virginia to London to Frankfurt to Johannesburg.
Two other details that are super important here:
This is a public cloud. There is no real correlation between apps/regions and clients. Clients are public Internet users. When you bring an app up, it just needs to work, for completely random browsers on completely random continents. Users can and do move their instances (or, more likely, reallocate instances) between regions with no notice.
The second detail is that no matter what DX compromise you make to scale global consensus up, you still need reliable realtime update of instances going down. Not knowing about a new instance that just came up isn't that big a deal! You just get less optimal routing for the request. Not knowing that an instance went down is a very big deal: you end up routing requests to dead instances.
The deployment strategy you're describing is in fact what we used to do! We had a Consul cluster in North America and ran the global network off it.
> I'm thrilled to have people digging into this, because I think it's a super interesting problem
Yes, somehow this is a problem all the big companies have, but it seems like there is no standard solution and nobody has open sourced their stuff (except you)!
Taking a step back and thinking about the AWS outage last week, which was caused by a buggy bespoke system built on top of DNS, it seems like we need an IETF standard for service discovery. DNS++ if you will. I have seen lots of (ab)use of DNS for dynamic service discovery, and it seems like we need a better solution which is either push-based or gossip-based to more quickly disseminate service discovery updates.
I work for AWS; opinions are my own and I’m not affiliated with the service team in question.
That a DNS record was deleted is tangential to the proximate cause of the incident. It was a latent bug in the control plane that updated the records, not the data plane. If the discovery protocol were DNS++ or /etc/hosts files, the same problem could have happened.
DNS has a lot of advantages: it’s a dirt cheap protocol to serve (both in terms of bytes over the wire and CPU utilization), is reasonably flexible (new RR types are added as needs warrant), isn’t filtered by middleboxes, has separate positive and negative caching, and server implementations are very robust. If you’re going to replace DNS, you’re going to have a steep hill to climb.
> It was a latent bug in the control plane that updated the records, not the data plane
Yes, I know that. But part of the issue is that the control plane exists in the first place to smooth the impedance mismatch between DNS and how dynamic service discovery works in practice. If we had a protocol which better handled dynamic service discovery, the control plane would be much less complex and less prone to bugs.
As far as I have seen, most cloud providers internally use their own service discovery systems and then layer dns on top of that system for third party clients to access. For example, DynamoDB is registered inside of AWS internal service discovery systems, and then the control plane is responsible for reconciling the service discovery state into DNS (the part which had a bug). If instead we have a standard protocol for service discovery, you can drop that in place of the AWS internal service discovery system and then clients (both internal and external) can directly resolve the DynamoDB backends without needing a DNS intermediary.
I don’t know how AWS or DynamoDB works in practice, but I have worked at other hyperscalers where a similar setup exists (DNS is layered on top of some internal service discovery system).
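To make the shape concrete, here is a rough Go sketch of what such a discovery-to-DNS reconciler loop tends to look like; the Registry and DNSZone interfaces are hypothetical stand-ins, not anything AWS-specific:

    package reconciler

    import (
        "context"
        "log"
        "time"
    )

    // Registry reads healthy endpoints from an internal service-discovery
    // system (hypothetical interface).
    type Registry interface {
        HealthyEndpoints(ctx context.Context, service string) ([]string, error)
    }

    // DNSZone writes the desired A records for a name (hypothetical interface).
    type DNSZone interface {
        SetRecords(ctx context.Context, name string, ips []string) error
    }

    // Reconcile periodically copies discovery state into DNS. This whole
    // loop exists only to translate between the two systems.
    func Reconcile(ctx context.Context, reg Registry, zone DNSZone, service, name string) {
        t := time.NewTicker(30 * time.Second)
        defer t.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-t.C:
                ips, err := reg.HealthyEndpoints(ctx, service)
                if err != nil {
                    log.Printf("registry read failed: %v", err)
                    continue // keep serving the last-known-good records
                }
                if err := zone.SetRecords(ctx, name, ips); err != nil {
                    log.Printf("dns write failed: %v", err)
                }
            }
        }
    }

The layering argument is that this loop only exists to copy state from one discovery system into another, and it is where the interesting failure modes live.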
> If you’re going to replace DNS, you’re going to have a steep hill to climb.
Yes, no doubt. But as we have seen with wireguard, if there is a good idea that has merit it can be quickly adopted into a wide range of operating systems and libraries.
> If instead we have a standard protocol for service discovery, you can drop [reconciliation] in place of the AWS internal service discovery system and then clients (both internal and external) can directly resolve the DynamoDB backends without needing a DNS intermediary.
DNS is a service discovery protocol! And a rather robust one, too. Don’t forget that.
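For what it's worth, a plain SRV lookup already returns host, port, priority, and weight, which is most of what a service-discovery query needs. A minimal Go example (the service name here is made up):

    package main

    import (
        "fmt"
        "log"
        "net"
    )

    func main() {
        // Looks up _api._tcp.example.internal; the name is an example only.
        cname, srvs, err := net.LookupSRV("api", "tcp", "example.internal")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("canonical name:", cname)
        for _, s := range srvs {
            fmt.Printf("%s:%d priority=%d weight=%d\n", s.Target, s.Port, s.Priority, s.Weight)
        }
    }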
AWS doesn’t want to expose to the customer all the dirty details of how internal routing is done. They want to publish a single regional service endpoint, put a SLO on it, and handle all the complexity themselves. Saving unnecessary complexity from customers is, after all, one of the key value propositions of a managed service. It also allows the service provider the flexibility to change the underlying implementation without impacting customer clients.
I’m not sure the best response to “the reconciler had a bug, and other reconcilers might, too” is to replace it with an entirely new and untested service discovery protocol. A proposed compensating control to this bug might be as simple as “if the result would be to delete the zone or empty it of all RRs, halt and page the on-call.” Fail open, as it were.
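A minimal sketch of that guard, with a hypothetical Record type standing in for whatever the reconciler actually manipulates:

    package reconciler

    import "errors"

    // Record is a placeholder for a DNS resource record (hypothetical type).
    type Record struct{ Name, Type, Value string }

    // SafeToApply rejects any plan that would leave a previously populated
    // zone empty, so a buggy planner halts loudly instead of deleting
    // everything it is supposed to be maintaining.
    func SafeToApply(current, desired []Record) error {
        if len(current) > 0 && len(desired) == 0 {
            return errors.New("plan would empty the zone: halt and page on-call")
        }
        return nil
    }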
Also, anyone proposing a new protocol in response to a problem—especially one that had nothing to do with the protocol itself—should probably be burdened with defining and implementing its replacement. ;)
> I’m not sure the best response to “the reconciler had a bug, and other reconcilers might, too” is to replace it with an entirely new and untested service discovery protocol
That is not what I am proposing. The current state is that there are two service discovery layers (DNS and the internal system) with a reconciler between them, and collapsing those into one protocol would simplify the system.
> especially one that had nothing to do with the protocol itself
Part of the problem is the increased system complexity by layering multiple service discovery systems on top of each other.
> A proposed compensating control to this bug might be as simple as “if the result would be to delete the zone or empty it of all RRs, halt and page the on-call.”
You cannot pre-emptively predict all possible bugs and race conditions. How can I create alerts for all of the failure conditions I have not thought of? A better assumption is that all systems will fail, and one of the things you can do to reduce failure rate is to simplify the system. Additionally, you can segment the system into shards/cells and roll out config and code changes serially to each cell to catch issues before they affect 100% of customers.
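The rollout driver for that does not need to be exotic; a rough Go sketch, where Cell is a hypothetical interface over one shard of the infrastructure:

    package rollout

    import (
        "context"
        "fmt"
        "time"
    )

    // Cell is a hypothetical handle on one shard of the infrastructure.
    type Cell interface {
        Deploy(ctx context.Context, version string) error
        Healthy(ctx context.Context) bool
    }

    // Serially pushes a change one cell at a time with a bake period and
    // stops at the first unhealthy cell, so a bad change never reaches the
    // whole fleet at once.
    func Serially(ctx context.Context, cells []Cell, version string, bake time.Duration) error {
        for i, c := range cells {
            if err := c.Deploy(ctx, version); err != nil {
                return fmt.Errorf("cell %d: deploy failed: %w", i, err)
            }
            time.Sleep(bake)
            if !c.Healthy(ctx) {
                return fmt.Errorf("cell %d unhealthy after %s bake: halting rollout", i, bake)
            }
        }
        return nil
    }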
I am not hand waving or yelling at the clouds here. I have worked on service discovery for hyperscalers and have witnessed similar outages where the impedance mismatch between internal service discovery and DNS causes issues.
> You cannot pre-emptively predict all possible bugs and race conditions. How can I create alerts for all of the failure conditions I have not thought of?
You can’t. That’s just life. The electrical and building codes didn’t start as thousand-page tomes, but as we gained experience over the course of countless incidents, the industry recorded those lessons as prescriptions. Every rule was written in blood, as they say, and now practitioners are bound to follow them. We don’t have the same regulatory framework to ensure we build resilient services, but on the other hand, nobody has died or been seriously injured as a consequence of an internet service failure.
> A better assumption is that all systems will fail, and one of the things you can do to reduce failure rate is to simplify the system.
Why not do both? However, some systems have irreducible complexity for good reason, and it is better to see whether that is in fact the case before proposing armchair prescriptions.
> Additionally, you can segment the system into shards/cells and roll out config and code changes serially to each cell to catch issues before they affect 100% of customers.
I was formerly the lead of the AWS Well-Architected reliability pillar. You’re describing an AWS design and operating principle, and many services do just that (I’m not sure about DynamoDB but it would surprise me if they didn’t). However, at the end of the day, there is a single regional service endpoint customers use.
> I am not hand waving or yelling at the clouds here. I have worked on service discovery for hyperscalers and have witnessed similar outages where the impedance mismatch between internal service discovery and DNS causes issues.
Nobody is accusing you of such behavior, but you also haven’t proposed a concretely better solution, and the one you have mentioned in other replies (Envoy xDS) isn’t fit for purpose. It might work fine in the context of a Kubernetes cluster, but it’s certainly not appropriate for Internet-scale service discovery or the planetary-scale edge service fabric that fly.io is building.
I'm nodding my head to this but have to call out that DNS with "interesting" RRs is extensively filtered by middleboxes --- just none of the middleboxes AWS would deploy or allow to be deployed anywhere it peers.
> you still need reliable realtime update of instances going down
The way I have seen this implemented is through a cluster of service watchers that ping every service once every X seconds and deregister a service when its pings fail.
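Something like this Go sketch; Deregister is a hypothetical hook into whatever registry holds the entry, and a real deployment would shard addresses across a watcher cluster and hit a proper health endpoint rather than just the TCP port:

    package watcher

    import (
        "context"
        "net"
        "time"
    )

    // Deregister removes a service entry from the registry (hypothetical hook).
    type Deregister func(addr string)

    // Watch probes one address every interval and deregisters it after
    // maxFailures consecutive failed probes.
    func Watch(ctx context.Context, addr string, interval time.Duration, maxFailures int, dereg Deregister) {
        failures := 0
        t := time.NewTicker(interval)
        defer t.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-t.C:
                conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
                if err != nil {
                    failures++
                    if failures >= maxFailures {
                        dereg(addr)
                        return
                    }
                    continue
                }
                conn.Close()
                failures = 0 // healthy again; reset the streak
            }
        }
    }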
Additionally you can use gRPC with keepalives, which will detect on the client side when a service goes down and automatically remove it from the subset. gRPC also has client-side outlier detection, so clients can automatically remove slow servers from the subset as well. This only works for gRPC though, so it's not generally useful if you are creating a cloud for HTTP servers…
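For reference, the client-side keepalive part is just dial options; a Go sketch with example timings and a made-up target (the server has to permit pings this frequent via its own keepalive enforcement policy, or it will drop the connection):

    package main

    import (
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        "google.golang.org/grpc/keepalive"
    )

    func main() {
        // Example target and timings only.
        conn, err := grpc.NewClient("dns:///backend.example.internal:4000",
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            grpc.WithKeepaliveParams(keepalive.ClientParameters{
                Time:                10 * time.Second, // ping after 10s of inactivity
                Timeout:             3 * time.Second,  // treat the server as dead if no ack in 3s
                PermitWithoutStream: true,             // keep probing even with no active RPCs
            }),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
    }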
Detecting that the service went down is easy. Notifying every proxy in the fleet that it's down is not. Every proxy in the fleet cannot directly probe every application on the platform.
I believe it is possible within Envoy to detect a bad backend and automatically remove it from the load balancing pool, so why can the proxy not determine that certain backend instances are unavailable and remove them from the pool? No coordination is needed, and it also handles other cases where the backend is bad, such as overload or deadlock.
It also seems like part of your pain point is that there is an any-to-any relationship between proxy and backend, but that doesn’t necessarily need to be the case; a cell-based architecture with shuffle sharding of backends between cells can help alleviate that fundamental pain. Part of the advantage of this is that config and code changes can then be rolled out cell by cell, which is much safer: if your code/configs cause a fault in a cell, it will only affect a subset of infrastructure. And if you did the shuffle sharding correctly, a single cell going down should have a negligible effect.
Ok, again: this isn't a cluster of load balancers in front of a discrete collection of app servers in a data center. It's thousands of load balancers handling millions of applications scattered all over the world, with instances going up and down constantly.
The interesting part of this problem isn't noticing that an instance is down. Any load balancer can do that. The interesting problem is noticing that and then informing every proxy in the world.
I feel like a lot of what's happening in these threads is people using a mental model that they'd use for hosting one application globally, or, if not one, then a collection of applications they manage. These are customer applications. We can't assume anything about their request semantics.
> The interesting problem is noticing that and then informing every proxy in the world.
Yes, and that is why I suggested that your any-to-any relationship of proxy to application is a decision you have made, and it is part of the pain point that led you to this solution. The fact that any proxy box can proxy to any backend is a choice, and that choice created the structure and mental model you are working within. You could batch your proxies into say 1024 cells and then assign a customer app to say 4/1024 cells using shuffle sharding. Then that decomposes the problem into maintaining state within a cell instead of globally.
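A sketch of the assignment function in Go, assuming 1024 cells and 4 per app as in the example; it is deterministic, so every node computes the same answer with no coordination or lookup:

    package shard

    import (
        "crypto/sha256"
        "encoding/binary"
        "math/rand"
        "sort"
    )

    // CellsFor picks k of n cells for an app by seeding a PRNG from the app
    // ID, so the assignment is stable and computable anywhere.
    func CellsFor(appID string, n, k int) []int {
        sum := sha256.Sum256([]byte(appID))
        seed := int64(binary.BigEndian.Uint64(sum[:8]))
        r := rand.New(rand.NewSource(seed))
        picked := append([]int(nil), r.Perm(n)[:k]...)
        sort.Ints(picked)
        return picked
    }

CellsFor("some-app", 1024, 4) always returns the same four cells, and two different apps only very rarely share all four, which is the shuffle-sharding blast-radius argument.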
I'm not saying what you did was wrong or dumb; I am saying you are working within a framework that maybe you are not even consciously aware of.
Again: it's the premise of the platform. If you're saying "you picked a hard problem to work on", I guess I agree.
We cannot in fact assign our customers' apps to 0.3% of our proxies! When you deploy an app in Chicago on Fly.io, it has to work from a Sydney edge. I mean, that's part of the DX; there are deeper reasons why it would have to work that way (due to BGP4), but we don't even get there before becoming a different platform.
I think the impedance mismatch here is I am assuming we are talking about a hyperscaler cloud where it would be reasonable to have say 1024 proxies per region. Each app would be assigned to 4/1024 proxies in each region.
I have no idea how big of a compute footprint fly.io is, and maybe due to that the design I am suggesting makes no sense for you.
The design you are suggesting makes no sense for us. That's OK! It's an interesting conversation. But no, you can't fix the problem we're trying to solve with shuffle shard.
Out of curiosity, what’s your upper bound latency SLO for propagating this state? (I assume this actually conforms to a percentile histogram and isn’t a single value.)
(Hopping in here because the discussion is interesting... feel very free to ignore.)
Thanks for writing this up! It was a very interesting read about a part of networking that I don't get to seriously touch.
That said: I'm sure you guys have thought about this a lot and that I'm just missing something, but "why can't every proxy probe every [worker, not application]?" was exactly one of the questions I had while reading.
Having the workers being the source-of-truth about applications is a nicely resilient design, and bruteforcing the problem by having, say 10k proxies each retrieve the state of 10k workers every second... may not be obviously impossible? Somewhat similar to sending/serving 10k DNS requests/s/worker? That's not trivial, but maybe not _that_ hard? (You've been working on modern Linux servers a lot more than I, but I'm thinking of e.g. https://blog.cloudflare.com/how-to-receive-a-million-packets...)
I did notice the sentence about "saturating our uplinks", but... assuming 1KB=8Kb of compressed critical state per worker, you'd end up with a peak bandwidth demand of about 80 Mbps of data per worker / per proxy; that may not be obviously impossible? (One could reduce _average_ bandwidth a lot by having the proxies mostly send some kind of "send changes since <...>" or "send all data unless its hash is <...>" query.)
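A sketch of what the "send changes since <version>" pull could look like, with hypothetical types; the point is that steady-state polls return mostly empty deltas, so average bandwidth sits far below the full-state worst case above:

    package deltasync

    import "context"

    // Instance is one routing-table entry (hypothetical type).
    type Instance struct {
        AppID   string
        Addr    string
        Healthy bool
    }

    // Delta is a worker's answer to "what changed since version v?".
    type Delta struct {
        Version int64      // worker's current state version
        Changed []Instance // entries added or updated since v
        Removed []string   // addresses gone since v
    }

    // Worker is the hypothetical per-worker state API a proxy would poll.
    type Worker interface {
        ChangesSince(ctx context.Context, version int64) (Delta, error)
    }

    // Pull advances one proxy's view of one worker by a single poll.
    func Pull(ctx context.Context, w Worker, last int64, apply func(Delta)) (int64, error) {
        d, err := w.ChangesSince(ctx, last)
        if err != nil {
            return last, err
        }
        apply(d)
        return d.Version, nil
    }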
(Obviously, bruteforcing the routing table does not get you out of doing _something_ more clever than that to tell the proxies about new workers joining/leaving the pool, and probably a hundred other tasks that I'm missing; but, as you imply, not all tasks are equally timing-critical.)
The other question I had while reading was why you need one failure/replication domain (originally, one global; soon, one per-region); if you shard worker state over 100 gossip (SWIM/Corrosion) instances, obviously your proxies do need to join every sharded instance to build the global routing table - but bugs in replication per se should only take down 1/100th of your fleet, which would hit fewer customers (and, depending on the exact bug, may mean that customers with some redundancy and/or autoscaling stay up.) This wouldn't have helped in your exact case - perfectly replicating something that takes down your proxies - but might make a crash-stop of your consensus-ish protocol more tolerable?
Both of the questions above might lead to a less convenient programming model, which might be enough reason on its own to scupper it; an article isn't necessarily improved by discussing every possible alternative; and again, I'm sure you guys have thought about this a lot more than I did (and/or that I got a couple of things embarrassingly wrong). But, well, if you happen to be willing to entertain my questions I would appreciate it!
(I used to work at Fly, specifically on the proxy so my info may be slightly out of date, but I've spent a lot of time thinking about this stuff.)
> why can't every proxy probe every [worker, not application]?
There are several divergent issues with this approach (though it can have its place). First, you still need _some_ service discovery to tell you where the nodes are, though it's easy to assume this can be solved via some consul-esque system. Secondly, there is a lot more data at play here than you might think. A single proxy/host might have many thousands of VMs under its purview. That works out to a lot of data. As you point out, there are ways to solve this:
> One could reduce _average_ bandwidth a lot by having the proxies mostly send some kind of "send changes since <...>" or "send all data unless its hash is <...>" query.
This is definitely an improvement. But we have a new issue. Let's say I have proxies A, B, and C. A and C lose connectivity. Optimally (and in fact Fly has several mechanisms for this) A could send its traffic to C via B. But in this case it might not even know that there is a VM candidate on C at all! It wasn't able to sync data for a while.
There are ways to solve this! We could make it possible for proxies to relay each other's state. To recap:

- We have workers that poll each other
- They exchange diffs rather than the full state
- The state diffs can be relayed by other proxies
We have in practice invented something quite close to a gossip protocol! If we continued drawing the rest of the owl you might end up with something like SWIM.
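A toy Go sketch of that shape, with hypothetical types: probes piggyback a bounded number of versioned updates, and every node re-broadcasts whatever was news to it, so state about C can still reach A via B. SWIM adds the failure-detection half (indirect probes, suspicion timeouts) on top of this idea.

    package gossip

    // Update is one piece of state being disseminated (hypothetical type).
    type Update struct {
        Key     string // e.g. a VM address
        Version uint64 // higher version wins
        Alive   bool
    }

    // Probe is what a node sends when it pings a peer; it piggybacks a
    // bounded number of recent updates so state spreads even between nodes
    // that never talk directly.
    type Probe struct {
        From      string
        Piggyback []Update
    }

    // Merge applies incoming updates to local state and returns the ones
    // that were news to us, which we keep re-broadcasting for a while.
    func Merge(local map[string]Update, incoming []Update) []Update {
        var fresh []Update
        for _, u := range incoming {
            if cur, ok := local[u.Key]; !ok || u.Version > cur.Version {
                local[u.Key] = u
                fresh = append(fresh, u)
            }
        }
        return fresh
    }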
As for your second question, I think you kinda got it exactly right. A crash of a single corrosion does not generally affect anything else. But if something bad is replicated, or there is a gossip storm, isolating that failure is important.
Thanks a lot for your response!
Hold up, I sniped Dov into answering this instead of me. :)