> I'm thrilled to have people digging into this, because I think it's a super interesting problem
Yes, somehow this is a problem all the big companies have, but it seems like there is no standard solution and nobody has open sourced their stuff (except you)!
Taking a step back and thinking about the AWS outage last week, which was caused by a buggy bespoke system built on top of DNS, it seems like we need an IETF standard for service discovery. DNS++, if you will. I have seen lots of (ab)use of DNS for dynamic service discovery, and it seems like we need a better solution that is either push-based or gossip-based so that service discovery updates can be disseminated more quickly.
I work for AWS; opinions are my own and I’m not affiliated with the service team in question.
That a DNS record was deleted is tangential to the proximate cause of the incident. It was a latent bug in the control plane that updated the records, not the data plane. If the discovery protocol were DNS++ or /etc/hosts files, the same problem could have happened.
DNS has a lot of advantages: it’s a dirt cheap protocol to serve (both in terms of bytes over the wire and CPU utilization), is reasonably flexible (new RR types are added as needs warrant), isn’t filtered by middleboxes, has separate positive and negative caching, and server implementations are very robust. If you’re going to replace DNS, you’re going to have a steep hill to climb.
> It was a latent bug in the control plane that updated the records, not the data plane
Yes, I know that. But part of the issue is that the control plane exists in the first place to smooth over the impedance mismatch between DNS and how dynamic service discovery works in practice. If we had a protocol that better handled dynamic service discovery, the control plane would be much less complex and less prone to bugs.
As far as I have seen, most cloud providers internally use their own service discovery systems and then layer DNS on top of those systems for third-party clients. For example, DynamoDB is registered inside AWS's internal service discovery system, and the control plane is responsible for reconciling the service discovery state into DNS (the part which had a bug). If instead we have a standard protocol for service discovery, you can drop that in place of the AWS internal service discovery system and then clients (both internal and external) can directly resolve the DynamoDB backends without needing a DNS intermediary.
I don’t know how AWS or DynamoDB works in practice, but I have worked at other hyperscalers where a similar setup exists (DNS is layered on top of some internal service discovery system).
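Roughly, the pattern I have seen looks like the sketch below. The function names (lookup_backends, replace_a_records) are made up for illustration; they are not AWS's or any real provider's APIs.

```python
# Rough sketch of the layering described above: an internal service discovery
# view gets reconciled into public DNS by a separate control plane.
import time


def lookup_backends(service: str) -> list[str]:
    """Stand-in for a query against the internal service discovery system."""
    return ["192.0.2.10", "192.0.2.11"]  # healthy frontends per the registry


def replace_a_records(zone: str, name: str, ips: list[str]) -> None:
    """Stand-in for the DNS provider's record-update API."""
    print(f"{name}.{zone} A -> {ips}")


def reconcile(zone: str, name: str, service: str) -> None:
    # This projection, from a rich and fast-changing membership view down to
    # a flat RRset, is where the extra control plane (and its bugs) lives.
    replace_a_records(zone, name, lookup_backends(service))


if __name__ == "__main__":
    while True:
        reconcile("example.com", "dynamodb.us-east-1", "dynamodb-frontend")
        time.sleep(30)
```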
> If you’re going to replace DNS, you’re going to have a steep hill to climb.
Yes, no doubt. But as we have seen with WireGuard, if a good idea has merit, it can be adopted quickly into a wide range of operating systems and libraries.
> If instead we have a standard protocol for service discovery, you can drop [reconciliation] in place of the AWS internal service discovery system and then clients (both internal and external) can directly resolve the DynamoDB backends without needing a DNS intermediary.
DNS is a service discovery protocol! And a rather robust one, too. Don’t forget that.
AWS doesn’t want to expose to the customer all the dirty details of how internal routing is done. They want to publish a single regional service endpoint, put an SLO on it, and handle all the complexity themselves. Shielding customers from unnecessary complexity is, after all, one of the key value propositions of a managed service. It also gives the service provider the flexibility to change the underlying implementation without impacting customer clients.
I’m not sure the best response to “the reconciler had a bug, and other reconcilers might, too” is to replace it with an entirely new and untested service discovery protocol. A compensating control for this bug might be as simple as “if the result would be to delete the zone or empty it of all RRs, halt and page the on-call.” Fail open, as it were.
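A minimal sketch of that guard, assuming a hypothetical reconciler with a plan/apply step; page_on_call and replace_records are stand-ins for whatever the real system uses, not actual APIs.

```python
def page_on_call(message: str) -> None:
    print(f"PAGE: {message}")  # stand-in for the real paging integration


def replace_records(zone: str, records: list[dict]) -> None:
    print(f"applying {len(records)} records to {zone}")  # stand-in for the DNS update API


def apply_plan(zone: str, desired: list[dict]) -> None:
    # Guard: a plan that would empty the zone is almost certainly an upstream
    # bug, so halt and get a human involved instead of applying it.
    if not desired:
        page_on_call(f"reconciler plan would empty zone {zone}; halting")
        raise RuntimeError(f"refusing to delete all records in {zone}")
    replace_records(zone, desired)
```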
Also, anyone proposing a new protocol in response to a problem—especially one that had nothing to do with the protocol itself—should probably be burdened with defining and implementing its replacement. ;)
> I’m not sure the best response to “the reconciler had a bug, and other reconcilers might, too” is to replace it with an entirely new and untested service discovery protocol
That is not what I am proposing. The current state is that there are two service discovery layers (DNS and the internal system) with a reconciler between them; collapsing those into one protocol would simplify the system.
> especially one that had nothing to do with the protocol itself
Part of the problem is the increased system complexity that comes from layering multiple service discovery systems on top of each other.
> A proposed compensating control to this bug might be as simple as “if the result would be to delete the zone or empty it of all RRs, halt and page the on-call.”
You cannot pre-emptively predict all possible bugs and race conditions. How can I create alerts for all of the failure conditions I have not thought of? A better assumption is that all systems will fail, and one of the things you can do to reduce failure rate is to simplify the system. Additionally, you can segment the system into shards/cells and roll out config and code changes serially to each cell to catch issues before they affect 100% of customers.
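As a rough sketch of what I mean by serial rollouts: the Cell type and its methods below are hypothetical, just to show the shape of it.

```python
# Illustrative only: a bad change stops at the first unhealthy cell instead
# of reaching 100% of customers.
from dataclasses import dataclass


@dataclass
class Cell:
    name: str

    def apply(self, change: str) -> None:
        print(f"{self.name}: applied {change}")

    def rollback(self, change: str) -> None:
        print(f"{self.name}: rolled back {change}")

    def is_healthy(self) -> bool:
        return True  # stand-in for real bake-time health checks


def roll_out(change: str, cells: list[Cell]) -> None:
    # Apply to one cell at a time, verifying health before moving on.
    for cell in cells:
        cell.apply(change)
        if not cell.is_healthy():
            cell.rollback(change)
            raise RuntimeError(f"{change} failed in {cell.name}; halting rollout")


if __name__ == "__main__":
    roll_out("config-v42", [Cell("cell-1"), Cell("cell-2"), Cell("cell-3")])
```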
I am not hand waving or yelling at the clouds here. I have worked on service discovery for hyperscalers and have witnessed similar outages where the impedance mismatch between internal service discovery and DNS caused issues.
> You cannot pre-emptively predict all possible bugs and race conditions. How can I create alerts for all of the failure conditions I have not thought of?
You can’t. That’s just life. The electrical and building codes didn’t start as thousand-page tomes, but as we gained experience over the course of countless incidents, the industry recorded those lessons as prescriptions. Every rule was written in blood, as they say, and now practitioners are bound to follow them. We don’t have the same regulatory framework to ensure we build resilient services, but on the other hand, nobody has died or been seriously injured as a consequence of an internet service failure.
> A better assumption is that all systems will fail, and one of the things you can do to reduce failure rate is to simplify the system.
Why not do both? However, some systems have irreducible complexity for good reason, and it is better to see whether that is in fact the case before proposing armchair prescriptions.
> Additionally, you can segment the system into shards/cells and roll out config and code changes serially to each cell to catch issues before they affect 100% of customers.
I was formerly the lead of the AWS Well-Architected reliability pillar. You’re describing an AWS design and operating principle, and many services do just that (I’m not sure about DynamoDB but it would surprise me if they didn’t). However, at the end of the day, there is a single regional service endpoint customers use.
> I am not hand waving or yelling at the clouds here. I have worked on service discovery for hyperscalers and have witnessed similar outages where the impedance mismatch between internal service discovery and DNS caused issues.
Nobody is accusing you of such behavior, but you also haven’t proposed a concretely better solution, and the one you have mentioned in other replies (Envoy xDS) isn’t fit for purpose. It might work fine in the context of a Kubernetes cluster, but it’s certainly not appropriate for Internet-scale service discovery or the planetary-scale edge service fabric that fly.io is building.
I'm nodding my head to this, but I have to call out that DNS with "interesting" RRs is extensively filtered by middleboxes; just none of the middleboxes AWS would deploy or allow to be deployed anywhere it peers.