Hacker News

(Tailscalar here) We're taking this kind of outage very seriously. In particular, this outage meant newly connected devices couldn't reliably reach our control plane and so couldn't get the latest network state. IMO that's not okay.

One of Tailscale's fundamental promises is to keep our control plane and infrastructure out of your connectivity paths as much as possible, while still using our infra to "assist" when there are connectivity issues (like difficult-to-traverse NAT), to maintain trust across the network, and to keep everything up to date.

It's a tough balance, and this year we're dedicating resources to making sure even small blips in our control plane don't mean temporary losses of connectivity, even for your newly woken-up devices. We're taking a multi-pronged approach right now: we're working in parallel to increase client tolerance of control outages (in response to cracks shown in this incident), and we have an ongoing effort to make the control plane more resilient and available.