I wonder how they fix things when Claude is down.

I would bet that they have inference setup for internal use on a separate system from the customer-facing production environment. The same way telemetry infrastructure needs to be run separate from normal production systems, so you aren't "blind" when you need it most.

Based on this outage: not very well.

This is ( or will be in the future ) a surprisingly relevant issue

maybe they ask a secondary agentic system to fix it. will that be the future of “redundancy”?

"Gemini, fix my Claude infra"

lol probably use their dev or qa Claude environment to fix prod