AFAIK everything I use has timeouts, retries, and some way of throwing up its hands and turning things back to me.

I use several providers interchangeably.

I stay away from overly complex distributed systems and use the simplest thing possible.

I plan to wait for some guys in China to train a model on traces that I can run locally, benefitting from their national “diffusion” strategy and lack of access to bleeding-edge chips.

I’m not worried.