Imagine if OpenAI weren't locked into Azure's stack.

A lot of people seem to think multi-cloud is an unrealistic dream. But using the best-in-class primitives available in each cloud is not an unreasonable thing to do.

Yes, and OpenAI has enough financial resources to build a bespoke abstraction layer with multiple performant, provider-specific implementations.

Regardless of whether they bring in the Kubernetes complexity.

(Internal codename: Goober Yetis.)

Why would they create a bespoke abstraction layer instead of just relying on k8s?

There is only pain on the path of recreating it: it will end up almost as complex as k8s, and it will be hell to hire and train for. Best to just use something battle-tested with a large pool of people already trained on it. Even better: their own LLM has gobbled up all the content there is about k8s, so it can help their engineers. K8s' complexity arose for reasons discovered while growing the stack, reasons anyone building a bespoke equivalent would likely run into as well. And it's pretty modular, since you can pick and choose the parts you actually need for your cluster (swap the scheduler, skip the cloud-controller-manager, run a stripped-down distro).

Wasting manpower to recreate a bespoke Kubernetes doesn't sound great for a company burning billions per quarter; it's just more waste.

I tried to expressly say that Kubernetes could be used.

And, given their unusual needs and scale, there will probably be some kind of bespoke abstraction, whether it's an SDK or a document that says "this is the subset of things you should be using, and how to use them," so that we can practically deploy our very unusual setup across different providers and customer facilities.

OpenAI has the resources to define that abstraction, and make it work well across multiple providers.
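For concreteness, here's a minimal sketch of the shape such a layer could take. Every name here is invented for illustration; none of this is a real OpenAI or cloud SDK API. The point is just that the shared surface stays deliberately small while each backend is free to use provider-specific primitives:

```python
# Hypothetical sketch of a thin cross-provider abstraction.
# All names are invented; nothing here is a real OpenAI or cloud SDK API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class GpuPool:
    provider: str
    region: str
    gpu_type: str   # e.g. "H100"
    node_count: int


class ComputeProvider(ABC):
    """The deliberately small surface each provider must implement."""

    @abstractmethod
    def provision(self, gpu_type: str, node_count: int) -> GpuPool: ...

    @abstractmethod
    def teardown(self, pool: GpuPool) -> None: ...


class AzureBackend(ComputeProvider):
    # Each backend is free to use provider-specific, performant
    # primitives (VM SKUs, InfiniBand placement, ...) behind the
    # shared interface.
    def provision(self, gpu_type: str, node_count: int) -> GpuPool:
        # ... the actual provider API calls would go here ...
        return GpuPool("azure", "eastus", gpu_type, node_count)

    def teardown(self, pool: GpuPool) -> None:
        pass  # ... release the capacity ...


pool = AzureBackend().provision("H100", 1024)
print(pool)
```

Whether that lives in code or only in a deployment guide, the effect is the same: the subset is defined once and implemented per provider.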

> And, given their unusual needs and scale, there will probably be some kind of bespoke abstraction, whether it's an SDK or a document that says "this is the subset of things you should be using, and how to use them," so that we can practically deploy our very unusual setup across different providers and customer facilities.

Now I understand what you meant. I wouldn't have classified this as a bespoke solution, since it's more of an operational guide plus tooling; most places I've worked at running k8s had something similar (a tech radar / best-practices doc of sorts). When I read "bespoke" I thought "custom built," which I don't think would be the right approach; running k8s and customising the control plane is quite common.
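That "approved subset" usually gets enforced mechanically too, e.g. with a validating admission webhook or a policy engine like OPA or Kyverno. A toy sketch of the kind of check involved (registry names invented, Pod dict simplified):

```python
# Rough sketch of the check a validating admission webhook might run to
# enforce an approved subset. Registry names and the Pod dict shape are
# simplified examples, not a real cluster's policy.
APPROVED_REGISTRIES = ("registry.internal/", "mirror.internal/")

def validate_pod(pod: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for an incoming Pod spec."""
    containers = pod.get("spec", {}).get("containers", [])
    for container in containers:
        image = container.get("image", "")
        if not image.startswith(APPROVED_REGISTRIES):
            return False, f"image {image!r} is not from an approved registry"
    return True, "ok"

# A webhook server would run this per AdmissionReview and return the
# verdict to the kube-apiserver.
print(validate_pod({"spec": {"containers": [{"image": "docker.io/nginx"}]}}))
# (False, "image 'docker.io/nginx' is not from an approved registry")
```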

I trust that OpenAI engineers are smart enough to build a replacement for Kubernetes. I also think they are smart enough to build better, more innovative desks, tables & chairs. But neither of those would be a good use of their time.

There's a reason company vision & mission statements exist. OpenAI's mission is not to build the next k8s, but to build better AI models.

OpenAI is locked into _someone's_ compute resources... whichever one is cheapest. AFAIK OpenAI doesn't have much (any?) hardware of their own. With the mega-clouds buying up all the GPUs and building datacentres, you have to 'partner' with someone, most likely the one that gives you the biggest discounts. The amount of compute OpenAI needs dwarfs almost any other consideration.

OpenAI is absolutely not locked into the Azure stack. https://en.m.wikipedia.org/wiki/Stargate_LLC

All reasonably large organizations for which cloud costs are a large portion of expenses have an abstraction layer to switch between providers. Otherwise it's impossible to negotiate better deals: you can't play multiple cloud providers against each other for a better rate.
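A toy illustration of that leverage (provider names and rates invented): without the abstraction you can't even produce this comparison, and with it, the cheapest quote is a function call.

```python
# Toy illustration: once workloads are portable, provider choice reduces
# to a price comparison. All names and rates are invented placeholders,
# not real quotes.
hourly_node_rates = {  # $/hr per 8-GPU node
    "provider_a": {"H100": 98.30, "A100": 36.70},
    "provider_b": {"H100": 91.50, "A100": 32.10},
    "provider_c": {"H100": 89.90, "A100": 30.40},
}

def best_rate(gpu_type: str) -> tuple[str, float]:
    """Return (provider, $/hr): the quote you bring to the next negotiation."""
    provider = min(hourly_node_rates, key=lambda p: hourly_node_rates[p][gpu_type])
    return provider, hourly_node_rates[provider][gpu_type]

print(best_rate("H100"))  # ('provider_c', 89.9)
```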