The only way to beat CUDA, as in many other API cases, is through middleware.

Many keep forgetting that CUDA has been a polyglot platform for years: C, C++, and Fortran, plus anything that targets PTX, meaning Haskell, Java, C#, Julia, Futhark, or Python bindings, some of which also target OpenCL.

Then there are the libraries and the graphical GPGPU debugging tools.

By the way, Modular just announced partnerships with AWS and NVIDIA for Mojo and related tooling.