Have you ever wondered how much work the PyTorch team would have saved if they could have used just CUDA for every platform they support? If they didn't have to write compatibility layers and abstractions, and could instead focus on the actual problem of training neural networks? What if all the primitives they use from CUDA and cuDNN worked just as well on AMD GPUs, Apple GPUs, and even Google's TPUs as they do on Nvidia GPUs?
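
To make the lock-in concrete, here is a minimal sketch of the kind of primitive in question: a vector-add kernel written against the CUDA runtime. The kernel and names are illustrative, not taken from PyTorch's actual source. The point is that this file compiles only with nvcc and runs only on Nvidia hardware; supporting AMD or Apple GPUs means writing and maintaining a separate HIP or Metal version of the same logic.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// A trivial element-wise add: the kind of primitive deep learning
// frameworks build on. This source is Nvidia-only; an AMD (HIP) or
// Apple (Metal) port is a separate implementation to maintain.
__global__ void vecAdd(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *out;
    // Unified (managed) memory keeps the example short.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&out, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

Multiply that duplication across every kernel a framework ships and you get the compatibility-layer burden described above.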

Mojo and Modular's MAX platform would do for heterogeneous compute what LLVM did for programming language development: just as LLVM let language designers target one IR instead of every CPU architecture, MAX would let ML frameworks target one set of primitives instead of every accelerator. People who dismiss the value proposition here don't know what they're talking about. Modular have already raised $350m+ from industry giants (including Nvidia and Google) to solve this, and I believe they will.