Because most fail to understand what makes CUDA great, and keep only trying to replicate C++ API.
They overlook CUDA is a polyglot ecosystem composed by C, C++ and Fortran as main languages, with Python JIT DSL since this year, compiler infrastructure for any compiler backend that wishes to target it of which there are a few including strange stuff like Haskell, IDE integration with Eclipse and Visual Studio, graphical debugging just like on the CPU.
It is like when Khronos puts out those spaghetti riddled standards, expecting each vendor/open source community, to create some kind of SDK, versus the vertical integration of console devkits and proprietary APIs, and then asking why professional studios have no qualms with proprietary tooling.
Slight correction: CUDA Python JIT has existed for a very long time. Warp is a late comer.
Kind of, none of those are at the integration level of CUTLASS 4, and the new cu tile architecture, introduced at GTC 2025.
But you're right there was already something in place.
I took a closer look at some of that and it’s pretty cool. Definitely neat to have some good higher level abstractions than the old C-style CUDA syntax that Numba was built on.