Slight correction: CUDA Python JIT has existed for a very long time. Warp is a latecomer.

Kind of, but none of those are at the integration level of CUTLASS 4 or the new cuTile architecture introduced at GTC 2025.

But you're right, there was already something in place.

I took a closer look at some of that and it’s pretty cool. Definitely neat to have good abstractions at a higher level than the old C-style CUDA syntax that Numba was built on.