If you don't want or need to program at the lowest possible level, then PyTorch seems the obvious option for AMD support, or maybe Mojo. The Triton compiler would be another option for writing kernels.
I don't think that's something that can be pitched as a CUDA alternative. It's just a different level of abstraction.
Triton, while a compiler, generates code at a lower level than CUDA or ROCm.
The machine code that actually runs on NVIDIA and AMD GPUs is SASS and AMDGCN respectively, and in each case there is also an intermediate representation:
CUDA -> PTX -> SASS
ROCm (HIP) -> LLVM-IR -> AMDGCN
The Triton compiler isn't generating CUDA or ROCm code; it generates its own generic MLIR intermediate representation, which then gets converted into PTX (for NVIDIA) or LLVM-IR (for AMD), with vendor-specific tools doing the final step down to machine code.
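To keep the stages straight, here is a rough Python sketch of the lowering paths as described in this thread. The stage names are informal labels rather than identifiers from any real toolchain, intermediate dialects are left out, and `LOWERING` and `stages_for` are hypothetical names made up for illustration.

```python
# Informal map of the lowering paths described in this thread.
# Keys are (front end, GPU vendor); values are the stages in order.
# Labels are descriptive only and intermediate dialects are omitted.
LOWERING = {
    ("cuda", "nvidia"): ["CUDA", "PTX", "SASS"],
    ("rocm", "amd"): ["ROCm (HIP)", "LLVM-IR", "AMDGCN"],
    ("triton", "nvidia"): ["Triton", "MLIR", "PTX", "SASS"],
    ("triton", "amd"): ["Triton", "MLIR", "LLVM-IR", "AMDGCN"],
}


def stages_for(frontend, vendor):
    """Return the lowering path for a front end and vendor as one string."""
    return " -> ".join(LOWERING[(frontend, vendor)])


def machine_code_for(vendor):
    """Return the final machine-code format(s) that run on a vendor's GPUs."""
    return {path[-1] for (_, v), path in LOWERING.items() if v == vendor}
```

Calling `stages_for("triton", "amd")` shows that Triton joins the vendor pipeline at the intermediate representation instead of emitting CUDA or HIP source.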
If you are interested in efficiency and want to write high-level code, then you might use PyTorch's torch.compile, which in turn generates Triton kernels for you.
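A minimal sketch of that high-level route, assuming PyTorch 2.x is installed. `bias_gelu` and `run_demo` are hypothetical names chosen for illustration. As far as I know, the default backend only emits Triton kernels when the tensors live on a GPU; on CPU it generates C++ instead.

```python
import torch
import torch.nn.functional as F


def bias_gelu(x, bias):
    # Ordinary PyTorch ops; nothing here is vendor-specific.
    return F.gelu(x + bias)


def run_demo():
    # ROCm builds of PyTorch also expose their GPUs through torch.cuda.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(1024, 1024, device=device)
    bias = torch.randn(1024, device=device)

    # Compilation is lazy: kernels are generated on the first call.
    compiled = torch.compile(bias_gelu)

    # Compare the compiled result against eager mode.
    max_diff = (compiled(x, bias) - bias_gelu(x, bias)).abs().max().item()
    return device, max_diff
```

Calling `run_demo()` returns the device used and the largest difference between the compiled and eager results. The same source runs unchanged on NVIDIA and AMD builds, which is the point of working at this level.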
If you really want to squeeze the highest performance out of an NVIDIA GPU, then you would write in PTX assembly rather than CUDA, and for AMD in GCN assembly.