I'm pretty sure it's a political limitation, not a technical one. Implementing it is definitely a pain - it's a mix of hardcore backwards compatibility (i.e. cruft) and a rapidly moving target - but it's also obviously just a lot of carefully chosen ASCII written down in text files.
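To make that concrete, most of that interface surface is header declarations along these lines - a rough sketch, not copied verbatim from cuda_runtime_api.h, which adds extern "C" wrappers, host/device decorations and thousands more entries:

    #include <stddef.h>  /* size_t */

    /* Illustrative subset of the CUDA runtime API surface. The names and
       signatures match the real API; the enum bodies are abbreviated. */
    typedef enum cudaError { cudaSuccess = 0 /* , ... */ } cudaError_t;
    enum cudaMemcpyKind { cudaMemcpyHostToDevice = 1, cudaMemcpyDeviceToHost = 2 /* , ... */ };

    cudaError_t cudaMalloc(void **devPtr, size_t size);
    cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
    cudaError_t cudaFree(void *devPtr);

Reimplementing those entry points on top of someone else's hardware is a lot of work, but it's work against text like the above, not against anything Nvidia can physically withhold.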
The non-Nvidia hardware vendors really don't want CUDA to win. AMD went for open source + collaborative in a big way: OpenCL, then HSA. Both were broadly ignored. I'm not sure what Intel are playing at with SPIR-V - that stack doesn't make any sense to me whatsoever.
CUDA is alright though, in a kind of crufty-obfuscation-over-SSA sense. Way less annoying than OpenCL, certainly. You can run it on AMD GPUs if you want to - https://docs.scale-lang.com/stable/ and https://github.com/vosen/ZLUDA already exist. I'm hacking on SCALE these days.
The other thing worth saying is that everyone speaks vaguely about CUDA's "institutional memory", investment, and so forth.
But the concrete quality of CUDA, and of Nvidia's offerings generally, is a move toward general-purpose parallel computing. Parallel processing is "the future", and the approach of writing a plain loop and having each iteration run in parallel is dead simple.
Which is to say Nvidia has invested a lot in making "easy things easy along with hard things no harder".
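As a sketch of what "just do the loop in parallel" looks like in practice (the array names and sizes here are made up for illustration, not from anything above): the serial loop body becomes a kernel, and each iteration becomes one GPU thread.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale_add(float *out, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's "loop index"
        if (i < n)
            out[i] = 2.0f * a[i] + b[i];                // the former loop body
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *out;
        // Unified memory keeps the sketch short; explicit cudaMemcpy also works.
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;
        int blocks = (n + threads - 1) / threads;       // cover all n iterations
        scale_add<<<blocks, threads>>>(out, a, b, n);   // "run every iteration in parallel"
        cudaDeviceSynchronize();

        printf("out[0] = %f\n", out[0]);                // expect 4.0
        cudaFree(a); cudaFree(b); cudaFree(out);
        return 0;
    }

That's roughly the whole on-ramp: take the loop you already had, give each iteration an index, launch.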
In contrast, other chip makers seem acculturated to the natural lock-in of a dumb, convoluted interface, trusting that a given chip's raw performance will compensate for it.