I find it odd that given the billions of dollars involved, no competitor has managed to replicate the functions of CUDA.

Is it that hard to do, or is the software lock-in so great?

The problem is that CUDA is tightly integrated with NVIDIA hardware. You don't just have to replicate CUDA (which is a lot of tedious work at best), but you also need the hardware to run your "I can't believe it's not CUDA".

I'm pretty sure it's a political limitation, not a technical one. Implementing it is definitely a pain - it's a mix of hardcore backwards compatibility (i.e. cruft) and a rapidly moving target - but it's also obviously just a lot of carefully chosen ascii written down in text files.

The non-nvidia hardware vendors really don't want cuda to win. AMD went for open source + collaborative in a big way, opencl then hsa. Both broadly ignored. I'm not sure what Intel are playing at with spirv - that stack doesn't make any sense to me whatsoever.

Cuda is alright though, in a kind of crufty obfuscation over SSA sense. Way less annoying than opencl certainly. You can run it on amdgpu hardware if you want to - https://docs.scale-lang.com/stable/ and https://github.com/vosen/ZLUDA already exist. I'm hacking on scale these days.

The thing that's also worth saying is that everyone speaks vaguely about CUDA's "institutional memory" and investment and so forth.

But the concrete quality of CUDA and Nvidia's offerings generally is a move toward general-purpose parallel computing. Parallel processing is "the future", and the approach of just writing a loop and having each iteration run in parallel is dead simple.

Which is to say Nvidia has invested a lot in making "easy things easy along with hard things no harder".
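
As a rough sketch of that "write a loop, make each iteration parallel" model, here's a minimal CUDA example (nothing vendor-library specific, just the core programming model):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread handles one iteration of what would otherwise be a serial loop:
    //   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged((void**)&x, n * sizeof(float));  // unified memory keeps the sketch short
        cudaMallocManaged((void**)&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // One thread per loop iteration.
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x);
        cudaFree(y);
    }

That's more or less the whole pitch: the kernel body is the loop body, and the launch configuration is the loop bounds.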

In contrast, other chip makers seem to be acculturated to the natural lock-in of having a dumb, convoluted interface to compensate for a given chip being high performance.

CUDA does involve a massive investment for Nvidia. It's not that it's impossible to replicate the functionality. But once a company has replicated that functionality, that company basically is going to be selling at competitive prices, which isn't a formula for high profits.

Notably, AMD funded a CUDA clone, ZLUDA, and then quashed it[1]. Comments at the time here involved a lot of "they would always be playing catch up".

I think the mentality of chip makers generally is that they'd rather control a small slice of a market than fight competitively for a large slice. It makes sense in that they invest years in advance and expect those investments to pay high profits.

[1] https://www.tomshardware.com/pc-components/gpus/amd-asks-dev...

Cuda isn't a massive investment, it's 20 years worth of institutional knowledge with a stable external api. There are very few companies outside of 00s Microsoft who have managed to support 20 years worth of backward compatibility along with the bleeding edge.

> Cuda isn't a massive investment

> it's 20 years worth of institutional knowledge with a stable external api

> There are very few companies outside of 00s Microsoft who have managed to support 20 years worth of backward compatibility along with the bleeding edge.

To me that sounds like massive investment

ZLUDA was quashed due to concerns about infringement/violating terms of use.

That was the story, but the legality of cloning an API/ABI/etc. is well established by, for example, Google v. Oracle (though with gotchas that might let Nvidia put up a legal fight).

Because most fail to understand what makes CUDA great, and keep trying to replicate only the C++ API.

They overlook that CUDA is a polyglot ecosystem with C, C++ and Fortran as its main languages, a Python JIT DSL since this year, compiler infrastructure for any compiler backend that wishes to target it (of which there are a few, including strange stuff like Haskell), IDE integration with Eclipse and Visual Studio, and graphical debugging just like on the CPU.

It is like when Khronos puts out those spaghetti-riddled standards, expecting each vendor or open source community to create some kind of SDK, versus the vertical integration of console devkits and proprietary APIs, and then wondering why professional studios have no qualms with proprietary tooling.

Slight correction: CUDA Python JIT has existed for a very long time. Warp is a latecomer.

Kind of, but none of those are at the integration level of CUTLASS 4 and the new cuTile architecture introduced at GTC 2025.

But you're right there was already something in place.

I took a closer look at some of that and it’s pretty cool. Definitely neat to have some good higher level abstractions than the old C-style CUDA syntax that Numba was built on.

A better question is why there is no stronger push for a nicer GPU language that's not tied to any particular GPU and covers any GPU use case (whether compute or graphics).

I mean efforts like rust-gpu: https://github.com/Rust-GPU/rust-gpu/

Combine such language with Vulkan (using Rust as well) and why would you need CUDA?

Mojo might be what you are looking for: https://docs.modular.com/mojo/manual/gpu/intro-tutorial/

The language is general, but the current focus is really on programming GPUs.

I think Intel Fortran has some ability to offload to their GPUs now. And Nvidia has some stuff to run(?) CUDA from Fortran.

Probably just needs a couple short decades of refinement…

One of the reasons CUDA won over OpenCL was that NVidia, contrary to Khronos, saw value in helping those HPC researchers move their Fortran code onto the GPU.

Hence they bought PGI, and improved their compiler.

Intel eventually did the same with oneAPI (which isn't plain OpenCL, but rather an extension with Intel goodies).

I was on a Khronos webinar where the panel showed disbelief that anyone would care about Fortran, oh well.

It's insane how big the NVidia dev kit is. They've got a library for everything. It seems like they have as broad software support as possible.

That’s actually pretty surprising to me. Of course, there are always jokes about Fortran being some language that people don’t realize is still kicking. But I’d expect a standards group that is at least parallel computing adjacent to know that it is still around.

Yet not only did they joke about Fortran, it took CUDA's adoption success for them to take C++ seriously and come up with SPIR as a counterpoint to PTX.

Which in the end was worthless because both Intel and AMD botched all OpenCL 2.x efforts.

Hence why OpenCL 3.0 is basically OpenCL 1.2 rebranded, and SYCL went its own way.

It took a commercial company, Codeplay, a former compiler vendor for game consoles, to actually come up with good tooling for SYCL.

Which Intel, in the middle of extending SYCL with their Data Parallel C++, eventually acquired.

Those products are the foundation of oneAPI, and naturally go beyond what barebones OpenCL happens to be.

Khronos's mismanagement of OpenCL is one of the reasons Apple cut ties with Khronos.

I like Julia for this. Pretty language, layered on LLVM like most things. Modular are doing interesting things with Mojo too. People seem to like cuda though.

CUDA is just DOA as a nice language, being Nvidia-only (not counting efforts like ZLUDA).

That's a compiler problem. One could start from clang -xcuda and hack onwards. Or work in the intersection of CUDA and HIP, which is relatively broad if a bit of a porting nuisance.
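
To make the "intersection of CUDA and HIP" idea concrete, here's a minimal sketch of a kernel that builds with either nvcc or hipcc by aliasing a handful of runtime calls (the gpu* names are made-up shims for illustration, not any real portability layer):

    // __global__, threadIdx/blockIdx and the <<<...>>> launch syntax are all in
    // the common subset both nvcc and hipcc accept; only the runtime API names
    // differ, so a few #defines cover the gap.
    #ifdef __HIPCC__
      #include <hip/hip_runtime.h>
      #define gpuMalloc              hipMalloc
      #define gpuMemcpy              hipMemcpy
      #define gpuMemcpyHostToDevice  hipMemcpyHostToDevice
      #define gpuMemcpyDeviceToHost  hipMemcpyDeviceToHost
      #define gpuDeviceSynchronize   hipDeviceSynchronize
      #define gpuFree                hipFree
    #else
      #include <cuda_runtime.h>
      #define gpuMalloc              cudaMalloc
      #define gpuMemcpy              cudaMemcpy
      #define gpuMemcpyHostToDevice  cudaMemcpyHostToDevice
      #define gpuMemcpyDeviceToHost  cudaMemcpyDeviceToHost
      #define gpuDeviceSynchronize   cudaDeviceSynchronize
      #define gpuFree                cudaFree
    #endif

    #include <cstdio>
    #include <vector>

    __global__ void scale(float* v, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= s;
    }

    int main() {
        const int n = 1024;
        std::vector<float> h(n, 1.0f);
        float* d = nullptr;
        gpuMalloc((void**)&d, n * sizeof(float));
        gpuMemcpy(d, h.data(), n * sizeof(float), gpuMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, 3.0f, n);
        gpuDeviceSynchronize();
        gpuMemcpy(h.data(), d, n * sizeof(float), gpuMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]);  // expect 3.0
        gpuFree(d);
    }

Anything outside that common subset (warp-level intrinsics, cooperative groups, the library ecosystem) is where the porting nuisance actually starts.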

May be, but who is working on that compiler? And the whole ecosystem is controlled by a nasty company. You don't want to deal with that.

Besides, I'd say Rust is a nicer language than CUDA dialects.

Chris and Nick originally, a few more of us these days. Spectral Compute. We might have a nicer world if people had backed OpenCL instead of CUDA, but whatever. Likewise Rust has a serious edge over C++. But to the compiler hacker, this is all obfuscated SSA form anyway; it's hard to get too emotional about the variations.

Until Rust gets into any of the industry compute standards, being a nicer language alone doesn't help.

Khronos standards, CUDA, ROCm, oneAPI, Metal: none of them has Rust in their sights.

The world did not back OpenCL because it was stuck on primitive C99 text-based tooling, without an ecosystem.

Also Google decided to push their Renderscript C99 dialect instead, while Intel and AMD were busy delivering janky tools and broken drivers.

That's simply not true, because standards should operate at the IR level, not the language level. You have to generate some IR from your language, and at that level it makes sense to talk about standards. The only exception is probably WebGPU, where Apple pushed for using a fixed language instead of an IR, which was a limiting idea.

None of those standards are about IR.

Also, SPIR worked so great for OpenCL 2.x that Khronos rebooted the whole mess back to OpenCL 1.x with the OpenCL 3.0 rebranding.

They are pretty much about IR when it comes to language interchange. SPIR-V is explicitly an IR that can be targeted from a lot of different languages.

And so far not much has been happening, hence the Shading Languages Symposium at Vulkanised 2026.

https://www.khronos.org/events/shading-languages-symposium-2...

These kinds of projects are exactly where it's happening.

The language would matter more for those who actually want to write programs in it. So I'd say rust-gpu is something that should get more backing.

Tooling and ecosystem, that is why.

Rust has great tooling and ecosystem. The point here is more about the interest of those who want better alternatives to CUDA. AMD would be an obvious beneficiary of backing the above, so I'm surprised by the lack of interest from the likes of them.

It has zero CUDA tooling, which is what is relevant when positioning itself as an alternative to C, C++, Fortran, the Python JIT, PTX-based compilers, compute libraries, Visual Studio and Eclipse integration, and a graphical debugger.

Cross compiling Rust into PTX is not enough to make researchers leave CUDA.

And CUDA has zero non-CUDA tooling. That's a pointless circular argument which doesn't mean anything. Rust has Rust tooling and it's very good.

Being language agnostic is also not the task of the language, but the task of the IR. There are already a bunch of languages, such as Slang. The point is to use Rust itself for it.

Where is the graphical debugging experience for Rust, given that it has such great tooling?

Slang came from NVidia, and was nicely given to Khronos, because almost everyone had started relying on HLSL, given that Khronos decided not to spend any additional resources on GLSL.

Just like with Mantle and Vulkan, it seems Khronos hasn't been able to produce anything meaningful without external help since the Longs Peak days.

Cuda is so many things I'm not sure it is even possible to replicate it.

It is hard to do in the sense that it requires very good taste in programming languages, which in turn requires really listening to the customers, and that requires a huge number of skilled people. And no one has really invested that much money into their software ecosystem yet.

[deleted]

Vulkan is at 95% of CUDA performance already. The remaining 5% is CUDA's small dispatch logic.

The reason why people continue to use CUDA and Pytorch and so on is because they are literally too stupid and too lazy to do it any other way

With zero tooling, hence why no one cares about Vulkan, other than Valve and Google.

What tooling do you need? I'll make it for you for free

Great, let's start with a Fortran compiler like CUDA has.

When you're done, you can create IDE plugins, and a graphical debugger with feature parity to Nsight.

Ok, that's a good retort. How many months of work do those things save you, compared to actually solving the problem you want to solve without those tools?

The argument you are making sounds to me like "well, good luck making a Vulkan application without cmake, ninja, meson, git, visual studio, clion", etc., when in reality a 5-line bash script calling gcc works just fine.

Wrong analogy. You have no idea how wrong you are. Just look at the difference in performance analysis tools for AMD and Nvidia for GPUs. Nvidia makes it simple for people to write GPU programs.

I do have an idea of how wrong I am.

Nvidia's own people are the ones who have made Vulkan performance so close to CUDA's. AMD is behind, but the data shows that they're off in performance proportional to the cost of the device. If they implement coop mat 2, then they would bridge the gap.

99.9% of people who use Pytorch and so on could achieve good enough performance using a "simple vulkan backend" for whatever Python stuff they're used to writing. That would strip out millions of lines of code.

The reason nobody has done this outside of a few github projects that Nvidia themselves have contributed to, is because there isn't a whole lot of money in iterative performance gains, when in reality better algorithmic approaches are being invented quite near every month or so.

The first step is to understand why proprietary technology gets adopted.

Lacking that understanding is doomed to failure.