CUDA isn't really used for new code. It's used for legacy codebases.

In the LLM world, you really only see CUDA being used by Triton and/or PyTorch consumers who haven't moved on to greener pastures (mainly because they only know Python and aren't actually programmers).

That said, AMD can run most CUDA code through ROCm's HIP layer, and AMD officially supports Triton and PyTorch, so even the academics have a way out of Nvidia hell.
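
Roughly what "most CUDA code" means in practice, as a toy sketch (made-up kernel, not from any real project): the hipify tools rewrite the runtime calls, and the rest carries over nearly untouched.

    // Vector add in plain CUDA. hipify-clang/hipify-perl rewrites the runtime
    // calls (cudaMallocManaged -> hipMallocManaged, cudaFree -> hipFree, ...);
    // the __global__ kernel and the <<<grid, block>>> launch syntax are the
    // same in HIP.
    #include <cuda_runtime.h>

    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));   // hipMallocManaged
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // same launch syntax in HIP
        cudaDeviceSynchronize();                        // hipDeviceSynchronize

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }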

> CUDA isn't really used for new code.

I don't think this is particularly correct, or at least worded a bit too strongly.

For Nvidia hardware, CUDA simply gives the best performance, and there are many optimized libraries (cuBLAS, cuDNN, NCCL, and so on) that you'd have to replace as well.

Granted, new ML frameworks tend to be more backend-agnostic, but saying that CUDA is no longer being used for new code seems a bit odd.
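
To make the library point concrete, here's a minimal sketch (hypothetical wrapper, column-major convention): a single cuBLAS call site like this is trivial to write, but every one of them has to be found and ported to rocBLAS/hipBLAS or something vendor-neutral if you leave CUDA.

    // C = A * B using cuBLAS (single precision, column-major, no transpose).
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // dA is m x k, dB is k x n, dC is m x n; all are device pointers.
    void sgemm(const float* dA, const float* dB, float* dC, int m, int n, int k) {
        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha, dA, m,   // lda
                    dB, k,           // ldb
                    &beta, dC, m);   // ldc
        cublasDestroy(handle);
    }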

If you're not doing machine code by hand, you're not a programmer

If you are not winding copper around magnets by hand, you are not a real programmer

I get the joke you two are making, but I've seen what academics have written in Python. Somehow, it's worse than what academics used to write when Java was taught as the only language for CompSci degrees.

At least Java has types and can be performant. The world was ever so slightly better back then.

There is some truly execrable Python code out there, but it’s there because the barrier to entry is so low. Especially back in the day, Java had so many guardrails that the really bad Java code came from intermediate programmers pushing up against the limitations of the language rather than from novices pasting garbage into a notebook. As a result there was less of it, but I’m not convinced that’s a good thing.

Edit: my point being that out of a large pool of novices, some of them will get better. Java was always more gatekept.

Second edit: Java’s intermediate programmer malaise was of course fueled by the Gang of Four’s promise to lead them out of confusion and into the blessed realm of reusable software.

What are non-legacy codebases using, then?

Largely Vulkan. Microsoft internally is a huge consumer of DirectML, specifically for the LLM team doing Phi and the Copilot deployment that lives on Azure.

I'm not sure if it's just the implementation, but I tried using llama.cpp on Vulkan and it was much slower than on CUDA.

It is on Nvidia. Nvidia's code generation for Vulkan kind of sucks, and it affects games too. llama.cpp is about as optimal as it can be on the Vulkan target; it uses VK_NV_cooperative_matrix2, and turning that off loses something like 20% of the performance. AMD does not implement this extension yet, and due to a better matrix ALU design, might not actually benefit from it.
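
For context, a rough sketch of what "cooperative matrix" means (nothing here is from llama.cpp's actual kernels): the Vulkan extension exposes the same class of warp-level matrix hardware that CUDA exposes through the wmma API, which is why losing it hurts matmul-heavy LLM workloads so much.

    // One warp computes a 16x16x16 half-precision matrix multiply-accumulate
    // on the tensor cores; Vulkan's cooperative matrix extensions expose the
    // same kind of whole-subgroup matrix operation.
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void wmma_tile(const half* a, const half* b, float* c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, a, 16);
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }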

Game engines with a single code-generation path that supports multiple targets (e.g., Vulkan, DX12, Xbone/XSX DX12, and PS4/5 GNM) show virtually identical performance between the DX12 and Vulkan outputs on Windows on AMD, virtually identical performance in apples-to-apples Xbox-to-PS comparisons (scaled to their relative hardware specs), and the expected DX12 performance, but not the expected Vulkan performance, on Windows on Nvidia.

Now, obviously, that's a rather broad statement: all engines are different, some games on the same engine (especially UE4 and 5) are faster on one API or the other on AMD, or simply faster overall on one vendor, and some games are faster on Xbox than on PS, or vice versa, due to edge cases or porting mistakes. I suggest looking at GamersNexus's benchmarks for specific games, or DigitalFoundry's work on benchmarking and analyzing consoles.

It is in Nvidia's best interest to make Vulkan look bad, but even they are starting to understand that this is a bad strategy, and the compute accelerator market is starting to get a bit crowded, so the Vulkan frontend for their compiler has slowly been getting better.

Such a huge consumer that they deprecated it

sooo what's the successor of CUDA?

CUDA largely was Nvidia's attempt at swaying Khronos and Microsoft's DirectX team. In the end, Khronos went with something based on a blend of AMD's and Nvidia's ideas, which became Vulkan, and Microsoft just duplicated the effort in a Direct3D-flavored way.

So, just use Vulkan and stop fucking around with the Nvidia moat.

A great thing about CUDA is that it doesn't have to deal with any of the graphics and rendering stuff or shader languages. Vulkan compute is way less dev-friendly than CUDA. Not to mention the real benefit of CUDA, which is that it's also a massive ecosystem of libraries and tools.

As much as I wish it were otherwise, Vulkan is nowhere near a good alternative to CUDA currently. Maybe eventually, but not without additions to the core API and, especially, the available libraries.
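
To put the dev-friendliness gap in concrete terms (a sketch, with a made-up kernel): the entire CUDA side of launching a compute kernel fits in a few lines of the same C++ file, while the equivalent Vulkan compute path is mostly setup code.

    // CUDA: the kernel lives next to the host code and is launched directly.
    #include <cuda_runtime.h>

    __global__ void scale(float* data, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    void run(float* d_data, int n) {
        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();
    }

    // The same thing in Vulkan compute (steps only): write the kernel in
    // GLSL/HLSL, compile it to SPIR-V in a separate build step, create a
    // VkInstance, pick a VkPhysicalDevice, create a VkDevice and queue,
    // allocate buffers and memory, set up descriptor set layouts and bind
    // them, build a compute pipeline, record a command buffer, submit it,
    // and wait on a fence.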

ROCm doesn't work on this device

You mean the AI Max chips? ROCm works fine there as long as you're on ROCm 6.4.1 or later, no hacks required. I tested on Fedora Rawhide and it was just dnf install rocm.

Yes it does. ROCm support for new chips, since it's tied to paid support contracts, comes something like 1-2 months after the chip comes out (i.e., once they're 100% sure it works with the current, also new, driver).

I'd rather it work and ship late than ship early, not work, and then have the vendor gaslight me about the bugs (lol Nvidia, why are you like this?)