How is AMD GPU compatibility with leading generative AI workflows? I'm under the impression everything is CUDA.

There is a project called SCALE that allows building CUDA code natively for AMD GPUs. It is designed as a drop-in replacement for Nvidia CUDA, and it is free for personal and educational use.

You can find out more here: https://docs.scale-lang.com/stable/

There are still many things that need implementing, the most important being cuDNN and the CUDA Graph API, but in my opinion the list of things that are supported now is already quite impressive (and keeps improving): https://github.com/spectral-compute/scale-validation/tree/ma...

Disclaimer: I am one of the developers of SCALE.

All of the Ollama and Stable Diffusion-based stuff now works on my AMD cards. Maybe it's different if you want to actually train things, but I no longer have issues running anything that fits in memory.

llama.cpp combined with Mesa’s Vulkan support for AMD GPUs has worked pretty well with everything I’ve thrown it at.

https://llm-tracker.info/_TOORG/Strix-Halo has very comprehensive test results for running llama.cpp with Strix Halo. This one is particularly interesting:

> But when we switch to longer context, we see something interesting happen. WMMA + FA basically loses no performance at this longer context length!

> Vulkan + FA still has better pp but tg is significantly lower. More data points would be better, but seems like Vulkan performance may continue to decrease as context extends while the HIP+rocWMMA backend should perform better.

lhl has also been sharing these test results in https://forum.level1techs.com/t/strix-halo-ryzen-ai-max-395-..., and his latest comment provides a great summary of the current state:

> (What is bad is that basically every single model has a different optimal backend, and most of them have different optimal backends for pp (handling context) vs tg (new text)).

Anyway, for me, the greatest thing about the Strix Halo + llama.cpp combo is that you can throw one or more eGPUs into the mix, as echoed by the Level1Techs video (https://youtu.be/ziZDzrDI7AM?t=485), which should help a lot with pp (prompt processing) performance.

In practical generative AI workflows (LLMs), I think AMD's Ryzen AI Max+ 395 chips with unified memory are as good as Mac Studio or MacBook Pro configurations at handling big models locally with fast inference speeds (although top-end Apple silicon such as the M4 Max or Studio Ultra can reach 546 GB/s of memory bandwidth, while the AMD unified memory system is around 256 GB/s). For inference, I think either will work fine. For everything else, I think the CUDA ecosystem is a better bet (correct me if I'm wrong).
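To make the bandwidth comparison concrete, here is a rough back-of-envelope sketch (a rule of thumb, not a benchmark, and the 40 GB model size is just an illustrative assumption): for single-stream decoding, each generated token has to stream the model weights through memory once, so tg is roughly capped at bandwidth divided by model size.

```python
# Back-of-envelope decode ceiling: tokens/sec <= memory bandwidth / model size,
# assuming a dense model whose weights are read once per generated token.
def decode_upper_bound_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Illustrative 40 GB quantized 70B-class model (assumed size, not a measurement).
for name, bw in [("Ryzen AI Max+ 395 (~256 GB/s)", 256), ("M4 Max (~546 GB/s)", 546)]:
    print(f"{name}: at most ~{decode_upper_bound_tok_s(bw, 40):.1f} tok/s")
```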

My impression is the same. To train anything, you pretty much need CUDA GPUs. For inference, I think AMD and Apple M chips are getting better and better.

For inference, Nvidia/AMD/Intel/Apple are all generally on the same tier now.

There's a post on GitHub from a madman who got llama.cpp generating tokens for a model that's running on an Intel Arc, an Nvidia 3090, and an AMD GPU at the same time. https://github.com/ggml-org/llama.cpp/pull/5321

CUDA isn't really used for new code. It's used for legacy codebases.

In the LLM world, you really only see CUDA being used with Triton and/or PyTorch consumers who haven't moved on to better pastures (mainly because they only know Python and aren't actually programmers).

That said, AMD can run most CUDA code through ROCm, and AMD officially supports Triton and PyTorch, so even the academics have a way out of Nvidia hell.
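For what it's worth, the PyTorch path really is close to drop-in: the ROCm build keeps the torch.cuda API surface, so code written for Nvidia runs unmodified on AMD. A minimal sketch, assuming a ROCm (or CUDA) build of PyTorch is installed:

```python
import torch

# On a ROCm build of PyTorch, the "cuda" device APIs are backed by HIP, so
# code written against torch.cuda runs unmodified on AMD GPUs.
# torch.version.hip is a version string on ROCm builds and None on CUDA builds.
backend = "ROCm/HIP" if getattr(torch.version, "hip", None) else "CUDA"
print(f"torch.cuda.is_available() = {torch.cuda.is_available()} ({backend})")

x = torch.randn(4096, 4096, device="cuda")  # "cuda" maps to the AMD GPU on ROCm
y = (x @ x.T).relu()
print(y.device, y.shape)
```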

> CUDA isn't really used for new code.

I don't think this is particularly correct, or at least it's worded a bit too strongly.

For Nvidia hardware, CUDA just gives the best performance, and there are many optimized libraries that you'd have to replace as well.

Granted, new ML frameworks tend to be more backend-agnostic, but saying that CUDA is no longer being used seems a bit odd.

If you're not doing machine code by hand, you're not a programmer

If you are not winding copper around magnets by hand, you are not a real programmer

I get the joke you two are making, but I've seen what academics have written in Python. Somehow, it's worse than what academics used to write when Java was taught as the only language for CompSci degrees.

At least Java has types and can be performant. The world was ever so slightly better back then.

There is some truly execrable Python code out there, but it’s there because the barrier to entry is so low. Especially back in the day, Java had so many guardrails that the really bad Java code came from intermediate programmers pushing up against the limitations of the language rather than from novices pasting garbage into a notebook. As a result there was less of it, but I’m not convinced that’s a good thing.

Edit: my point being that out of a large pool of novices, some of them will get better. Java was always more gate kept.

Second edit: Java’s intermediate programmer malaise was of course fueled by the Gang of Four’s promise to lead them out of confusion and into the blessed realm of reusable software.

What are non legacy codebases using, then?

Largely Vulkan. Microsoft internally is a huge consumer of DirectML, specifically for the LLM team doing Phi and the Copilot deployment that lives on Azure.

I'm not sure if it's just the implementation, but I tried using llama.cpp on Vulkan and it is much slower than using it on CUDA.

It is on Nvidia. Nvidia's code generation for Vulkan kind of sucks, and it also affects games. llama.cpp is almost as optimal as it can be on the Vulkan target; it uses VK_NV_cooperative_matrix2, and turning that off loses something like 20% performance. AMD does not implement this extension yet, and due to a better matrix ALU design, might not actually benefit from it.

Game engines with a single code-generation path that supports multiple targets (e.g. Vulkan, DX12, Xbone/XSX DX12, and PS4/5 GNM) show virtually identical performance between their DX12 and Vulkan outputs on Windows on AMD, virtually identical performance in apples-to-apples Xbox-to-PS comparisons (scaled to their relative hardware specs), and the expected DX12 performance but not the expected Vulkan performance on Windows on Nvidia.

Now, obviously, that's a rather broad statement: all engines are different, some games on the same engine (especially UE4 and 5) are faster in one API than the other on AMD, or simply faster on one vendor across the board, and some games are faster on Xbox than on PS, or vice versa, due to edge cases or porting mistakes. I suggest looking at GamersNexus's benchmarks for specific games, or DigitalFoundry's work on benchmarking and analyzing consoles.

It is in Nvidia's best interest to make Vulkan look bad, but even now they're starting to understand that is a bad strategy, and the compute accelerator market is starting to become a bit crowded, so the Vulkan frontend for their compiler has slowly been getting better.

Such a huge consumer that they deprecated it

sooo what's the successor of CUDA?

CUDA was largely Nvidia's attempt at swaying Khronos and Microsoft's DirectX team. In the end, Khronos went with something based on a blend of AMD's and Nvidia's ideas, and that became Vulkan, and Microsoft just duplicated the effort in a Direct3D-flavored way.

So, just use Vulkan and stop fucking around with the Nvidia moat.

A great thing about CUDA is that it doesn't have to deal with any of the graphics and rendering stuff or shader languages. Vulkan compute is way less dev friendly than CUDA. Not to mention the real benefit of CUDA which is that it's also a massive ecosystem of libraries and tools.

As much as I wish it were otherwise, Vulkan is nowhere near a good alternative to CUDA currently. Maybe eventually, but not without additions to both the core API and, especially, the available libraries.
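As an illustration of the ecosystem point, here's the sort of thing the CUDA library stack hands you for free. A sketch using CuPy (which targets CUDA, with experimental ROCm builds); there's no comparably mature off-the-shelf equivalent on raw Vulkan compute today:

```python
import cupy as cp

# A few lines of GPU numerics with no shader pipelines, descriptor sets, or
# SPIR-V involved: CuPy dispatches to cuFFT and cuSOLVER under the hood.
x = cp.random.rand(1 << 20).astype(cp.float32)
spectrum = cp.fft.rfft(x)              # FFT via cuFFT
a = cp.random.rand(512, 512)
eigvals = cp.linalg.eigvalsh(a @ a.T)  # eigenvalues via cuSOLVER
print(spectrum.shape, cp.asnumpy(eigvals[:3]))
```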

ROCm doesn't work on this device

You mean the AI Max chips? ROCm works fine there, as long as you're running 6.4.1 or later, no hacks required. I tested on Fedora Rawhide and it was just dnf install rocm.

Yes it does. ROCm support for new chips, since it's something AMD sells paid support contracts for, arrives something like 1-2 months after the chip comes out (i.e. when they're 100% sure it works with the current, also new, driver).

I'd rather it ships late and works than ships early, doesn't work, and then I get gaslit about the bugs (lol Nvidia, why are you like this?)

> I'm under the impression everything is CUDA

A very quick Google search would show that pretty much everything also runs on ROCm.

Torch runs on CUDA and ROCm. Llama.cpp runs on CUDA, ROCm, SYCL, Vulkan and others...
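And on the llama.cpp side, the backend is picked when the library is compiled, so the calling code looks the same whether CUDA, ROCm/HIP, SYCL, Vulkan, or Metal is doing the work underneath. A rough sketch via the llama-cpp-python bindings; the model path and generation settings are placeholders:

```python
from llama_cpp import Llama

# Which GPU backend runs the model (CUDA, ROCm/HIP, SYCL, Vulkan, Metal, or CPU)
# is decided when llama.cpp itself is built; this Python code doesn't change.
llm = Llama(
    model_path="models/example-7b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to whatever GPU backend was compiled in
    n_ctx=4096,
)
out = llm("Explain ROCm in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```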

Certain chips can work with useful local models, but compatibility is far behind CUDA.

Indeed, recent Flash Attention is a pain point for non-CUDA hardware.