I'm not sure if it's just the implementation, but I tried using llama.cpp on Vulkan and it is much slower than using it on CUDA.
It is on Nvidia. Nvidia's code generation for Vulkan kind of sucks, and it affects games too. llama.cpp's Vulkan backend is about as optimal as it can be on that target; it uses VK_NV_cooperative_matrix2, and turning that off loses something like 20% performance. AMD does not implement this extension yet, and due to better matrix ALU design, might not actually benefit from it.
Game engines with a single code-generation path that supports multiple targets (e.g. Vulkan, DX12, Xbox One/Series X DX12, and PS4/5 GNM) show virtually identical performance between the DX12 and Vulkan outputs on Windows on AMD, virtually identical performance in apples-to-apples Xbox-to-PlayStation comparisons (scaled to their relative hardware specs), and the expected DX12 performance but worse Vulkan performance on Windows on Nvidia.
Now, obviously, that's a rather broad statement: all engines are different, some games on the same engine (especially UE4 and UE5) are faster on one API than the other on AMD, or simply faster overall on one vendor, and some games are faster on Xbox than on PlayStation, or vice versa, due to edge cases or porting mistakes. I suggest looking at GamersNexus's benchmarks for specific games, and DigitalFoundry's work on benchmarking and analyzing consoles.
It is in Nvidia's best interest to make Vulkan look bad, but even they are starting to understand that this is a bad strategy, and with the compute accelerator market becoming a bit crowded, the Vulkan frontend for their compiler has slowly been getting better.