Vulkan is at 95% of CUDA performance already. The remaining 5% is CUDA's lower overhead on small dispatches.

People continue to use CUDA and PyTorch and so on because they are literally too stupid and too lazy to do it any other way.

With zero tooling, which is why no one cares about Vulkan other than Valve and Google.

What tooling do you need? I'll make it for you for free.

Great, let's start with a Fortran compiler, like CUDA has.

When you're done, you can create IDE plugins and a graphical debugger with feature parity to Nsight.

Ok, that's a good retort. How many months of work do those things save you, compared to actually solving the problem you want to solve without those tools?

The argument you are making sounds to me like "well, good luck making a Vulkan application without CMake, Ninja, Meson, Git, Visual Studio, CLion," etc., when in reality a five-line bash script calling gcc works just fine.
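For illustration, here is a minimal sketch of the kind of build script being described. The file names (`main.c`, `shader.comp`) and the output name are assumptions, and it presumes the Vulkan SDK is installed so that `glslc` and the `-lvulkan` loader library are available:

```shell
#!/bin/sh
# Hypothetical no-build-system build for a small Vulkan demo.
set -e
# Compile the GLSL compute shader to SPIR-V (glslc ships with the Vulkan SDK).
glslc shader.comp -o shader.spv
# Compile and link the host program against the Vulkan loader.
gcc -O2 -Wall -o demo main.c -lvulkan
```

No CMake, Ninja, or IDE project files involved; whether this scales to a large codebase is a separate question.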

Wrong analogy. You have no idea how wrong you are. Just look at the difference between AMD's and Nvidia's performance-analysis tools for GPUs. Nvidia makes it simple for people to write GPU programs.

I do have an idea of how wrong I am.

Nvidia's own people are the ones who have made Vulkan performance so close to CUDA's. AMD is behind, but the data shows their performance deficit is roughly proportional to the cost of the device. If they implement cooperative matrix 2, they would bridge the gap.

99.9% of people who use PyTorch and the like could achieve good-enough performance using a "simple Vulkan backend" for whatever Python code they're used to writing. That would strip out millions of lines of code.

The reason nobody has done this, outside of a few GitHub projects that Nvidia themselves have contributed to, is that there isn't much money in incremental performance gains when better algorithmic approaches are being invented nearly every month.

The first step is to understand why proprietary technology gets adoption.

Without that understanding, any effort is doomed to failure.