There are companies whose whole job right now is to optimize kernels so that things run faster. I wonder whether those companies are going to be dethroned by some open source library that does this really well (I bet Nvidia could release one any day), or whether they're going to thrive and be acquired by the big providers as a "moat" to speed up their inference.

Near-term acquihires are a likely bet, I think. But given model progress on related benchmarks like KernelBench [1], I do think a set of more commoditized solutions is also inevitable.

The caveat, though, is that each new generation of hardware often comes with brand-new constraints and features that the current generation of models hasn't seen before (e.g. tcgen05 in Blackwell was out of distribution at one point). As the models start to generalize better, this might not be a showstopper, but it is still an issue, at least currently.

[1] https://kernelbench.com/

When you run CUDA at scale, dealing with Nvidia driver and library bugs takes up a disgustingly large percentage of engineer time. I don't know a lot of people who would look forward to relying on more Nvidia libraries.

Fair point, but are there alternatives that aren't CUDA-locked?

Is there an issue board for these bugs? I want to see what counts as a disgustingly large percentage. 50%?

Probably not, because the specifics of the workload - exact parameters, representation of data in memory, value ranges etc - lead you to highly divergent optimization strategies.

Shouldn't it be possible to run this as an ML autoresearch-style project? I.e., orchestrate 10 strategies to speed it up, run them in parallel, pick the winner, and go from there?
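
To make the "try several strategies, keep the fastest" idea concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the toy workload (sum of squares), the candidate names, and the `evaluate` / `pick_winner` helpers are illustrations, not any real kernel-tuning tool. It checks each candidate against a reference result first, so a fast but wrong strategy can't win.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def reference(xs):
    """Slow but trusted implementation used as the correctness oracle."""
    total = 0
    for x in xs:
        total += x * x
    return total


# Hypothetical candidate "strategies" for the same toy workload.
def strategy_generator(xs):
    return sum(x * x for x in xs)


def strategy_map(xs):
    return sum(map(lambda x: x * x, xs))


def strategy_wrong(xs):
    return sum(xs)  # deliberately incorrect, should be rejected


CANDIDATES = {
    "generator": strategy_generator,
    "map": strategy_map,
    "wrong": strategy_wrong,
}


def evaluate(name, fn, xs, expected, repeats=5):
    """Return (name, best_seconds), or (name, None) if the output is wrong."""
    if fn(xs) != expected:
        return name, None
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(xs)
        best = min(best, time.perf_counter() - start)
    return name, best


def pick_winner(candidates, xs):
    """Evaluate all candidates concurrently and return the fastest correct one."""
    expected = reference(xs)
    # Assumption: threads are enough for a sketch. Concurrent runs perturb
    # each other's timings, so a real setup would isolate each benchmark.
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(evaluate, name, fn, xs, expected)
            for name, fn in candidates.items()
        ]
        results = [f.result() for f in futures]
    valid = {name: t for name, t in results if t is not None}
    if not valid:
        raise ValueError("no candidate produced a correct result")
    return min(valid, key=valid.get), valid


if __name__ == "__main__":
    winner, timings = pick_winner(CANDIDATES, list(range(10_000)))
    print("winner:", winner)
```

Note that the search only ever picks among the candidates it was handed, and the winner is specific to the input it was timed on.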

You are assuming all problems in the world are solvable by one of "10 strategies".