I disagree. Yes, the global state is bad, but pipelines, render passes, and worst of all static bind groups and layouts are by no means better. Why would I need to create bindGroups and bindGroup layouts for storage buffers? They're just buffers and references to them, so let me make the draw call and pass references to the SSBOs as arguments, rather than having to first create expensive bindings that then need to be cached precisely because they're expensive.
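To make the complaint concrete, here's roughly the ceremony WebGPU wants before a draw call can even see a single storage buffer. This is only a sketch: the function name and parameters are made up, and `device`, `pipeline`, and `pass` are assumed to come from the usual adapter/device/render-pass setup.

```ts
// Sketch of the WebGPU boilerplate needed to expose one storage buffer to a draw.
// Assumes the pipeline's layout for group 0 matches the layout created below
// (or that you'd fetch it via pipeline.getBindGroupLayout(0) with layout: "auto").
function drawWithStorageBuffer(
  device: GPUDevice,
  pipeline: GPURenderPipeline,
  pass: GPURenderPassEncoder,
  ssbo: GPUBuffer, // created with GPUBufferUsage.STORAGE
) {
  // Step 1: describe how the buffer will be bound.
  const bindGroupLayout = device.createBindGroupLayout({
    entries: [{
      binding: 0,
      visibility: GPUShaderStage.VERTEX | GPUShaderStage.FRAGMENT,
      buffer: { type: "read-only-storage" },
    }],
  });

  // Step 2: tie the actual buffer to that layout.
  const bindGroup = device.createBindGroup({
    layout: bindGroupLayout,
    entries: [{ binding: 0, resource: { buffer: ssbo } }],
  });

  // Step 3: only now can the draw call see the buffer. In practice the objects
  // above get cached and reused, because creating them per draw is too costly.
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.draw(3); // e.g. a fullscreen triangle
}
```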
Also, compute shaders could easily have been added to WebGL, making WebGL pretty much on par with WebGPU, just 7 years earlier. It didn't happen because WebGPU was supposed to be a better replacement, which it never became. It just became something different-but-not-better.
If CUDA forced you to do even half of the completely unnecessary stuff that Vulkan forces you to do, it would never have become as popular as it is.
I agree with you that there's a better programming model out there. But using a buffer in a CUDA kernel is the simple case. Descriptors exist to bind general-purpose work to fixed-function hardware, and it gets much more complicated once we start talking about texture sampling. CUDA isn't exactly great here either: kernel launches are more heavyweight than calling draw precisely because they defer some things, like validation, to the call site. Making descriptors explicit is verbose and annoying, but it keeps resource switching front of mind, which is a big concern for workloads that lean heavily on those fixed-function resources. The ultimate solution here is bindless, but that obviously presents its own problems for a nice general-purpose API, since you need to know all your resources up front. I do think CUDA is probably ideal for many users, but there are still trade-offs.
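For the texture case specifically, here's a rough WebGPU-flavoured sketch of what the descriptor actually has to spell out for the fixed-function sampling hardware, compared with "just a pointer" for a buffer. The function name and the particular filter/address-mode choices are purely illustrative; `device` and `texture` are assumed to already exist.

```ts
// Illustrative only: a texture binding isn't just a reference, it carries
// filtering, addressing, view dimension, and sample type for the sampler unit.
function makeTextureBindGroup(device: GPUDevice, texture: GPUTexture) {
  // Sampler state consumed by fixed-function hardware.
  const sampler = device.createSampler({
    magFilter: "linear",
    minFilter: "linear",
    mipmapFilter: "linear",
    addressModeU: "repeat",
    addressModeV: "repeat",
  });

  // The layout declares how the shader will interpret each binding.
  const layout = device.createBindGroupLayout({
    entries: [
      { binding: 0, visibility: GPUShaderStage.FRAGMENT, sampler: { type: "filtering" } },
      { binding: 1, visibility: GPUShaderStage.FRAGMENT, texture: { sampleType: "float", viewDimension: "2d" } },
    ],
  });

  // The bind group ties the concrete sampler and texture view to that layout.
  return device.createBindGroup({
    layout,
    entries: [
      { binding: 0, resource: sampler },
      { binding: 1, resource: texture.createView() },
    ],
  });
}
```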
It didn't happen because of Google; Intel did the work to make it happen.