First - nice writeup which goes into a lot of nooks and crannies.
That said, a lot of the user-space "voodoo" is gone if you don't go through CUDA's "runtime API". If you use the driver API, take your kernel source as a string and compile it with NVIDIA's run-time compiler, you'll have better visibility into a lot (not all) of what's going on. For the "raw" version of this, look at:
https://github.com/NVIDIA/cuda-samples/tree/master/cpp/0_Int...
but for a much more readable, and still fully transparent modern-C++ API version of the same, try this:
https://github.com/eyalroz/cuda-api-wrappers/blob/master/exa...
that's a sample program for my CUDA API wrappers (header-only) library.
I like the driver API because it allows treating Cuda kernels like hot-reloadable shaders. It's fun to develop while being able to change the code at runtime.
> I like the driver API because it allows treating Cuda kernels like hot-reloadable shaders.
It is also much more friendly for library authors; and easier to wrap; and actually exposes a bunch of features the "runtime API" doesn't.
The difficulty with it is that there just so many API calls; dozens of calls just for copying, for example. That was part of my motivation for writing my wrappers - making the supposedly "lower-level" API more accessible and intuitive than the supposedly "higher-level" API; and better integrated with the other libraries: NVTX, NVRTC, PTX compiler, fatbin library etc.
> It's fun to develop while being able to change the code at runtime.
It's also _the_ way to debug your kernels: If you don't load them dynamically, you have to recompile your application or kernel test harness every time you make a change to the kernel.