Why is there a new kernel driver here at all? It appears that all it does is allocate system RAM ("DDR4") and export it as a dmabuf for import into CUDA as mapped external memory. Then a userspace shim hijacks APIs to use that allocation when GPU memory is full. CUDA already supports allocating mapped system memory directly, so AFAICT this could be implemented in the userspace shim alone, with no new kernel driver.
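For reference, the existing API I mean is `cudaHostAlloc` with the `cudaHostAllocMapped` flag. A minimal sketch (error handling omitted; requires an NVIDIA GPU and toolkit):

```cuda
// Sketch: CUDA can already map ordinary system RAM into the GPU's
// address space with no custom kernel driver involved.
#include <cuda_runtime.h>

int main(void) {
    size_t bytes = 1ull << 30;  // 1 GiB of plain system RAM
    void *host_ptr = NULL;

    // Page-locked, mapped host allocation: the GPU can access this
    // memory directly over PCIe, no dmabuf export/import required.
    cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped);

    // Device-side pointer aliasing the same system memory. On UVA
    // platforms (all modern ones) this equals host_ptr.
    void *dev_ptr = NULL;
    cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);

    // Kernels can now read/write dev_ptr as if it were device memory.
    cudaFreeHost(host_ptr);
    return 0;
}
```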

Also, as other commenters have mentioned, redirecting allocations to managed memory would enable similar oversubscription.
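That route is just as simple from userspace: `cudaMallocManaged` lets you allocate more than the GPU physically has, and the driver migrates pages between device and system RAM on demand (a sketch, again without error handling):

```cuda
// Sketch: managed-memory oversubscription. On Pascal+ GPUs under Linux,
// managed allocations can exceed physical GPU memory; the driver pages
// data in and out transparently on access.
#include <cuda_runtime.h>

int main(void) {
    // E.g. 64 GiB on a 24 GiB card.
    size_t bytes = 64ull << 30;
    float *buf = NULL;
    cudaMallocManaged(&buf, bytes);

    // Optional hints steer placement/migration explicitly,
    // here preferring device 0 as the home location.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, 0);

    cudaFree(buf);
    return 0;
}
```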

And the hijack approach only makes sense if the goal is to get this behavior with zero app changes; the same thing could be done with minor app changes (e.g. PyTorch has a pluggable allocator interface). App changes would also enable intentionally placing specific allocations.
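As a sketch of that app-change route, here is the C side of PyTorch's `torch.cuda.memory.CUDAPluggableAllocator` interface. The function names and the spill-to-host fallback policy here are my own illustration, not anything from the project under discussion:

```cuda
// Compile to a shared library and load from Python with, e.g.:
//   torch.cuda.memory.CUDAPluggableAllocator("alloc.so", "my_malloc", "my_free")
// followed by torch.cuda.memory.change_current_allocator(...).
#include <cuda_runtime.h>
#include <sys/types.h>

extern "C" void *my_malloc(ssize_t size, int device, cudaStream_t stream) {
    void *ptr = NULL;
    // Try device memory first...
    if (cudaMalloc(&ptr, size) == cudaSuccess)
        return ptr;
    // ...and spill to mapped system RAM when the GPU is full.
    // On UVA platforms the host and device pointers coincide.
    cudaGetLastError();  // clear the out-of-memory error
    cudaHostAlloc(&ptr, size, cudaHostAllocMapped);
    return ptr;
}

extern "C" void my_free(void *ptr, ssize_t size, int device, cudaStream_t stream) {
    // Distinguish which allocator owns ptr before freeing it.
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) == cudaSuccess &&
        attr.type == cudaMemoryTypeHost)
        cudaFreeHost(ptr);
    else
        cudaFree(ptr);
}
```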

My impression is that this is vibe-coded from beginning to end, starting from a design that only makes sense if you are hallucinating.

Maybe there's a significant latency advantage to doing it this way?

Or, as you said, it keeps things backwards compatible for apps that aren't being regularly updated.