This is awesome! Normally, offloading layers to CPU RAM means the compute for those layers happens on the CPU instead of the GPU, and the CPU is orders of magnitude slower at this kind of work.
With this approach the compute stays on the GPU, with the tradeoff that the layers held in RAM have to be streamed back and forth over PCIe via DMA. It seems to me this should be a speedup versus splitting compute between GPU and CPU: the gain depends on how many layers would otherwise have run on the CPU, minus the cost of moving those layers between RAM and the GPU.
What's slower? Compute on the CPU or moving data from RAM to GPU through PCI-DMA?
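A back-of-envelope sketch may help here. All constants below are illustrative assumptions (layer size, PCIe bandwidth, DRAM bandwidth, sustained FLOP rates), not measurements from any particular system. The key observation it captures: at batch size 1 (token-by-token decoding), both paths are bandwidth-bound, streaming weights to the GPU is limited by PCIe, and CPU compute is limited by DRAM bandwidth, which is typically higher than PCIe, so CPU compute can actually win. At large batch sizes (e.g. prompt processing), the same weight transfer is amortized over many tokens and streaming to the GPU wins.

```python
# Rough per-layer timing model. All numbers are assumed, not measured.
LAYER_BYTES = 400e6   # fp16 weight bytes per layer (assumed)
PCIE_BW     = 25e9    # effective PCIe 4.0 x16 bandwidth, bytes/s (assumed)
DRAM_BW     = 60e9    # CPU memory bandwidth, bytes/s (assumed)
CPU_FLOPS   = 1e12    # sustained CPU matmul throughput, FLOP/s (assumed)
GPU_FLOPS   = 50e12   # sustained GPU matmul throughput, FLOP/s (assumed)

def per_layer_times(batch):
    """Return (stream-to-GPU time, CPU-compute time) in seconds for one layer."""
    params = LAYER_BYTES / 2           # fp16: 2 bytes per weight
    flops = 2 * params * batch         # ~2 FLOPs per weight per token
    # Option A: DMA the weights to the GPU once, then compute there.
    stream = LAYER_BYTES / PCIE_BW + flops / GPU_FLOPS
    # Option B: compute on the CPU; bound by whichever is slower,
    # streaming weights from DRAM or doing the arithmetic.
    cpu = max(LAYER_BYTES / DRAM_BW, flops / CPU_FLOPS)
    return stream, cpu

for batch in (1, 32, 512):
    stream, cpu = per_layer_times(batch)
    print(f"batch {batch:4d}: stream-to-GPU {stream*1e3:6.2f} ms, CPU {cpu*1e3:6.2f} ms")
```

So under these assumptions the answer to "which is slower" is "it depends on batch size": for single-token generation the PCIe transfer is the slower path, while for large batches the CPU compute is.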