DMA works for devices, because the device does the memory access. RAM to RAM DMA would need something to do the accesses.
The other reason DMA works for devices is that it is asynchronous. You give a device a command and some memory to work with; it does the job and lets you know when it's done. Most devices can't complete commands instantaneously, so we already know we have to queue the work and go do something else in the meantime. With memcpy, we often want to use the copied memory immediately. If the copy were a DMA, you'd have to submit the request and wait for it to complete before continuing. And if your general-purpose DMA engine is a typical device, you're probably making a syscall into the kernel, which submits the command (possibly through a queue), suspends your process, and schedules something else; there may then be a delay before you get scheduled again once the DMA is complete.
If an async memcpy is what you want, it could make sense, but that feels pretty hard to use.
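To make the usability point concrete, here is a rough sketch contrasting a plain memcpy with a submit-then-wait copy. Everything here is made up for illustration: `struct copy_req`, `copy_submit`, `copy_poll` and `copy_demo` are hypothetical names, not any real driver or kernel API, and the "engine" is just a function that moves one chunk per poll so the example runs without hardware or threads.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one queued copy; not a real driver API. */
struct copy_req {
    void *dst;
    const void *src;
    size_t len;
    size_t done; /* bytes copied so far */
};

/* Submit only records the request; no bytes move yet. */
void copy_submit(struct copy_req *r, void *dst, const void *src, size_t len)
{
    r->dst = dst;
    r->src = src;
    r->len = len;
    r->done = 0;
}

/* Stand-in for the engine making progress: copies up to `chunk` bytes.
 * Returns 1 once the whole request has completed, 0 otherwise. */
int copy_poll(struct copy_req *r, size_t chunk)
{
    size_t left = r->len - r->done;
    size_t n = left < chunk ? left : chunk;

    memcpy((char *)r->dst + r->done, (const char *)r->src + r->done, n);
    r->done += n;
    return r->done == r->len;
}

/* Runs both styles of copy and returns how many polls the async one took. */
int copy_demo(void)
{
    unsigned char src[64], sync_dst[64], async_dst[64];
    struct copy_req r;
    int polls = 0;

    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)i;
    memset(async_dst, 0, sizeof async_dst);

    /* Synchronous: the destination is usable on the very next line. */
    memcpy(sync_dst, src, sizeof src);
    assert(sync_dst[63] == 63);

    /* Asynchronous: after submit, the destination is not valid yet. */
    copy_submit(&r, async_dst, src, sizeof src);
    assert(async_dst[63] == 0);

    /* The caller must wait (or do other work) until completion. */
    do {
        polls++;
    } while (!copy_poll(&r, 16));

    assert(memcmp(async_dst, src, sizeof src) == 0);
    return polls;
}
```

The awkward part is the gap between `copy_submit` and the last `copy_poll`: the caller has to find something useful to do there, or it has just built a slower memcpy. A real engine would signal completion with an interrupt or a completion queue rather than being polled like this.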
> DMA works for devices, because the device does the memory access. RAM to RAM DMA would need something to do the accesses.
Isn't a blitter exactly that sort of device? Assuming that it can access the relevant RAM, why couldn't that be used for general-purpose memory copying operations?
Yes, but PCs have only rarely had general-purpose blitters. Some video cards integrated one, but that's more or less like DMA. Intel had one for a while recently [1]; FreeBSD loads a driver for it on my hosted Xeon L5640 server, but I don't see any evidence that anything actually uses it. I'm not sure offloading copies ever delivered enough of a performance improvement, which may be why Intel stopped including these. Linux marked its driver as broken because it caused issues with copy-on-write [2].
[1] https://lwn.net/Articles/162966/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...