BTW, if we can copy data between some device and RAM efficiently using DMA, without spending CPU cycles, why can't we use DMA to copy RAM-to-RAM?
DMA works for devices, because the device does the memory access. RAM to RAM DMA would need something to do the accesses.
The other reason DMA works for devices is that it is asynchronous. You give a device a command and some memory to work with, it does the thing and lets you know. Most devices can't complete commands instantaneously, so we already expect to queue things and then go do something else. But when doing memcpy, we usually want to use the copied memory immediately; if it were a DMA, you'd need to submit the request and wait for it to complete before you continued. And if your general-purpose DMA engine is a typical device, you're probably doing a syscall into the kernel, which would submit the command (possibly through a queue), suspend your process, and schedule something else, with some delay before you get scheduled again once the DMA is complete.
If async memcpy was what was wanted, it could make sense, but that feels pretty hard to use.
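To make the submit/wait shape concrete, here is a minimal sketch of what an async "DMA memcpy" might look like from the caller's side. The dma_* functions are hypothetical, stubbed here with a worker thread standing in for the copy engine:

    /* Sketch of the submit/wait pattern an async "DMA memcpy" would force
       on callers. The dma_* names are hypothetical; a worker thread stands
       in for the copy engine so the example actually runs. */
    #include <pthread.h>
    #include <string.h>
    #include <stdio.h>

    struct dma_req { void *dst; const void *src; size_t len; pthread_t worker; };

    static void *dma_worker(void *arg) {
        struct dma_req *r = arg;
        memcpy(r->dst, r->src, r->len);   /* stand-in for the copy engine */
        return NULL;
    }

    /* "Submit" a copy: returns immediately, copy happens in the background. */
    static void dma_submit_copy(struct dma_req *r, void *dst,
                                const void *src, size_t len) {
        r->dst = dst; r->src = src; r->len = len;
        pthread_create(&r->worker, NULL, dma_worker, r);
    }

    /* Block until the copy has completed. */
    static void dma_wait(struct dma_req *r) { pthread_join(r->worker, NULL); }

    int main(void) {
        static char src[1 << 20] = "hello", dst[1 << 20];
        struct dma_req req;
        dma_submit_copy(&req, dst, src, sizeof src);
        /* ...do unrelated work here, or there is no win over memcpy... */
        dma_wait(&req);                   /* must complete before touching dst */
        printf("%s\n", dst);
        return 0;
    }

Unless you genuinely have other work to overlap with the copy, the wait turns it back into a slower, more awkward memcpy.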
> DMA works for devices, because the device does the memory access. RAM to RAM DMA would need something to do the accesses.
Isn't a blitter exactly that sort of device? Assuming that it can access the relevant RAM, why couldn't that be used for general-purpose memory copying operations?
Yes, but PCs have only rarely had general-purpose blitters. They were integrated into some video cards, but that's more or less like DMA; Intel had one for a while recently [1]. FreeBSD loads a driver for it on my Xeon L5640 hosted server, but I don't see any evidence that anything actually uses it, and I'm not sure offloading copies enabled enough of an actual performance improvement, so Intel stopped including these. Linux marked their driver as broken because it caused issues with copy-on-write [2].
[1] https://lwn.net/Articles/162966/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
You can copy that way.
It's faster if you use the CPU, but you absolutely can just use DMA - and some embedded systems do.
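For flavour, here's a minimal sketch of how a memory-to-memory transfer is typically programmed on an embedded DMA controller. The register block, base address, and bit positions below are hypothetical rather than any specific chip, though real controllers tend to look similar:

    /* Hypothetical memory-mapped DMA channel for a mem-to-mem copy. */
    #include <stdint.h>
    #include <stddef.h>

    struct dma_channel {                 /* hypothetical MMIO layout */
        volatile uint32_t src_addr;
        volatile uint32_t dst_addr;
        volatile uint32_t count;         /* number of words to move */
        volatile uint32_t ctrl;          /* bit 0: enable, bit 1: mem-to-mem */
        volatile uint32_t status;        /* bit 0: transfer complete */
    };

    #define DMA_CH0 ((struct dma_channel *)0x40020000u)  /* hypothetical base */
    #define DMA_EN       (1u << 0)
    #define DMA_MEM2MEM  (1u << 1)
    #define DMA_DONE     (1u << 0)

    static void dma_copy_words(uint32_t *dst, const uint32_t *src, size_t nwords) {
        DMA_CH0->src_addr = (uint32_t)(uintptr_t)src;
        DMA_CH0->dst_addr = (uint32_t)(uintptr_t)dst;
        DMA_CH0->count    = (uint32_t)nwords;
        DMA_CH0->ctrl     = DMA_MEM2MEM | DMA_EN;   /* kick off the transfer */
        while (!(DMA_CH0->status & DMA_DONE))       /* or take an interrupt and */
            ;                                       /* do real work meanwhile   */
    }

On a small microcontroller the point is usually not raw speed but freeing the CPU (or an interrupt handler) while the copy runs.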
> It's faster if you use the CPU
But not for AMD? E.g. 8 Zen 5 cores in the CCD have only 64 GB/s read and 32 GB/s write bandwidth, while the dual-channel memory controller in the IOD has up to 87 GB/s bandwidth.
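Rough back-of-the-envelope for where numbers like those could come from (the link widths and clocks here are assumptions, not AMD-published internals):

    dual-channel DDR5-5600:  2 channels x 8 B x 5600 MT/s ~= 89.6 GB/s peak (close to the ~87 GB/s figure)
    per-CCD fabric link:     32 B/clk read  x ~2000 MHz FCLK ~= 64 GB/s
                             16 B/clk write x ~2000 MHz FCLK ~= 32 GB/s

If those assumptions hold, a single CCD is capped by its fabric link before the DRAM controller saturates, which seems to be the point.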
The issue is that a DMA setup:
A: requires the DMA system to know about each user process's memory mappings (i.e. hardware support for understanding CPU page tables)
B: spends time going from user mode to kernel mode and back (we invented io_uring and other mechanisms precisely to avoid that).
To some extent I guess the IOMMUs available to modern graphics cards solve this partially, but I'm not sure it's a free lunch (i.e. the driver/OS level still has to manage the mappings for this).
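For a concrete sense of point A: on Linux, before a copy engine behind an IOMMU could touch a user buffer, that buffer has to be pinned and mapped into the device's IOMMU domain. A minimal sketch of just that step using the real VFIO ioctl (the surrounding VFIO container/group/device setup and the copy-engine descriptor itself are elided, and the wrapper function name is mine):

    #include <linux/vfio.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Map a (page-aligned) user buffer into the device's IOMMU domain so a
       DMA engine could read/write it at the given IOVA. */
    int map_for_device(int container_fd, void *buf, size_t len, uint64_t iova)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uint64_t)(uintptr_t)buf,  /* user virtual address */
            .iova  = iova,                      /* address the device will use */
            .size  = len,
        };
        /* Pins the pages and installs IOMMU entries; this setup cost is
           something a plain memcpy never pays. */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }

All of that machinery has to happen (and stay consistent across fork/COW, swapping, etc.) before the first byte moves, which is a big part of why offloaded RAM-to-RAM copies rarely pay off for ordinary processes.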