You can copy that way.

It's faster if you use the CPU, but you absolutely can just use DMA - and some embedded systems do.

> It's faster if you use the CPU

But not on AMD? E.g. the 8 Zen 5 cores in a CCD have only 64 GB/s read and 32 GB/s write bandwidth, while the dual-channel memory controller in the IOD has up to 87 GB/s bandwidth.
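As a rough back-of-the-envelope check on those figures, here is a small sketch that estimates the copy rate on each path. It rests on two assumptions that are mine, not part of the numbers above: a copy driven by the cores of one CCD is capped by the slower of its read and write bandwidth, and a copy done on the memory-controller side has to pay for both the read and the write out of the same 87 GB/s. The function names are made up for illustration.

```python
# Bandwidth figures quoted above, in GB/s.
CCD_READ = 64.0
CCD_WRITE = 32.0
IOD_DRAM = 87.0

def cpu_copy_rate(read_bw, write_bw):
    # Assumption: every copied byte is read into the CCD once and written
    # back out once, so the slower direction limits the copy.
    return min(read_bw, write_bw)

def dram_side_copy_rate(dram_bw):
    # Assumption: every copied byte is read once and written once, and
    # both transfers share the same DRAM bandwidth.
    return dram_bw / 2

print(cpu_copy_rate(CCD_READ, CCD_WRITE))  # 32.0
print(dram_side_copy_rate(IOD_DRAM))       # 43.5
```

These are peak numbers only; caches, multiple CCDs and real-world efficiency are all ignored, so treat the result as an upper bound on each side rather than a benchmark.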

The issue is that a DMA setup:

A: requires the DMA system to know about each user process's memory mappings (i.e. hardware support for understanding CPU page tables)

B: spends time going from user mode to kernel mode and back (we invented io_uring and other mechanisms to avoid that).

To some extent I guess the IOMMUs available to modern graphics cards partially solve this, but I'm not sure it's a free lunch (i.e. managing the mappings for this might still be partly up to the driver/OS).
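To make the setup-cost argument concrete, here is a toy cost model, not a measurement. It assumes a DMA copy pays a fixed setup overhead (mode switch, mapping or pinning pages) before moving any data, while a CPU copy starts immediately. All the numbers and function names are hypothetical and only there to show the shape of the trade-off.

```python
def cpu_copy_time(nbytes, cpu_bw):
    # CPU copy has no setup cost in this model; time is purely the transfer.
    return nbytes / cpu_bw

def dma_copy_time(nbytes, dma_bw, setup_s):
    # DMA pays a fixed setup cost first, then transfers at dma_bw.
    return setup_s + nbytes / dma_bw

def break_even_bytes(cpu_bw, dma_bw, setup_s):
    # Copy size above which DMA is faster; None if DMA never catches up.
    if dma_bw <= cpu_bw:
        return None
    return setup_s / (1 / cpu_bw - 1 / dma_bw)

# Hypothetical inputs: 32 GB/s CPU copy, 64 GB/s DMA copy, 2 microseconds setup.
size = break_even_bytes(32e9, 64e9, 2e-6)
print(f"DMA wins above roughly {size / 1000:.0f} kB")
```

With these made-up inputs, small copies are always cheaper on the CPU, and DMA only pays off once the transfer is large enough to amortize the setup. If the DMA engine is no faster than the CPU, it never wins on latency, although it would still free the core for other work.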