PCIe already allows DMA between peers on the bus, but, as you pointed out, the traces for the lanes have to terminate somewhere. However, it doesn't have to be the CPU (which, of course, hosts the PCIe root complex in modern systems) handling the traffic: a PCIe switch can carry DMA between the devices attached to it, provided it supports routing peer-to-peer traffic directly.
That’s what happened in TFA.
You're right. Let me correct myself: a hobbyist-friendly hardware solution. Dolphin's PCIe switches cost more than 8 RTX 3090s on a Threadripper machine.
Jeff forgot to mention that in his post!