> The operation of copying data is super easy to parallelize across multiple threads. […] This will make the copy super-fast especially if the CPU has a large core count.
I seriously doubt that. Unless you have a NUMA system, a single core in a desktop CPU can easily saturate the bandwidth of the system RAM controller. If you can avoid going through main memory – e.g., when copying between the L2 caches of different cores – multi-threading can speed things up. But then you need precise knowledge of your program's memory access behavior, and this is outside the scope of a general-purpose memcpy.
> a single core in a desktop CPU can easily saturate the bandwidth of the system RAM controller.
Modern x86 machines offer far more memory bandwidth than what a single core can consume. The entire architecture is designed on purpose to ensure this.
Interestingly, this has not always been the case: the transition occurred during the 2010s.
Some modern non-x86 machines (and maybe even some very recent x86 ones) can't saturate their system memory bandwidth even with all of their CPU cores running at full tilt; for absolute best performance they would need to combine CPU and non-CPU (e.g., GPU or DMA) access.
I've seen modest but measurable speedups from very basic `pragma omp`-style parallelization of this sort of thing.
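For the curious, that style of parallel copy can be sketched roughly as follows. This is a minimal illustration, not a drop-in memcpy replacement: `parallel_copy` is a hypothetical helper name, the static chunking is the simplest possible split, and if the code is compiled without OpenMP support the pragma is simply ignored and the loop runs serially with the same result.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: split one large copy into per-thread slices.
   Each loop iteration copies an independent, non-overlapping chunk,
   so OpenMP can hand the iterations to different threads. Without
   -fopenmp the pragma is ignored and this degrades to a serial copy. */
static void parallel_copy(void *dst, const void *src, size_t n, int nthreads)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t chunk = (n + (size_t)nthreads - 1) / (size_t)nthreads;

    #pragma omp parallel for schedule(static) num_threads(nthreads)
    for (int t = 0; t < nthreads; t++) {
        size_t off = (size_t)t * chunk;
        if (off < n) {
            size_t len = (n - off < chunk) ? n - off : chunk;
            memcpy(d + off, s + off, len); /* each thread copies one slice */
        }
    }
}
```

Whether this beats a plain `memcpy` depends entirely on the points raised above: on a bandwidth-starved single core it can help; on a machine where one core already saturates the memory controller, the extra threads just contend.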
Do you remember any specifics? For example, the size of the copy, whether it was a NUMA system, or the total bandwidth of your system RAM?