Wait, I thought memcpy would have launched some sort of built-in mechanism (parallelized or whatever) to copy in RAM.

Just indicate the start and length. Why would the CPU need to keep issuing copy instructions?

The problem is that the built-in mechanism is often microcode, which is still slower than plain machine code in some cases.

There are some interesting writings from a former architect of the Pentium Pro on the reasons for this. One is apparently that the microcode engine often lacked branch prediction, so handling special cases in microcode was slower than a compare/branch in direct code. REP MOVS has a bunch of such cases, since it has to handle overlapping copies, interrupts, and deciding when to switch to cache-line-sized non-temporal accesses.
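
To make "compare/branch in direct code" concrete, here's a minimal sketch of the two approaches (assuming x86-64 and GCC/Clang inline-asm syntax; the function names are made up):

    #include <stddef.h>

    /* The microcoded path: REP MOVSB copies RCX bytes from [RSI] to [RDI]
       in a single instruction, with all the special-case handling done
       inside microcode. */
    static void copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }

    /* The "plain machine code" alternative: the loop branch goes through
       the normal branch predictor, and the compiler is free to unroll or
       vectorize it. */
    static void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }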

More recent Intel CPUs have enhanced REP MOVS support, with faster microcode and a CPUID feature flag (ERMS, "Enhanced REP MOVSB/STOSB") indicating that memcpy() should rely on it more often. But people have still found cases where, if the relative alignment between source and destination is just right, a manual copy loop is still noticeably faster than REP MOVS.
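
If you want to see whether your own CPU advertises that flag, a quick sketch (assuming GCC/Clang on x86; ERMS lives in CPUID leaf 7, subleaf 0, EBX bit 9, which is one of the things glibc's memcpy dispatch consults):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        /* CPUID.(EAX=7, ECX=0):EBX bit 9 is the ERMS feature bit. */
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            printf("ERMS: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
        return 0;
    }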

The poster has a Zen 2, where REP MOVS is only optimal for large copies. On newer Intel CPUs, glibc might indeed choose REP MOVSB more often.

I thought memcpy would have launched some sort of built-in mechanism

Where did you get this impression?

I'd expect memcpy calls to turn into __builtin_memcpy, and then into raw loads/stores for known small N and a call into compiler-rt for unknown or large N. If it doesn't, patches to do that for your architecture are likely appreciated.
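
For the known-small-N case, the classic example is a fixed-size memcpy used for type punning; with optimizations on, GCC and Clang typically lower this to a single 8-byte load rather than an actual call:

    #include <stdint.h>
    #include <string.h>

    /* A fixed-size memcpy like this usually compiles down to one mov,
       with no function call at all. */
    uint64_t load_u64(const void *p)
    {
        uint64_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }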

Calling a function with 'builtin' in the name doesn't mean it's embedded in the CPU itself to run concurrently, which I think is what they thought might exist.

From my college days, which were quite long ago. And working with Win32 "BitBlt" requests to the OS, etc.

And also, it would just make sense. If copying entire blocks or memory pages, such as "BitBlt", is one command, why would I need CPU cycles to actually do it? It would seem like the lowest-hanging fruit to automate in SDRAM.

It just seems like the easiest example of SIMD.

These are contradictory things. SIMD instructions are still regular instructions, not some concurrent system for copying. When you say "command", maybe you mean a Windows OS function similar to memcpy; an OS function and individual CPU instructions are two different things. There is something called DMA, but I don't know how much it is used for memory-to-memory copies.
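
To make the distinction concrete, here's roughly what a SIMD copy loop looks like (SSE2 intrinsics; assumes n is a multiple of 16). It's still just ordinary instructions the CPU executes one iteration at a time, not a copy engine running on its own:

    #include <emmintrin.h>
    #include <stddef.h>

    static void copy_sse2(void *dst, const void *src, size_t n)
    {
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;
        for (size_t i = 0; i < n / 16; i++) {
            __m128i v = _mm_loadu_si128(s + i);  /* 16-byte unaligned load  */
            _mm_storeu_si128(d + i, v);          /* 16-byte unaligned store */
        }
    }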

Well CPUs already transparently handle memory paging so why not copying?

https://en.wikipedia.org/wiki/Memory_paging

I'm not making a case for anything; I'm just explaining what exists. If copying were going to be done in bulk, it would have to be done asynchronously to some extent, though CPUs already work like that on a small scale due to instruction reordering.

Now it might be less necessary, because CPUs are so fast with contiguous data in memory that copying it to other parts of memory is less of a bottleneck.