Hacker News

The problem is that the built-in mechanism is often implemented in microcode, which in some cases is still slower than plain machine code.

There are some interesting writings from a former architect of the Pentium Pro on the reasons for this. One is apparently that the microcode engine often lacked branch prediction, so handling special cases in microcode was slower than an ordinary compare-and-branch in direct code. REP MOVS has a bunch of such cases because it has to handle overlapping copies and interrupts, and has to decide when to switch to cache-line-sized non-temporal accesses.
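To make the "compare/branch in direct code" point concrete, here is a rough sketch of how plain code can handle the overlapping-copy case with a single predictable branch. The name copy_overlap_aware is made up for illustration; this is not how any particular libc or the microcode actually does it, and a real implementation would copy far more than one byte per iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; not any libc's memmove. */
void *copy_overlap_aware(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Compare the addresses as integers rather than as pointers. */
    uintptr_t da = (uintptr_t)d;
    uintptr_t sa = (uintptr_t)s;

    if (n == 0 || da == sa)
        return dst;

    if (da < sa || da - sa >= n) {
        /* dst is below src, or the regions do not overlap:
           a forward copy never overwrites unread source bytes. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* dst starts inside [src, src + n): copy backward so each
           source byte is read before it gets overwritten. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

In ordinary code the direction check is one compare and one branch that the predictor learns quickly, which is exactly the kind of special case that was expensive to resolve inside an unpredicted microcode sequence.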

More recent Intel CPUs have enhanced REP MOVS support with faster microcode and a CPUID feature flag (ERMSB, "Enhanced REP MOVSB/STOSB") indicating that memcpy() should rely on it more often. Even so, people have found cases where, if the relative alignment between source and destination is just right, a manual copy loop is noticeably faster than REP MOVS.
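For reference, a minimal sketch of how a memcpy() implementation might test that flag, assuming GCC or Clang on x86 where <cpuid.h> provides __get_cpuid_count. The feature is reported in CPUID leaf 7, subleaf 0, EBX bit 9. The function names erms_bit_set and cpu_has_erms are made up here, and on other compilers or architectures the sketch simply reports 0.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Enhanced REP MOVSB/STOSB: CPUID.(EAX=7,ECX=0):EBX bit 9. */
#define ERMS_EBX_BIT 9u

/* Hypothetical helper: decode the flag from a leaf-7 EBX value. */
int erms_bit_set(uint32_t leaf7_ebx)
{
    return (int)((leaf7_ebx >> ERMS_EBX_BIT) & 1u);
}

/* Hypothetical helper: 1 if the CPU advertises the flag, else 0. */
int cpu_has_erms(void)
{
#if HAVE_X86_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Returns 0 when leaf 7 is not supported by this CPU. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return erms_bit_set(ebx);
#else
    return 0; /* assumption: treat non-x86 targets as lacking the flag */
#endif
}
```

A copy routine could call cpu_has_erms() once at startup and use the result as one input when choosing between REP MOVSB and a hand-written loop. As noted above, the flag is only a hint, so size and alignment may still need their own checks.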