memcpy and the other string routines are among the library functions that benefit most from heavy optimisation and tuning for specific CPUs. They get called a lot, and careful adjustment of the code can give major performance wins by making sure the full memory bandwidth of the CPU is actually being used (which may involve picking specific load instructions, deciding whether going through the SIMD registers is better or not, and so on). So everybody who cares about performance optimises these routines pretty carefully, regardless of toolchain/OS. For instance, the glibc versions are here:

https://github.com/bminor/glibc/tree/master/sysdeps/aarch64/...

and there are five versions, specialised either for specific CPU models or for available architecture features.
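
To make the "several specialised versions" idea concrete, here is a rough sketch in portable C of how a library could pick one copy routine at run time. This is not glibc's code: the names copy_bytewise, copy_wordwise, select_copy, my_copy and the feature probe cpu_prefers_wide_copies are all made up for illustration, and the probe is just a stub. As far as I know glibc does the selection once at load time through ifunc resolvers rather than through a lazily initialised pointer like this, and its real variants are hand-written assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline variant: copies one byte per iteration. */
static void *copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* "Tuned" variant: moves 8 bytes per iteration, then finishes the tail.
 * The fixed-size memcpy calls are the portable way to express an
 * unaligned 64-bit load/store; compilers typically lower them to
 * single instructions. Real tuned routines go much further than this. */
static void *copy_wordwise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Hypothetical feature probe (an assumption, stubbed out here).
 * A real one would inspect CPU model or feature bits. */
static int cpu_prefers_wide_copies(void)
{
    return 1;
}

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Pick the variant that suits the (pretend) CPU. */
static copy_fn select_copy(void)
{
    return cpu_prefers_wide_copies() ? copy_wordwise : copy_bytewise;
}

/* Public entry point: resolves the implementation on first use.
 * Not thread-safe as written; fine for a sketch. */
void *my_copy(void *dst, const void *src, size_t n)
{
    static copy_fn impl;
    if (!impl)
        impl = select_copy();
    return impl(dst, src, n);
}
```

Callers only ever see my_copy; which variant runs underneath is decided by the selector, which is the same shape as having one memcpy symbol backed by several CPU-specific implementations.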