Conclusion

Stick to `std::memcpy`. It delivers great performance while also adapting to the hardware architecture, and it makes no assumptions about memory alignment.
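The "no alignment assumptions" point is worth spelling out: a common, safe use of `memcpy` is reading a value out of a byte buffer at an arbitrary offset, where a pointer cast would be undefined behaviour on strict-alignment targets. A minimal sketch (the function name `load_u32` is just illustrative):

```cpp
#include <cstdint>
#include <cstring>

// Read a 32-bit value from a possibly unaligned position in a byte buffer.
// Casting p to uint32_t* would be UB on strict-alignment hardware; memcpy
// is always well-defined, and compilers typically lower it to a plain load.
std::uint32_t load_u32(const unsigned char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```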

----

So that's five minutes I'll never get back.

I'd make an exception for RISC-V machines with "RVV" vectors, where vectorised `memcpy` hasn't yet made it into the standard library and a simple ...

    0000000000000000 <memcpy>:
       0:   86aa                    mv      a3,a0
    
    0000000000000002 <.L1^B1>:
       2:   00267757                vsetvli a4,a2,e8,m4,tu,mu
       6:   02058007                vle8.v  v0,(a1)
       a:   95ba                    add     a1,a1,a4
       c:   8e19                    sub     a2,a2,a4
       e:   02068027                vse8.v  v0,(a3)
      12:   96ba                    add     a3,a3,a4
      14:   f67d                    bnez    a2,2 <.L1^B1>
      16:   8082                    ret
... often beats `memcpy` by a factor of 2 or 3 on copies that fit into L1 cache.

https://hoult.org/d1_memcpy.txt
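For readers who don't speak RISC-V assembly, the loop's structure can be sketched in plain C++. This is a rough stand-in, not the real thing: on hardware, `vsetvli` picks the vector length VL per iteration and `vle8.v`/`vse8.v` move VL bytes in one go; here a fixed `CHUNK` plays the role of VL.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Shape of the RVV memcpy loop above: each iteration requests a chunk,
// copies it, and advances both pointers until no bytes remain.
void chunked_copy(unsigned char* dst, const unsigned char* src, std::size_t n) {
    constexpr std::size_t CHUNK = 16;      // stand-in for the hardware-chosen VL
    while (n != 0) {
        std::size_t vl = std::min(n, CHUNK); // vsetvli a4,a2,e8,m4,tu,mu
        std::memcpy(dst, src, vl);           // vle8.v v0,(a1) + vse8.v v0,(a3)
        src += vl;                           // add a1,a1,a4
        dst += vl;                           // add a3,a3,a4
        n -= vl;                             // sub a2,a2,a4
    }                                        // bnez a2, loop
}
```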

> So that's five minutes I'll never get back.

Confirming the null hypothesis with good supporting data is still interesting. It could save you from doing this yourself.

You could read the article and end up disagreeing with it. The value is in grokking the details, not in whether the insight changes your decisions. It can just make your decisions more grounded in data.

You pre-stole my comment, I was about to make the exact same post :-D

Although the blog post is about going faster and shows alternative algorithms, the conclusion still favours safety, which makes perfect sense. He did, however, show us a few strategies, which is useful. The five minutes I spent will never be returned to me, but at least I learned something interesting...