Conclusion
Stick to `std::memcpy`. It delivers great performance, adapts to the hardware architecture, and makes no assumptions about memory alignment.
----
So that's five minutes I'll never get back.
I'd make an exception for RISC-V machines with "RVV" vectors, where vectorised `memcpy` hasn't yet made it into the standard library and a simple ...
```asm
0000000000000000 <memcpy>:
   0: 86aa       mv      a3,a0
0000000000000002 <.L1^B1>:
   2: 00267757   vsetvli a4,a2,e8,m4,tu,mu
   6: 02058007   vle8.v  v0,(a1)
   a: 95ba       add     a1,a1,a4
   c: 8e19       sub     a2,a2,a4
   e: 02068027   vse8.v  v0,(a3)
  12: 96ba       add     a3,a3,a4
  14: f67d       bnez    a2,2 <.L1^B1>
  16: 8082       ret
```
... often beats `memcpy` by a factor of 2 or 3 on copies that fit into L1 cache.
> So that's five minutes I'll never get back.
Confirming the null hypothesis with good supporting data is still interesting. It could save you from doing this yourself.
You could read the article and end up disagreeing with it. The value is in grokking the details, not in whether the insight changes your decisions. It can just make your decisions more grounded in data.
You pre-stole my comment, I was about to make the exact same post :-D
Although the blog post is about going faster and shows alternative algorithms, the conclusion remains to stick with `std::memcpy` for safety, which makes perfect sense. Still, he did show us a few useful strategies. The five minutes I spent will never be returned to me, but at least I learned something interesting...