The conclusion was to not bother, and to use something purpose-specific if you do in fact need performance. Technically speaking, you can generate the perfect memcpy for any kind of data structure, and if I remember correctly, LLVM has a few tricks for that.

Anyway, the original point was that memcpy benchmarks are useless, since memcpy is almost never used in isolation. And you will always be able to achieve better performance when you know what the data is in advance (as shown in the article).
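
To make the "know the data in advance" point concrete, here is a rough sketch in C. The `vec4` struct and both function names are made up for illustration and are not from the article. The only difference between the two copies is whether the size is known at compile time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte record, invented for this example. */
struct vec4 {
    float x, y, z, w;
};

/* Generic path: the size is only known at run time, so the compiler
 * usually has to emit a call to the general-purpose memcpy, which
 * must handle every length and alignment. */
void copy_generic(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Purpose-specific path: the size is a compile-time constant, so an
 * optimizing compiler is free to replace the call with a few fixed
 * load/store instructions. Whether it does depends on the compiler
 * and flags. */
void copy_vec4(struct vec4 *dst, const struct vec4 *src)
{
    memcpy(dst, src, sizeof *dst);
}
```

Both functions produce the same bytes. With optimizations on, it is worth comparing the generated assembly for the two; the fixed-size version often contains no call to memcpy at all, which is part of why an isolated memcpy benchmark says little about real code.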