Because unless your application is 90% memcpy, it's simply not relevant in a real-world scenario: whether a single copy takes 2 cycles or up to 50 in some cases, the overall performance will be identical.
This is a library - it doesn't know whether the app is sending one message per second or 10k. But ideally it would be as good as possible in the second case.
Also, for some uses the small time costs add up. If you're doing real-time rendering or simulations, you get a small per-frame time budget. Either you hit it or you don't, so even tiny improvements may matter.
The conclusion was to not bother, and to use something purpose-specific if you do in fact need performance. Technically speaking, you can generate the perfect memcpy to copy any given kind of data structure, and if I remember correctly LLVM has a few tricks for that.
Anyway, the original point was that benchmarks are useless since memcpy is almost never used in isolation. And you will always be able to achieve better performance when you know what the data is in advance (as shown in the article).