The full reorder buffer is still only going to be 200-500 instructions deep. The actual benchmark isn't linked, but it would take only a hundred or so messages before the reordering stops mattering. On the other hand, when you use the library, the write needs to actually finish in the shared memory before you notify the other process. So unless the benchmark was tiny for some reason, why would this be irrelevant?
Because unless your application is 90% memcpy, it's simply not relevant in a real-world scenario: whether the copy takes 2 cycles or up to 50 in some cases, overall performance will be identical.
This is a library - it doesn't know whether the app is sending one message or 10k per second. But ideally it would be as good as possible in the second case.
Also, for some uses the small time usages add up. If you're doing real time rendering or simulations, you get a small per-frame time budget. Either you hit it or not, so even tiny improvements may matter.
The conclusion was to not bother, and to use something purpose-specific if you do in fact need the performance. Technically speaking, you can generate the perfect memcpy for any given data structure, and if I remember correctly LLVM has a few tricks for that.
Anyway, the original point was that benchmarks are useless since memcpy is almost never used in isolation. And you can always achieve better performance when you know what the data is in advance (as shown in the article).