If I'm reading the chart at the end correctly, the better performance only shows up for small buffer sizes that fit in the cache (4k). For big buffers, the stdlib copy performs about the same as the optimized copy he writes.