The graph at the end seems pretty dubious. For example, for the AvxUnrollCopier, why does the data transfer speed jump to >120 GB/s at 4 KB, then drop to ~50 GB/s at 32 KB, then drop to <20 GB/s at 16 MB? It just doesn't make sense.

The L1 cache is faster than the L3 cache. Does it need to be anything more complicated than that?
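
If you want to check this on your own machine, here is a rough sketch that times repeated copies at the three buffer sizes mentioned above. It is not the benchmark behind the graph: `copy_throughput_gbs` and its `total_bytes` parameter are names I made up for illustration, and Python's per-call overhead will flatten the peak for small buffers, so treat the numbers as indicative only. The expectation, assuming typical cache sizes, is that throughput falls as the working set (source plus destination) outgrows L1, then L2, then L3.

```python
import time


def copy_throughput_gbs(size_bytes, total_bytes=1 << 28):
    """Approximate copy throughput in GB/s for buffers of size_bytes.

    Hypothetical helper for illustration. It copies the same source buffer
    into the same destination buffer repeatedly, so a small working set
    stays resident in the fastest cache level that can hold it.
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    # Move roughly total_bytes overall so every size runs for a similar time.
    reps = max(1, total_bytes // size_bytes)

    start = time.perf_counter()
    for _ in range(reps):
        dst[:] = src  # equal-length slice assignment: a plain memory copy
    elapsed = time.perf_counter() - start

    if elapsed <= 0:
        # Timer resolution too coarse to measure; report as unbounded.
        return float("inf")
    return size_bytes * reps / elapsed / 1e9


if __name__ == "__main__":
    # Sizes taken from the comment above: 4 KB, 32 KB, 16 MB.
    for label, size in [("4 KB", 4 << 10), ("32 KB", 32 << 10), ("16 MB", 16 << 20)]:
        print(f"{label:>6}: {copy_throughput_gbs(size):.1f} GB/s")
```

A C or C++ version with a hand-written copy loop would show the steps between cache levels much more sharply, since there is no interpreter overhead between iterations.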