"That costs ~5 ns when the line is cold"
I don't see how that could possibly be true. Sounds like a low-ball estimate.
Also, I wish to point out that the "tcmalloc" being used as a baseline in these performance claims is Ye Olde tcmalloc, the abandoned and now community-maintained version of the project. The current version of tcmalloc is a completely different thing that the mimalloc-bench project doesn't support (correctly; I just checked).
Fair points on both - the 5ns is the L2 hit case. I should have stated the range (30-60ns?) instead of the best case. And yes, fixing the tcmalloc case is on my list - thanks for pointing that out. And also to be clear, the goal was never to beat jemalloc or tcmalloc on raw throughput. I wanted to show that one doesn't have to give up competitive performance to get explicit heaps, hard caps and teardown semantics.
That makes sense. I have a long-standing beef with the mimalloc-bench people because they made a bunch of claims in their paper, but as recently as 2022 they were apparently not aware of the distinction between the two tcmallocs, and the way they tried to shoehorn tcmalloc into their harness is plain broken. That is not a problem caused by your fine project.