I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).
tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).
Multithreaded async systems destroy that locality, so the allocator constantly has to take the exception path: thread A allocates a buffer and goes async, the request wakes up on thread B, B frees the buffer, and B then has to synchronize with A to give it back.
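To make the pattern concrete, here is a minimal sketch using only the standard library (plain threads and a channel standing in for an async runtime; `cross_thread_free` is a name I made up for illustration). Every buffer is allocated on one thread and dropped on another, which is the case a purely thread-local cache handles worst. It only shows the access pattern and does not measure any particular allocator.

```rust
use std::sync::mpsc;
use std::thread;

/// Allocates `count` buffers of `size` bytes on one thread and frees them on
/// another. Returns (whether the two thread ids differ, number of buffers freed).
fn cross_thread_free(count: usize, size: usize) -> (bool, usize) {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // "Thread A": every allocation happens here.
    let producer = thread::spawn(move || {
        let id = thread::current().id();
        for _ in 0..count {
            tx.send(vec![0u8; size]).expect("receiver should still be alive");
        }
        id // `tx` is dropped when this closure returns, which closes the channel
    });

    // "Thread B": every free happens here, so no buffer returns to the
    // cache of the thread that allocated it without extra work.
    let consumer = thread::spawn(move || {
        let id = thread::current().id();
        let mut freed = 0usize;
        for buf in rx {
            drop(buf); // deallocation on a different thread than the allocation
            freed += 1;
        }
        (id, freed)
    });

    let alloc_thread = producer.join().expect("producer panicked");
    let (free_thread, freed) = consumer.join().expect("consumer panicked");
    (alloc_thread != free_thread, freed)
}

fn main() {
    let (different, freed) = cross_thread_free(1000, 4096);
    println!("allocated and freed on different threads: {different}, buffers freed: {freed}");
}
```

A work-stealing executor produces the same shape of traffic whenever a task that owns a buffer resumes on a different worker thread than the one it started on.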
Are you using async Rust or sync Rust?
Modern tcmalloc uses per-CPU caches via rseq [0]. We use async Rust with multithreaded tokio executors (sometimes several in the same application), so thread counts are relatively high.
[0]: https://github.com/google/tcmalloc/blob/master/docs/design.m...
How do you control which CPU your task resumes on? If you don't, then it's still the same problem described above, no?