modern tcmalloc uses per CPU caches via rseq [0]. We use async rust with multithreaded tokio executors (sometimes multiple in the same application). so relatively high thread counts.

[0]: https://github.com/google/tcmalloc/blob/master/docs/design.m...

How do you control which CPU your task resumes on? If you don't then it's still the same problem described above, no?