The ideal arrangement is one in which you do not need to use the memory subsystem in the first place. If two threads need to communicate back and forth in a very tight loop to get some kind of job done, there is almost certainly a much faster technique that could be run on a single thread. Physically moving the information between cores is the most expensive part. You can totally saturate the memory bandwidth of a Zen chip with somewhere around 8-10 cores if they're all hitting a shared working set aggressively.

Core-to-core communication across Infinity Fabric is on the order of 50-100x slower than L1 access. Figuring out how to arrange your problem around this reality is the quickest path to success if you intend to leverage this kind of hardware. Recognizing that your problem is incompatible can also save you a lot of frustration. If your working sets must be massive, monolithic, and hierarchical in nature, it's unlikely you will be able to use a 256+ core monster part very effectively.
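As a minimal sketch of what "arranging the problem" can mean in practice: shard the working set so each thread owns a disjoint slice and a private accumulator, and only merge at the end, so nothing bounces between cores in the hot loop. The sizes and names here are illustrative, not from the article; build with something like -std=c++17 -pthread.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Pad each per-thread accumulator to a cache line so the slots don't
    // false-share, which would reintroduce the cross-core traffic we're
    // trying to avoid.
    struct alignas(64) Slot { uint64_t value = 0; };

    int main() {
        std::vector<uint64_t> data(1 << 24, 1);            // shared, read-only input
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 4;
        std::vector<Slot> partial(n);                      // one private accumulator per thread
        std::vector<std::thread> pool;

        std::size_t chunk = data.size() / n;
        for (unsigned t = 0; t < n; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end   = (t + 1 == n) ? data.size() : begin + chunk;
            pool.emplace_back([&, t, begin, end] {
                // Each thread reads only its own slice and writes only its own
                // slot, so there is no core-to-core traffic in the hot loop.
                partial[t].value = std::accumulate(data.begin() + begin,
                                                   data.begin() + end, uint64_t{0});
            });
        }
        for (auto &th : pool) th.join();

        uint64_t total = 0;                                // merge once, at the end
        for (auto &s : partial) total += s.value;
        std::printf("total = %llu\n", (unsigned long long)total);
    }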

Note that none of the CPUs in the article have that Zen architecture.

One of the most interesting and poorly exploited features of these new Intel chips is that four cores share an L2 cache, so cooperation among 4 threads can be remarkably efficient.
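A rough Linux/glibc sketch of exploiting that: pin the four cooperating threads onto one L2 group. The assumption that cores 0-3 form such a group is purely illustrative; real code should read the mapping from /sys/devices/system/cpu/cpu*/cache/index2/shared_cpu_list rather than hard-coding it.

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    // Pin a std::thread to a specific core (Linux/glibc).
    void pin_to_core(std::thread &t, int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }

    int main() {
        std::vector<std::thread> group;
        for (int core = 0; core < 4; ++core) {
            group.emplace_back([core] {
                (void)core;  // placeholder: real work would hit a structure sized to stay resident in L2
            });
            pin_to_core(group.back(), core);   // assumption: cores 0-3 are one L2 cluster
        }
        for (auto &t : group) t.join();
    }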

They also have user-mode address monitoring, which should be awesome for certain tricks, but unfortunately, like so many other ISA extensions, it doesn't work. https://www.intel.com/content/www/us/en/developer/articles/t...
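For reference, the intended usage of those waitpkg instructions (UMONITOR/UMWAIT) looks roughly like this sketch, built with GCC or Clang and -mwaitpkg; whether it actually behaves as advertised on a given part is exactly the problem the linked note describes.

    #include <x86intrin.h>   // _umonitor, _umwait, __rdtsc (GCC/Clang, -mwaitpkg)
    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> flag{0};

    // Spin-free wait: arm a monitor on the flag's cache line, then drop the
    // core into a low-power wait state until the line is written or the
    // TSC deadline passes.
    void wait_for_flag() {
        while (flag.load(std::memory_order_acquire) == 0) {
            _umonitor((void *)&flag);                  // arm the monitor on the flag's line
            if (flag.load(std::memory_order_acquire) != 0)
                break;                                 // re-check to close the wakeup race
            uint64_t deadline = __rdtsc() + 100000;    // give up after ~100k TSC ticks
            _umwait(0 /* request C0.2 */, deadline);
        }
    }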

One of the use cases for Clickhouse and related columnar stores is simply to process all your data as quickly as possible, where “all” is certainly more than what will fit in memory and in some cases more than what will fit on a single disk. For these I’d expect the allocator issue is contention in the MMU, the TLB, or simply allocators that are not lock-free (like the standard glibc allocator). One trick, where possible, is to pre-allocate as much as you can for your worker pool so you get that out of the way and stop calling malloc once you begin processing. If you can swing it, you replace chunks of processed data with new data within the same allocated area. At a previous job our custom search engine did just this to scale out better on the AWS X1 instances we were using for processing data.
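A bare-bones sketch of that "allocate up front, recycle in place" pattern, with made-up names and sizes: one big allocation happens before processing starts, workers check fixed-size buffers out of it for each chunk of input, and malloc never gets called on the hot path.

    #include <cstddef>
    #include <cstdint>
    #include <mutex>
    #include <vector>

    // Pre-allocated pool of fixed-size buffers, carved out of one big
    // allocation made before the workers start.
    class BufferPool {
    public:
        BufferPool(std::size_t count, std::size_t bytes)
            : storage_(count * bytes), bytes_(bytes) {
            for (std::size_t i = 0; i < count; ++i)
                free_.push_back(storage_.data() + i * bytes);
        }
        std::uint8_t *acquire() {            // grab a buffer for the next chunk; no malloc
            std::lock_guard<std::mutex> lk(m_);
            if (free_.empty()) return nullptr;
            std::uint8_t *p = free_.back();
            free_.pop_back();
            return p;
        }
        void release(std::uint8_t *p) {      // hand the buffer back so it can be refilled
            std::lock_guard<std::mutex> lk(m_);
            free_.push_back(p);
        }
        std::size_t buffer_size() const { return bytes_; }
    private:
        std::vector<std::uint8_t> storage_;  // the single up-front allocation
        std::vector<std::uint8_t *> free_;
        std::size_t bytes_;
        std::mutex m_;
    };

The mutex here is just to keep the sketch short; in practice you'd give each worker its own pool (or a lock-free freelist) so that checking buffers in and out doesn't itself become the contention point.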