> even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines
I don’t think it is accurate that Intel CPUs use 2 cache lines / 128 bytes as the coherency protocol granule.
Yes, there can be additional destructive interference effects at that granularity, but that's due to prefetching (of two cache lines whose coherency is still managed independently) rather than coherency operating on a single 128-byte granule.
AFAIK 64 bytes is still the correct granule for avoiding false sharing: two cores modifying two consecutive cache lines see far less destructive interference than two cores modifying the same cache line.
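
To make the 64-byte point concrete, here is a minimal C++ sketch of per-line padding. It assumes a 64-byte cache line; `kCacheLine`, `PaddedCounter`, `line_of` and `on_separate_lines` are names I made up for illustration, not from any library.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, the granule argued for above.
constexpr std::size_t kCacheLine = 64;

// alignas rounds both the alignment and the size of the struct up to 64,
// so every array element starts on its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "line-aligned");

// Hypothetical helper: index of the 64-byte line that contains p.
inline std::uintptr_t line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// Hypothetical helper: true when the two counters start in different lines.
inline bool on_separate_lines(const PaddedCounter& a, const PaddedCounter& b) {
    return line_of(&a) != line_of(&b);
}
```

With `PaddedCounter counters[2]`, each core can update its own element without sharing a line with the other. The two elements can still land in the same 128-byte aligned pair, which is where the prefetch-related interference could show up; padding to 128 bytes would be the alternative if measurements showed that mattered for a given workload.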