I admit PSI wasn't on our radar for this specific issue. We've been staring at RSS and page fault counters, but they are indeed too noisy in an mmap-heavy workload.
Checking /proc/pressure/memory to distinguish between 'healthy caching' and 'thrashing' sounds exactly like the signal we are missing. We will try to incorporate some pressure metrics into the node's health report. Thanks for the pointer.
That said, we still have too many metrics to balance.
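
For reference, here is a rough sketch of how we might fold the PSI reading into the health report. It assumes the standard two-line layout of /proc/pressure/memory (a `some` line and a `full` line, each with `avg10`, `avg60`, `avg300`, and `total` fields). The function names and the 5.0 cutoff are our own placeholders, not anything from the kernel or from an existing tool, and the cutoff would need tuning against real nodes.

```python
from __future__ import annotations

from pathlib import Path

# Standard PSI location on kernels built with PSI support.
PSI_MEMORY_PATH = Path("/proc/pressure/memory")


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI file contents into {"some": {...}, "full": {...}}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def classify(psi: dict[str, dict[str, float]], full_avg10_limit: float = 5.0) -> str:
    """Label the node from the 'full' avg10 value.

    full_avg10_limit is a hypothetical threshold (percent of time stalled
    over the last 10 seconds), chosen only for illustration.
    """
    full_avg10 = psi.get("full", {}).get("avg10", 0.0)
    return "thrashing" if full_avg10 >= full_avg10_limit else "healthy"


def read_memory_pressure(path: Path = PSI_MEMORY_PATH) -> dict[str, dict[str, float]]:
    """Read and parse the live PSI file (Linux only)."""
    return parse_psi(path.read_text())
```

On a node, `classify(read_memory_pressure())` would give a single label to put in the report. Keying on the `full` line is a deliberate simplification: it counts time when all non-idle tasks were stalled on memory at once, so it should stay near zero while the page cache is merely busy, which may also help with the "too many metrics" problem.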