Have you measured Pressure Stall Information or active pages from /proc/meminfo?
Attempting to enumerate every resource variable (CPU, IOPS, RSS, Disk, logical count) into a single scoring function feels like an NP-hard trap.
That's perfect for machine learning.
I admit PSI wasn't on our radar for this specific issue. We've been staring at RSS and page fault counters, but they are indeed too noisy in an mmap-heavy workload.
Checking /proc/pressure/memory to distinguish between 'healthy caching' and 'thrashing' sounds exactly like the signal we are missing. We will try to incorporate some pressure metrics into the node's health report. Thanks for the pointer.
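For reference, a minimal sketch of what reading that file could look like on our side, assuming the usual PSI layout of a `some` line and a `full` line, each carrying `avg10`, `avg60`, `avg300` and `total` fields. The function names and the 5.0 threshold are hypothetical placeholders, not values we have tuned or anything the kernel prescribes.

```python
from pathlib import Path


def parse_psi(text):
    """Parse PSI file contents into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is cumulative stall time in microseconds; the avg* fields are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def is_thrashing(psi, full_avg10_limit=5.0):
    """Flag a node when all tasks were stalled on memory for too much of the last 10 s.

    The limit is an assumed placeholder and would need tuning per workload.
    """
    return psi.get("full", {}).get("avg10", 0.0) > full_avg10_limit


def read_memory_pressure(path="/proc/pressure/memory"):
    """Read and parse the memory PSI file (requires a kernel built with PSI support)."""
    return parse_psi(Path(path).read_text())
```

The idea would be that a node with a large page cache but near-zero `full avg10` is caching healthily, while a sustained non-zero value suggests tasks are actually stalling on reclaim.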
But that still leaves us with too many metrics to balance.
Don't use the Active/Inactive counters from /proc/meminfo. They don't reflect how much memory is actually active or inactive.