There's a simple solution: don't use mmap(). There's a reason that databases use O_DIRECT to read into their own in-memory cache. If it was Good Enough for Oracle in the 1990s, it's probably Good Enough for you.
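For concreteness, here's a minimal sketch of what "read into your own cache" looks like on Linux: open with O_DIRECT and pread() into an aligned, application-owned buffer. The file name, the 4 KiB block size, and the lack of retry logic are placeholder assumptions, not recommendations.

```c
/* Minimal O_DIRECT sketch: the application owns the buffer (its own cache),
 * and the kernel page cache is bypassed. O_DIRECT requires the buffer
 * address, file offset, and length to be block-aligned; 4096 is a common,
 * conservative choice. "data.db" is a placeholder path. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK 4096

int main(void)
{
    int fd = open("data.db", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) { close(fd); return 1; }

    /* Offset and length must also be block-aligned for O_DIRECT. */
    ssize_t n = pread(fd, buf, BLOCK, 0);
    if (n < 0) perror("pread");
    else printf("read %zd bytes into an application-managed buffer\n", n);

    free(buf);
    close(fd);
    return 0;
}
```

The alignment fuss is the price of bypassing the page cache: once you pay it, you decide what stays resident and when the read happens, not the kernel.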
mmap() is one of those things that looks like an easy solution when you start writing an application, but only because you don't yet see the complexity time bomb you're taking on.
The entire point of asynchronous disk I/O APIs like io_uring is to control when and where tasks block on I/O. Once you know where the blocking happens, you can make it part of your main event loop.
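A minimal liburing sketch of that idea: the read is queued without blocking, and the completion is reaped wherever the event loop chooses to wait. The file name, queue depth, and single 4 KiB read are illustrative assumptions only.

```c
/* Submit a read via io_uring, then reap the completion from the same loop
 * that handles everything else -- blocking happens only where the event
 * loop decides to wait. Build with -luring. */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

    int fd = open("data.db", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(4096);

    /* Queue the read; nothing blocks here. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    /* The event loop chooses when (and whether) to wait for completions. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read completed: %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    free(buf);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```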
If you don't know when or where blocking occurs (be it on I/O or mutexes or other such things), you're forced to make up for it by increasing the size of your thread pool. But larger thread pools come with a penalty: task switches are expensive! Scheduling is expensive! The AVX-512 registers alone are 2KB of state per task, and if a thread hasn't run for a while, you're probably missing in your L1 and L2 caches. That's pure overhead baked into the thread-pool architecture, and you can avoid it entirely with an event-driven architecture.
All the high-performance systems I've worked on use event-driven architectures -- from various network protocol implementations (protocols like BGP on JunOS, the HA functionality) to high-speed (persistent and non-persistent) messaging (at Solace). It just makes everything easier when you can keep threads hot and locked to a single core. Bonus: when the system is at maximum load, you stay at pretty much the same requests per second instead of degrading as the number of runnable threads grows and wastes CPU exactly when you need it most.
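For illustration, pinning a thread to a core on Linux is a few lines with pthread_setaffinity_np(); the core number below is a placeholder, and real deployments pick (and often isolate) cores deliberately.

```c
/* Sketch of pinning the event-loop thread to one core so it stays hot in
 * L1/L2. Core 0 is a placeholder choice. Linux-specific. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    if (pin_to_core(0) != 0) {
        fprintf(stderr, "failed to pin thread\n");
        return 1;
    }
    /* ... run the event loop here; it never migrates off core 0 ... */
    return 0;
}
```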
It's hard to believe that the event queue architecture I first encountered on an Amiga in the late 1980s when I was just a kid is still worth knowing today.
Relevant: https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf
You're right. O_DIRECT is the endgame, but that's a full engine rewrite for us.
We're trying to stabilize the current architecture first. The complexity of hidden page fault blocking is definitely what's killing us, but we have to live with mmap for now.
I am curious -- what is the application and the language it's written in?
There are insanely dirty hacks you could do to start controlling the fallout of the page faults (like playing games with userfaultfd), but they're unmaintainable in the long term: they introduce a fragility that shows up as unexpected complexity at the worst possible times (bugs). Rewriting / refactoring is not that hard once you understand the pattern, and I've done it quite a few times. Depending on the language, there may be other options. Doing an mlock() on the memory being used could help, but then it's absolutely necessary to carefully limit how much memory is pinned by such mappings.
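A rough sketch of that mlock() approach, assuming a Linux mmap()ed file and an arbitrary 64 MiB cap on pinned memory; real code has to respect RLIMIT_MEMLOCK and decide which window of the mapping is actually hot. The file name and cap are placeholders.

```c
/* Pin only a bounded window of an mmap()ed file so the hot range can't
 * fault, while capping how much memory gets locked. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define LOCK_CAP (64UL << 20)   /* never pin more than 64 MiB */

int main(void)
{
    int fd = open("data.db", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pin only the hot prefix of the file, up to the cap. */
    size_t pin = (size_t)st.st_size < LOCK_CAP ? (size_t)st.st_size : LOCK_CAP;
    if (mlock(map, pin) < 0)
        perror("mlock");   /* likely RLIMIT_MEMLOCK; fall back to faulting */
    else
        printf("pinned %zu bytes of the mapping\n", pin);

    munlock(map, pin);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```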
Having been a kernel developer for a long time makes it a lot easier to spot what will work well for applications versus what can be considered glass jaws.
There is a database that uses `mmap()` - RavenDB. Their memory accounting is utter horror - they somehow use Committed_AS from /proc/meminfo in their calculations. Their recommendation for avoiding OOMs is to have swap twice the size of RAM. Their Jepsen test results are pure comedy.
LMDB uses mmap() as well, but it only allows one write transaction at a time. It's also not intended for working sets larger than available RAM.
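For reference, this is roughly what the standard liblmdb C API looks like; the "./lmdb-data" directory (which must already exist) and the 1 GiB map size are placeholder assumptions, and error checking is omitted for brevity.

```c
/* Minimal LMDB sketch: one environment, one write transaction, one read
 * back. Build with -llmdb. */
#include <lmdb.h>
#include <stdio.h>

int main(void)
{
    MDB_env *env;
    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 1UL << 30);          /* size of the mmap() */
    mdb_env_open(env, "./lmdb-data", 0, 0664);

    MDB_txn *txn;
    MDB_dbi dbi;
    mdb_txn_begin(env, NULL, 0, &txn);            /* write txn: one at a time */
    mdb_dbi_open(txn, NULL, 0, &dbi);

    MDB_val key = { 3, "foo" };
    MDB_val val = { 3, "bar" };
    mdb_put(txn, dbi, &key, &val, 0);
    mdb_txn_commit(txn);

    /* Readers see an MVCC snapshot of the memory map. */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    MDB_val out;
    if (mdb_get(txn, dbi, &key, &out) == 0)
        printf("foo -> %.*s\n", (int)out.mv_size, (char *)out.mv_data);
    mdb_txn_abort(txn);

    mdb_env_close(env);
    return 0;
}
```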