I've never understood the fascination some people have with mmap. Memory-mapped file IO is just a RAM cache combined with a hidden system call (a page fault) to fill the cache. You can do the same thing yourself by using O_DIRECT to fill regular anonymous memory. If you're feeling social, you can fill a mapped and shared memfd.

You can seal memfds too, which means that the "read-only" mode is easy to implement: just map your memfd for write, apply F_SEAL_FUTURE_WRITE, and share the memfd to anyone you want to have read-only access.

By doing your own O_DIRECT IO instead of relying on the kernel's defaults, you get a lot more control. You choose how much readahead to do; you choose your read-cluster size. You choose your cache eviction strategy. You choose when to write back.

BTW: O_DIRECT can also be done asynchronously using aio or io_uring. There's no such thing as an asynchronous page fault. And IO errors? Would you rather deal with EIO or SIGBUS?

Why would you want the kernel to do these things for you? It'll do a worse job: it has less information than you do and has to use blunt heuristics that work sort-of-good-enough for the whole world, not just your program.

And it's not any faster either. O_DIRECT is DMA. A page cache fill is also DMA. It's the same operation, spelled differently.

I use mmap with my SQLite database[1] because I have many concurrent SQLite connections (one per concurrent HTTP request) and I don't want each connection to have its own 2MB cache[2]. It's better that all the connections simply share the page cache.

[1]: https://sqlite.org/pragma.html#pragma_mmap_size

[2]: https://sqlite.org/pragma.html#pragma_cache_size
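For anyone wanting to try this setup, here is a minimal sketch using Python's built-in sqlite3 module. The 256 MiB map limit, the 256 KiB private cache, and the helper name `open_shared_cache_connection` are assumptions made for illustration, not values from the comment above; the effective mmap limit may also be capped or disabled by how SQLite was compiled.

```python
import sqlite3


def open_shared_cache_connection(path, mmap_bytes=256 * 1024 * 1024, cache_kib=256):
    """Open a connection that leans on mmap instead of a large private cache."""
    conn = sqlite3.connect(path)
    # A negative cache_size is interpreted as KiB; a positive one as pages.
    conn.execute(f"PRAGMA cache_size = {-cache_kib}")
    # Request memory-mapped I/O for up to mmap_bytes of the database file.
    # The pragma reports the limit actually in effect, if it reports anything.
    row = conn.execute(f"PRAGMA mmap_size = {mmap_bytes}").fetchone()
    effective_mmap = row[0] if row else None
    return conn, effective_mmap
```

Each HTTP request handler would call this once per connection. The mapped pages live in the kernel page cache, so they are shared by all connections rather than duplicated per connection.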

With mmap you also don't have to worry about committing too much system memory: if another application needs it, the kernel will start evicting your cache.

You're right about that.

Linux needs a way for userspace processes to participate in the kernel's shrinker system for reclaiming memory under pressure. Watching memory PSI is too coarse. MADV_FREE is too complicated and indiscriminate. You could imagine a notification FD, but then you've just reinvented PSI. You could imagine a synchronous signal, but everyone hates signals and won't couple any new functionality to them.

Shrinker-BPF attached to a memfd perhaps? A BPF shrinker could not only choose which pages to evict in a non-stupid way, but could notify userspace in some sane manner (e.g. setting a bitmask somewhere) that it's done so.

(Zero-fill as "notification" is insane and doesn't actually work because zero is a perfectly valid value in a lot of contexts.)

> I've never understood the fascination some people have with mmap.

Uncommonly used system calls give user-space programmers the sensation of learning something.

> Why would you want the kernel to do these things for you? It'll do a worse job: it has less information than you do and has to use blunt heuristics that work sort-of-good-enough for the whole world, not just your program.

Yes, you're opting into non-determinism you don't control. When resources get constrained, not everything can be in memory, and someone asks you why the database sucks, all you'll be able to do is shrug. Anyone who builds critical systems would never rely on the kernel making decisions like this. Don't use LMDB for anything that matters.

You're already depending on the OS for many other things. Depending on it for page caching is just one more thing.

This level of reasoning is insufficient when building reliable systems. The consequences of depending on the OS for page caching are different than the consequences of depending on it for device drivers, file systems, or scheduling.

Nonsense. The best you will ever do, even with full application knowledge and complete control of the machine, is an LRU cache replacement algorithm. But when you do it yourself you have to juggle the fine details of which indices to prioritize, and you will never get it perfect. If you're not running a dedicated machine, as soon as any other processes run all your careful tuning goes out the window.

Since LMDB manages multiple tables as a tree of trees, no fine tuning is needed. The internal paths to every hot page automatically take priority, regardless of which index or how large each index is. So a simpleminded LRU always makes optimal use of available cache, regardless of access pattern or other load on the system.

First let me just say that while it's possible to interpret my original comment as uniquely applying to LMDB (or databases with similar page cache designs), in practice it applies to all general purpose databases including PostgreSQL and SQLite. This is because all general purpose databases will eventually fall short when it comes to tweaking behavior to meet application specific requirements, customizations notwithstanding. So to the extent that one should not use LMDB for anything that matters, one should also not use PostgreSQL or SQLite for anything that matters. If that corollary appears false in your frame of reference, then my statement about LMDB should also be false.

For high-stakes applications, you will have to maintain your own database code (either original or derived from an existing database), and that database code will need its own page caching layer (or a patched kernel). A generic page caching system (whether in-kernel with mmap or out of kernel) will not do. I acknowledge most applications don't operate in this regime.

> The best you will ever do, even with full application knowledge and complete control of the machine, is an LRU cache replacement algorithm.

This is not true. Applications often have specific high-priority data which should always exist in memory. That may be a moot point because you can do mlock() with mmap(). If we focus only on general-purpose caching, then even in that case there are many alternatives to LRU. SIEVE and ARC are two notable alternatives that perform significantly better for certain data. An application developer should be able to experiment with different general purpose caching strategies for different types of data; mmap() does not afford this.
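To make the SIEVE mention concrete, here is a small sketch of the eviction policy as I understand it: a FIFO queue, one "visited" bit per entry, and a hand that sweeps from oldest to newest. The class name `SieveCache` and the plain-list storage are simplifications I chose for readability; a real implementation would use a linked list to avoid the O(n) removals.

```python
class SieveCache:
    """Toy SIEVE cache that tracks keys only (no values)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.order = []    # index 0 = oldest entry, end of list = newest
        self.visited = {}  # key -> visited bit
        self.hand = 0      # index of the next eviction candidate

    def access(self, key):
        """Return True on a hit, False on a miss (the key is then inserted)."""
        if key in self.visited:
            # Unlike LRU, a hit does not move the entry; it only sets a bit.
            self.visited[key] = True
            return True
        if len(self.order) >= self.capacity:
            self._evict()
        self.order.append(key)
        self.visited[key] = False
        return False

    def _evict(self):
        # Sweep toward newer entries, giving visited entries a second chance.
        while self.visited[self.order[self.hand]]:
            self.visited[self.order[self.hand]] = False
            self.hand = (self.hand + 1) % len(self.order)
        victim = self.order.pop(self.hand)
        del self.visited[victim]
        # After the pop, the hand already points at the next-newer entry;
        # if it ran past the newest one, wrap around to the oldest.
        if self.hand >= len(self.order):
            self.hand = 0
```

With your own buffer pool, swapping this in for an LRU is a local code change. With mmap, the replacement policy is whatever the kernel ships.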

Thank you Mr. Chu for your contributions to the technology commons and humanity in general.

> The best you will ever do, even with full application knowledge and complete control of the machine, is an LRU cache replacement algorithm

First of all, even the kernel can do better than simple LRU. We have MGLRU now for example. That said, the kernel is at a structural disadvantage.

A general purpose eviction and prefetch algorithm is like an automatic transmission on a car. It can react only to what it's seen.

When you drive stick, you can react to what you can see on the road ahead of you. A database has a query plan. It can see the future as well as remember the past. It has more information than the kernel.

> So a simpleminded LRU always makes optimal use of available cache, regardless of access pattern or other load on the system

That cannot be true. If I have a random access pattern, LRU will perform no better than random. If I have a future-oracle, I can just evict what's most distant in my set of future accesses.

Regardless of whether you're right about the suitability of LRU for this or that workload, it's simply false, mathematically, from a computer science POV, that LRU is optimal.
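A quick way to check this for yourself: count hits for LRU and for the clairvoyant "evict whatever is used furthest in the future" policy (Belady's MIN) on the same trace. The sketch below uses a cyclic scan rather than a random pattern because it is LRU's textbook worst case; the function names and the trace are my own choices.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count cache hits under least-recently-used eviction."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the least recently used
            cache[key] = True
    return hits


def belady_hits(trace, capacity):
    """Count cache hits when evicting the key whose next use is furthest away."""
    # next_use[i] = index of the next access to trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> index of its next use
    hits = 0
    for i, key in enumerate(trace):
        if key in cache:
            hits += 1
        elif len(cache) >= capacity:
            victim = max(cache, key=cache.get)
            del cache[victim]
        cache[key] = next_use[i]
    return hits


trace = [0, 1, 2, 3] * 5  # loop over 4 keys with room for only 3
print(lru_hits(trace, 3), belady_hits(trace, 3))
```

On this trace LRU gets 0 hits out of 20 accesses, because it always evicts exactly the key that is needed next, while the clairvoyant policy gets 11. A query plan is not a perfect oracle, but it sits somewhere between the two.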

And if you go around making confidently wrong claims like this, one must wonder what else you are wrong about. If you want to be disagreeable in public, fine: just make sure you have math on your side first.

In the time it takes for your query optimizer to dissect a query and "look ahead" LMDB would have already answered a million queries. You think your magical "future oracle" is zero cost? How many KLOCs is it? LMDB's hot paths fit entirely inside a CPU's L1 cache.

The OS handles all of that transparently, without requiring any additional code. I think that is the draw.

And that's adequate for casual programs. LMDB is big and serious enough to warrant the extra complexity (which, to be fair, is significant) of userspace buffer management. LMDB does the work once and all users benefit.

Obligatory: https://db.cs.cmu.edu/mmap-cidr2022/

Consensus says "don't do it" ...

That said, having written my own buffer pool, paging, etc.: in pure naive benchmarks it's actually kinda hard to beat mmap. And LMDB is really fast for what it is.

In real world workflows I think the story is more complicated. Especially under higher concurrency.

Obligatory "that paper is garbage" https://www.symas.com/post/are-you-sure-you-want-to-use-mmap...

I read the linked post. You're not making a good argument.

The authors aren't arguing that a mmap database is worse because it's "more complex". They are arguing it must work with less information. You haven't refuted the original paper, but you have made me more skeptical of LMDB.

For example, you claim that applications "never" have control of memory. That's simply, again, false. We have explicit memory eviction and pinning operations. We even have VA-batched TLB shootdown IPIs via process_madvise. On some systems (AMD, soon Intel) we can do TLB invalidation without an IPI.

So no, you're just wrong in making the claim that you might as well use mmap because you can't control the memory lifecycle anyway. You absolutely can, and anyone reading this message can look up the relevant APIs for himself.
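As one small illustration of explicit eviction, here is a sketch using madvise(MADV_DONTNEED) through Python's mmap wrapper. It assumes Linux and Python 3.8+, and it uses a private anonymous mapping so the effect is easy to observe; the function name `evict_and_reread` is made up for the example. Pinning (mlock) is not shown because Python's standard library does not wrap it.

```python
import mmap


def evict_and_reread(size=mmap.PAGESIZE):
    """Fill a private anonymous mapping, drop its pages, and read it again."""
    # With fileno -1 Python creates an anonymous mapping; MAP_PRIVATE makes
    # MADV_DONTNEED actually discard the contents.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)
    try:
        buf[:] = b"x" * size
        before = buf[:4]
        # Tell the kernel these pages can be dropped right now.
        buf.madvise(mmap.MADV_DONTNEED)
        # The next touch faults in fresh zero-filled pages.
        after = buf[:4]
        return before, after
    finally:
        buf.close()
```

For a file-backed mapping the next touch would refault the data from the file instead of returning zeros.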

And you point to LMDB's benchmarks repeatedly as evidence you're right. That's not saying what you think it is. LMDB is fast despite being hobbled by vanilla kernel mmap. Yes, that means other databases are probably doing stupid things, but reverse stupidity is not intelligence.

You're dreaming. None of your explicit memory control operations mean anything in practice, because today everything runs in VMs with no actual control of the underlying hardware. Probably co-resident with an unknown number of other tenants.

As for what you claim the paper's authors were saying - I quoted their text verbatim. Your interpretation is not what they said.

They claimed using mmap safely is impossible, and using it correctly requires more complexity than a traditional DB design. The safety claim was already disproven by multiple researchers. To prove their second claim they would have had to produce a DB that did traditional buffer management and was simpler and more performant than using mmap. They never did any such thing, nor could they.

"we don't have to pay back any vulture capitalists"! a good one