Thought about zero-copy IPC recently. To avoid memcpy along the complete chain, I guess it would be best if the sender allocates its payload directly on the shared memory when it’s created. Is this a standard thing in such optimized IPC, and which libraries offer it?
IPC libraries often specifically avoid zero-copy for security reasons. If a malicious message sender can modify the message while the receiver is in the middle of parsing it, you have to be very careful not to enable time-of-check-time-of-use attacks. (To be fair, not all use cases need to be robust against a malicious sender.)
On Linux, that's exactly what `memfd` seals are for.
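Not authoritative, but roughly what that looks like (Linux-only; glibc 2.27+ for the `memfd_create` wrapper; error handling omitted):

```cpp
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Sender: create a memfd, fill it, then seal it against any further writes.
int make_sealed_payload(const void *data, size_t len) {
    int fd = memfd_create("payload", MFD_ALLOW_SEALING);
    ftruncate(fd, static_cast<off_t>(len));

    // Fill through a temporary mapping, then drop it: F_SEAL_WRITE is
    // refused while any writable mapping of the memfd still exists.
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    std::memcpy(p, data, len);
    munmap(p, len);

    // After this, nobody (including the sender) can resize or modify it.
    fcntl(fd, F_ADD_SEALS,
          F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL);
    return fd;  // hand to the receiver over a Unix socket via SCM_RIGHTS
}

// Receiver: refuse to parse unless the seals actually guarantee immutability.
bool safely_sealed(int fd) {
    int seals = fcntl(fd, F_GET_SEALS);
    return (seals & (F_SEAL_WRITE | F_SEAL_SHRINK))
        == (F_SEAL_WRITE | F_SEAL_SHRINK);
}
```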
That said, even without seals, it's often possible to guarantee that you only read the memory once; in this case, even if the memory is technically mutating after you start, it doesn't matter since you never see any inconsistent state.
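The read-once discipline is mostly about avoiding double fetches. A sketch, assuming a hypothetical length-prefixed framing; the important part is that every value is copied out of the shared region exactly once and all validation happens on the local copy:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical frame layout in shared memory: [u32 len][payload...].
// Never dereference the shared region twice for the same value, or the
// sender (or the compiler re-reading it) can hand you two different answers.
bool read_frame_once(const uint8_t *shm, size_t shm_size,
                     uint8_t *out, size_t out_cap, size_t *out_len) {
    uint32_t len;
    if (shm_size < sizeof(len)) return false;
    std::memcpy(&len, shm, sizeof(len));        // single fetch of the header
    if (len > out_cap || len > shm_size - sizeof(len)) return false;
    std::memcpy(out, shm + sizeof(len), len);   // single pass over the payload
    *out_len = len;
    return true;
}
```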
It is very easy for zero-copy IPC using sealed memfds to be massively slower than just copying, because of the cost of the TLB shootdown on munmap. To see a benefit over just writing into a pipe, you'd likely need to be sending gigantic blobs, mapping them in both the reader and the writer into an address space that isn't shared with any other threads that are doing anything, and deferring and batching the munmaps (and Linux doesn't really give you a way to do this, aside from mapping them all into consecutive pages with MAP_FIXED and unmapping multiple mappings with a single munmap call).
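To illustrate that batching trick (a sketch under assumed names, not a benchmark; error handling omitted): reserve one contiguous region up front, MAP_FIXED each incoming buffer into a slot, and tear the whole batch down with one munmap:

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

struct MapBatch {
    uint8_t *base;
    size_t   slot_size;   // must be page-aligned
    size_t   slots;
    size_t   used;
};

// Reserve one contiguous, inaccessible region to carve slots out of.
MapBatch batch_reserve(size_t slot_size, size_t slots) {
    void *base = mmap(nullptr, slot_size * slots, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return {static_cast<uint8_t *>(base), slot_size, slots, 0};
}

// MAP_FIXED replaces the reservation at this exact address with the fd.
void *batch_map(MapBatch &b, int fd) {
    if (b.used == b.slots) return nullptr;
    void *slot = b.base + b.used++ * b.slot_size;
    return mmap(slot, b.slot_size, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);
}

// One munmap, one TLB shootdown, for the whole batch.
void batch_release(MapBatch &b) {
    munmap(b.base, b.slot_size * b.slots);
    b.base = nullptr;
    b.used = 0;
}
```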
Any realistic high-performance zero copy IPC mechanism needs to avoid changing the page tables like the plague, which means things like memfd seals aren't really useful.
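The usual way to get there is to map one fixed-size region at startup and never touch the page tables again, e.g. a single-producer/single-consumer ring living inside the shared mapping. A rough sketch (layout and sizes are illustrative; the `shm_open` + `mmap` setup and message framing are left out, and it assumes `std::atomic<uint64_t>` is lock-free, which holds on mainstream 64-bit targets):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Lives at the start of the shared mapping; mapped once by both sides.
struct Ring {
    std::atomic<uint64_t> head;   // written only by the producer
    std::atomic<uint64_t> tail;   // written only by the consumer
    static constexpr size_t kCap = 1 << 20;  // power of two
    uint8_t data[kCap];
};

bool ring_push(Ring *r, const uint8_t *src, size_t n) {
    uint64_t h = r->head.load(std::memory_order_relaxed);
    uint64_t t = r->tail.load(std::memory_order_acquire);
    if (h - t + n > Ring::kCap) return false;          // full
    for (size_t i = 0; i < n; ++i)
        r->data[(h + i) % Ring::kCap] = src[i];
    r->head.store(h + n, std::memory_order_release);   // publish
    return true;
}
```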
Thanks for the reference! I had been wondering if there was a way to do this on Linux for years. https://lwn.net/Articles/591108/ seems to be the relevant note?
What's the threat model where a malicious message sender has write access to shared memory?
When you are using the shared memory to communicate with an untrusted sender. Examples might include:
- browser main processes that don't trust renderer processes
- window system compositors that don't trust all windowed applications, and vice versa
- database servers that don't trust database clients, and vice versa
- message queue brokers that don't trust publishers and subscribers, and vice versa
- userspace filesystems that don't trust normal user processes
How would someone send a message over shared memory without write access to that memory?
I think he meant: what's the scenario where you're using IPC via shared memory and don't trust both processes? Basically it only applies when the processes are running as two different users. (I think Android does that a lot?)
> I guess it would be best if the sender allocates its payload directly on the shared memory when it’s created.
On an SMP system yes. On a NUMA system it depends on your access patterns etc.
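One NUMA knob (hedged: whether it helps really does depend on the access pattern) is to bind the not-yet-touched shared pages to the node where the heaviest reader or writer runs. A sketch using libnuma's `mbind` wrapper:

```cpp
#include <cstddef>
#include <numaif.h>   // from libnuma; link with -lnuma

// Bind pages (before first touch) so they are allocated on the given node.
// Pages already faulted in stay put unless you also pass MPOL_MF_MOVE.
bool bind_to_node(void *addr, size_t len, int node) {
    unsigned long nodemask = 1UL << node;
    return mbind(addr, len, MPOL_BIND, &nodemask,
                 8 * sizeof(nodemask), 0) == 0;
}
```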
I've been meaning to look at Iceoryx as a way to wrap this.
PyTorch multiprocessing queues work this way, but it's hard for the sender to ensure the data is already in shared memory, so there is often a copy. It's also common for buffers not to be reused, which can end up a bottleneck, though in principle it can be limited by the rate of sending fds.
I've looked into this a bit - the big blocker isn't the transport/IPC library but the serializer itself, assuming you _also_ want to support serializing messages to disk or over the network. It's a bit of a pickle - at least in C++, tying an allocator to a structure and its children is an ugly mess. And what happens if you resize a string? Does that mean a whole new allocation? I've (partially) solved it before for single-process IPC by having the concept of a sharable structure and its serialization type; you could do the same for shared memory. One could also use a serializer that offers promises around allocations - FlatBuffers might fit the bill (sketch below). There's also https://github.com/Verdant-Robotics/cbuf but I'm not sure how well maintained it is right now, publicly.
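On the FlatBuffers idea, a hypothetical sketch of what that could look like: `flatbuffers::FlatBufferBuilder` accepts a custom `flatbuffers::Allocator`, so you can point it at a region you've already mapped shared (e.g. a memfd). This single-buffer version deliberately forbids growth, so the region has to be sized generously up front:

```cpp
#include <cstdint>
#include <cstdlib>
#include "flatbuffers/flatbuffers.h"

// Hands the builder one pre-mapped shared region; serialization then happens
// directly in shared memory, with no later copy into the segment.
class ShmAllocator : public flatbuffers::Allocator {
 public:
    ShmAllocator(uint8_t *base, size_t cap) : base_(base), cap_(cap) {}
    uint8_t *allocate(size_t size) override {
        if (used_ || size > cap_) std::abort();  // no growth in this sketch
        used_ = true;
        return base_;
    }
    void deallocate(uint8_t *, size_t) override { used_ = false; }
 private:
    uint8_t *base_;
    size_t cap_;
    bool used_ = false;
};

// Usage sketch (shm_base/shm_size come from your shared mapping):
//   ShmAllocator alloc(shm_base, shm_size);
//   flatbuffers::FlatBufferBuilder fbb(shm_size, &alloc);
//   ...build the message...
//   // FlatBuffers builds back-to-front: GetBufferPointer() lands somewhere
//   // inside the region, so share that offset along with the fd.
```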
As for allocation - it looks like Zenoh might offer the allocation pattern necessary. https://zenoh-cpp.readthedocs.io/en/1.0.0.5/shm.html TBH most of the big wins come from not copying big blocks of memory around, e.g. sensor data. A thin header plus a reference to a block of shared memory containing an image or point cloud, sent over a Unix domain socket, is likely more than performant enough for most use cases. Again, the big win is not having to serialize/deserialize the sensor data.
Another pattern which I haven't really seen anywhere is handling multiple transports - at one point I had the concept of setting up one transport as an allocator (to put data into shared memory or the like): serialize once into shared memory, then hand that serialized buffer to your network transport(s) or your disk writer. It's not quite zero-copy, but in practice most "zero copy" is actually at least one copy on each end.
(Sorry, this post is a little scatterbrained, hopefully some of my points come across)
This is one of mmap's designed-for use cases. Look at DPDK maybe.
Boost.Interprocess:
https://www.boost.org/doc/libs/1_46_0/doc/html/interprocess/...
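For completeness, the pattern from the original question in Boost.Interprocess (segment and object names here are placeholders; cleanup via `shared_memory_object::remove` omitted):

```cpp
#include <cstdint>
#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>

namespace bip = boost::interprocess;

// The sender constructs its payload directly in the managed segment,
// which is exactly the allocation pattern the question asks about.
using ShmAlloc  = bip::allocator<uint8_t,
                                 bip::managed_shared_memory::segment_manager>;
using ShmVector = bip::vector<uint8_t, ShmAlloc>;

int main() {
    bip::managed_shared_memory seg(bip::create_only, "demo_segment", 1 << 20);

    // Allocated in shared memory from the start; no later copy into it.
    ShmVector *payload = seg.construct<ShmVector>("payload")(
        seg.get_segment_manager());
    payload->assign({1, 2, 3, 4});

    // A receiver opens the same segment and finds the object by name:
    //   bip::managed_shared_memory seg(bip::open_only, "demo_segment");
    //   auto *p = seg.find<ShmVector>("payload").first;
}
```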