Solaris had a unified page cache, and the ARC existed separately, alongside it, there as well.

One huge problem with ZFS is that there is no zero copy, due to the ARC wart. E.g., if you're doing sendfile() from a ZFS filesystem, every byte you send is copied into a network buffer; but if you're doing sendfile() from a UFS filesystem, the pages are just loaned to the network.

This means that on the Netflix Open Connect CDN, where we serve close to the hardware limits of the system, we simply cannot use ZFS for video data due to ZFS basically doubling the memory bandwidth requirements. Switching from UFS to ZFS would essentially cut the maximum performance of our servers in half.

I also imagine you wouldn't benefit from ZFS there, even if the ARC weren't an issue. You have a single application and can presumably accept occasional data loss (just re-fetch the content upstream). You'd still need to handle bitrot detection, but there are ways to handle that application-side.
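For illustration, here's a minimal sketch of what application-side integrity checking could look like, assuming each content file has a precomputed SHA-256 digest that the application already knows (the helper and the use of OpenSSL's EVP API are my own example, not anything CDN-specific): hash the file as it's read and re-fetch from upstream on a mismatch.

    /* Hedged sketch: verify a content file against a known SHA-256 digest.
     * "expected" would come from a catalog or sidecar file in a real system
     * (hypothetical here). */
    #include <openssl/evp.h>
    #include <stdio.h>
    #include <string.h>

    static int
    sha256_matches(const char *path, const unsigned char expected[32])
    {
        unsigned char digest[EVP_MAX_MD_SIZE];
        unsigned int dlen = 0;
        unsigned char buf[1 << 16];
        size_t n;
        FILE *fp = fopen(path, "rb");
        EVP_MD_CTX *ctx = EVP_MD_CTX_new();

        if (fp == NULL || ctx == NULL)
            goto fail;
        if (EVP_DigestInit_ex(ctx, EVP_sha256(), NULL) != 1)
            goto fail;
        while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
            if (EVP_DigestUpdate(ctx, buf, n) != 1)
                goto fail;
        if (EVP_DigestFinal_ex(ctx, digest, &dlen) != 1)
            goto fail;
        EVP_MD_CTX_free(ctx);
        fclose(fp);
        /* On mismatch the caller would discard this copy and re-fetch upstream. */
        return dlen == 32 && memcmp(digest, expected, 32) == 0;
    fail:
        if (ctx) EVP_MD_CTX_free(ctx);
        if (fp) fclose(fp);
        return 0;
    }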

Better to have the filesystem get out of the way and focus on being good at raw I/O scheduling.

I wonder if FreeBSD is going to get something io_uring-esque. That's one of the more interesting developments in kernel space...
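For comparison, this is roughly the shape of the io_uring model on Linux with liburing (a single read, error handling elided): submissions and completions are batched through shared rings rather than one syscall per I/O.

    /* Minimal liburing sketch (Linux): queue one read, reap its completion. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[4096];
        int fd = open("/etc/hosts", O_RDONLY);

        io_uring_queue_init(8, &ring, 0);        /* 8-entry SQ/CQ rings */

        sqe = io_uring_get_sqe(&ring);           /* grab a submission slot */
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);                  /* one syscall, N submissions */

        io_uring_wait_cqe(&ring, &cqe);          /* block for a completion */
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }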

There are benefits to ZFS for spinning drives. E.g., a metadata-only L2ARC on an NVMe drive, with the data coming from a spinning drive, would likely perform better than UFS. This is because with UFS the head has to move around at times to read metadata, whereas with ZFS and the metadata cached on NAND, the drive could, in the ideal case, just read data.
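As a rough sketch of that setup (pool, dataset, and device names here are placeholders), the cache vdev goes on the NVMe device and the dataset is told to keep only metadata in the L2ARC:

    # hypothetical names; nda0 is a FreeBSD NVMe device node
    zpool add tank cache nda0
    zfs set secondarycache=metadata tank/content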

FreeBSD has "fire and forget" behavior in the context of several common hot paths, so the need for io_uring is less urgent. Eg, sendfile is "fire and forget" from the application's perspective. If data is not resident, the network buffers are pre-allocated and staged on the socket buffer. When the io completes, the disk interrupt handler then flips the pre-staged buffers (whose pages now contain valid data) to "ready" and pokes the TCP state machine.

Similarly, FreeBSD has an OpenBSD-inspired socket splice, which is fire-and-forget once two sockets are spliced.
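The interface is a setsockopt() on one of the sockets; the sketch below shows the OpenBSD form, and the exact argument layout differs on FreeBSD, so check setsockopt(2) there for the SO_SPLICE details.

    /* Sketch of OpenBSD-style socket splicing: after this call the kernel moves
     * data arriving on so_in straight to so_out, with no further read()/write()
     * from the application. */
    #include <sys/types.h>
    #include <sys/socket.h>

    static int
    splice_sockets(int so_in, int so_out)
    {
        return setsockopt(so_in, SOL_SOCKET, SO_SPLICE, &so_out, sizeof(so_out));
    }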

For filesystems where the in-memory cache is insufficient, maybe a generic ephemeral inode/dentry cache system - an L2ARC without the ZFS-isms - would be useful...

But to be fair, we are approaching the point where spinning rust stops making sense even for the remaining use cases, so designing new optimizations specifically for it might be a bit silly now.