Interesting article, but it compares apples to a fruit stand: The approach could be improved by comparing Capsicum to using seccomp in the same way.
Sometime ago I wrote a library for a customer that did exactly that: Open a number of resources, e.g., stdin, stdout, stderr, a pipe or two, a socket or two, make the seccomp calls necessary to restrict the use of read/write/etc. to the associated file descriptors, then lock out all other system calls - which includes seccomp-related calls.
Basically, the library took a very Capsicum-like approach of whitelisting specific actions then sealing itself against further changes.
This is a LOT of work, of course, and the available APIs don't make it particularly easy or elegant, but it is definitely doable. I chose this approach because the docker whitelist approach was far too open ended and "uncurated", if you will, for the use-case we were targeting.
In this particular case, I was aided by the fact the library was written to support the very specific use-case of filters running in containers using FIFOs for IPC, logging, and reporting: Every filter saw exactly the same interfaces to the world, so it was relatively easier to lock things down.
Having said that, I wish Linux had a Capsicum-equivalent call, or, even better for the approach I took, a friendlier way to whitelist specific calls.
A problem with that approach is that libc can after an upgrade decide to start doing syscalls you were not expecting. Like the first time you call `printf()` it calls `newfstatat()`. Only the first time. Maybe in the future it'll call it more often than that, and then your binary breaks.
I'm not sure what glibc's latest policy is on linking statically, but at least it used to be basically unsupported and bugs about it were ignored. But even if supported, you can't know if it under some configurations or runtime circumstances uses dlopen for something.
Or maybe once you juggle more than X file descriptors some code switches from using `poll()` to using `select()` (or `epoll()`).
My thoughts last time I looked at seccomp: https://blog.habets.se/2022/03/seccomp-unsafe-at-any-speed.h...
This is a problem but fwiw libc's should be falling back to old system calls. You can block clone3 today and see that your libc will fall back to clone.
> A problem with that approach is that libc can after an upgrade decide to start doing syscalls you were not expecting.
That would break capsicum, too, so I don’t see how that’s a problem when “comparing Capsicum to using seccomp in the same way”.
That's the approach I meant by "that approach", the library the parent commenter was talking about writing for a customer. Compare this to Landlock or OpenBSDs pledge/unveil.
Now that Landlock actually is a thing, have you considered writing another followup? Given what I've seen of landlock, I expect it'll be spicy...
I took the bait.
“The goal of Landlock is to enable restriction of ambient rights (e.g. global filesystem or network access) for a set of processes. Because Landlock is a stackable LSM [(Linux Security Model)], it makes it possible to create safe security sandboxes as new security layers in addition to the existing system-wide access-controls. ... Landlock empowers any process, including unprivileged ones, to securely restrict themselves.”
https://docs.kernel.org/userspace-api/landlock.html
I've actually found it pretty fine. It doesn't have full coverage, but they have a system of adding coverage (ABI versions), and it covers a lot of the important stuff.
The one restriction I'm not sure about is that you can't say "~/ except ~/.gnupg". You have to actually enumerate everything you do want to allow. But maybe that's for the best. Both because it mandates rules not becoming too complex to reason about, and because that's a weird requirement in general. Like did you really mean to give access to ~/.gnupg.backup/? Probably not. Probably best to enumerate the allowlist.
And if you really want to, I guess you can listdir() and compose the exhaustive list manually, after subtracting the "except X".
I find seccomp unusable and not fit for purpose, but landlock closes many doors.
Maybe you know better? I'd love to hear your take.
I definitely don't know better, and after taking a few more looks at landlock, I'm not even sure what my objections were, probably got it confused with something else entirely. Confusion and ignorance on my part I guess.
You can make seccomp mimic Capsicum by whitelisting syscalls and checking FD arguments with libseccomp, but that quickly becomes error prone once you factor in syscall variants and helper calls. Read and write take the FD as arg0 while pread and pwrite shift it, and sendfile, splice and io_uring change semantics, and ioctl or fcntl can defeat naive filters, so you wind up with a huge BPF program and still miss corner cases.
Capsicum attaches rights to descriptors and gives kernel enforced primitives like cap_enter and cap_rights_limit, so delegation is explicit and easier to reason about. If you want Linux parity, use libseccomp to shrink the syscall surface, combine it with mount and user namespaces and Landlock for filesystem constraints, and design your app around FD based delegation instead of trying to encode every policy into BPF.