Open Bug 1603307 Opened 5 years ago Updated 2 years ago

Migrate from SECCOMP_RET_TRAP to SECCOMP_RET_USER_NOTIF

Categories

(Core :: Security: Process Sandboxing, task, P5)

Unspecified
Linux
task

People

(Reporter: jld, Assigned: jld)

Currently, for sandboxing functionality that seccomp-bpf can't handle internally, we use SECCOMP_RET_TRAP to raise SIGSYS and have the signal handler take the place of the syscall implementation, emulating the syscall either in-process (like returning a pre-opened fd for open in GMP) or by messaging another process (like the file broker used for open in content processes).

However, there are problems with this approach: if SIGSYS is blocked, the kernel will kill the process instead, and even if we prevent that, it's possible that the context genuinely wouldn't have been safe for running the signal handler (if the stack is too small or the stack pointer is invalid, if language runtime state like thread-local storage is inconsistent, etc.). See bug 1600574 for an example of this, and discussion on the glibc mailing list has explored the larger space of things that could go wrong.

The least bad solution seems to be to use SECCOMP_RET_USER_NOTIF, a new feature in kernel 5.0 where the thread making the syscall is stopped while another process (via a special file descriptor associated with the policy) is notified and can inspect the arguments, take arbitrary actions, and supply a return value. It's the same basic idea of a virtual syscall handler in userspace, but in this case it's in another process; the glibc mailing list thread mentioned above has a longer explanation.
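To make the notify/respond exchange concrete, here is a minimal Rust sketch of the two records that cross the notifier fd. The field layouts mirror struct seccomp_data, struct seccomp_notif and struct seccomp_notif_resp from <linux/seccomp.h>; the Rust type names and the helpers deny and succeed are hypothetical and only for illustration. The ioctl calls that actually receive and send these records are left out.

```rust
use std::mem::size_of;

/// Mirrors `struct seccomp_data`: the syscall number, the audit
/// architecture value, the instruction pointer and six raw arguments.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SeccompData {
    pub nr: i32,
    pub arch: u32,
    pub instruction_pointer: u64,
    pub args: [u64; 6],
}

/// Mirrors `struct seccomp_notif`: what the agent reads from the notifier fd.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SeccompNotif {
    pub id: u64,    // cookie identifying this blocked syscall
    pub pid: u32,   // thread id of the blocked caller
    pub flags: u32,
    pub data: SeccompData,
}

/// Mirrors `struct seccomp_notif_resp`: what the agent writes back.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SeccompNotifResp {
    pub id: u64,    // must echo the id of the notification being answered
    pub val: i64,   // return value when `error` is 0
    pub error: i32, // 0, or a negated errno value
    pub flags: u32,
}

pub const EPERM: i32 = 1;

/// Hypothetical helper: make the blocked syscall fail with `errno`.
pub fn deny(notif: &SeccompNotif, errno: i32) -> SeccompNotifResp {
    SeccompNotifResp { id: notif.id, val: 0, error: -errno, flags: 0 }
}

/// Hypothetical helper: make the blocked syscall return `val`.
pub fn succeed(notif: &SeccompNotif, val: i64) -> SeccompNotifResp {
    SeccompNotifResp { id: notif.id, val, error: 0, flags: 0 }
}

/// Sanity check that the Rust layouts have the sizes the kernel ABI expects.
pub fn layouts_match_kernel_abi() -> bool {
    size_of::<SeccompData>() == 64
        && size_of::<SeccompNotif>() == 80
        && size_of::<SeccompNotifResp>() == 24
}
```

A real agent would loop on the notifier fd, receive a SeccompNotif, run policy on data.nr and data.args, and send back one of these responses.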

One risk here is that we'll need to use ptrace-like features (probably /proc/self/mem as opened by the sandboxed process, or process_vm_{read,write}v) to access memory parameters, and we'll probably run into distributions or container systems or local configuration that breaks that.
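As a rough illustration of the memory-access step, the sketch below reads a NUL-terminated string (for example, the path argument of open) from anything that supports Read and Seek, which is how a /proc/<pid>/mem file behaves when the offset is the target's virtual address. The function name read_c_string and the length limit are assumptions for this example, not existing Firefox code; a process_vm_readv-based version would look different.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Read a NUL-terminated byte string starting at `addr`, giving up after
/// `max_len` bytes. `mem` is expected to be something like an opened
/// /proc/<pid>/mem file; any `Read + Seek` source works.
pub fn read_c_string<M: Read + Seek>(
    mem: &mut M,
    addr: u64,
    max_len: usize,
) -> io::Result<Vec<u8>> {
    mem.seek(SeekFrom::Start(addr))?;
    let mut out = Vec::new();
    let mut buf = [0u8; 256];
    while out.len() < max_len {
        // Never request more than the remaining budget.
        let want = buf.len().min(max_len - out.len());
        let n = mem.read(&mut buf[..want])?;
        if n == 0 {
            break; // end of readable data before a terminator
        }
        if let Some(pos) = buf[..n].iter().position(|&b| b == 0) {
            out.extend_from_slice(&buf[..pos]);
            return Ok(out);
        }
        out.extend_from_slice(&buf[..n]);
    }
    Err(io::Error::new(
        io::ErrorKind::InvalidData,
        "no NUL terminator within the length limit",
    ))
}
```

The data read this way is untrusted and can change underneath the reader, so a real broker would validate it after copying, and would need a fallback for systems where opening or reading the memory file is refused.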

(A related subtlety is that the sandboxed process cannot ever have access to its own notifier fd after it becomes untrusted, because the newer feature SECCOMP_USER_NOTIF_FLAG_CONTINUE, introduced in what will become kernel 5.5, would let it force a blocked syscall to be allowed instead. It's arguably not a good idea to have the sandboxed process handle its own notifications on another thread, because of the possibility of deadlock (contrast SECCOMP_RET_TRAP being safely reentrant), but the existence of that CONTINUE feature — and anything in the future with a similar security model — makes it impossible.)

For compatibility with older kernels, we can use SECCOMP_RET_TRAP to implement similar functionality: send the context as a struct seccomp_data to the parent and receive the return value. As long as we never have to deal with a future SIGSYS-intolerant libc combined with a pre-5.0 kernel, this should suffice.
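A minimal sketch of what that compatibility message could look like, assuming the request is just the 64-byte struct seccomp_data in native byte order and the reply is a single signed 64-bit value using the raw Linux syscall convention (values from -4095 to -1 are negated errno). The wire format and the function names are assumptions made for illustration. The encoder avoids heap allocation, since it would run inside the SIGSYS handler.

```rust
/// Same fields as `struct seccomp_data` in <linux/seccomp.h>.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SeccompData {
    pub nr: i32,
    pub arch: u32,
    pub instruction_pointer: u64,
    pub args: [u64; 6],
}

/// Encode the request into a fixed 64-byte buffer (no allocation).
pub fn encode_request(d: &SeccompData) -> [u8; 64] {
    let mut out = [0u8; 64];
    out[0..4].copy_from_slice(&d.nr.to_ne_bytes());
    out[4..8].copy_from_slice(&d.arch.to_ne_bytes());
    out[8..16].copy_from_slice(&d.instruction_pointer.to_ne_bytes());
    for (i, arg) in d.args.iter().enumerate() {
        let off = 16 + i * 8;
        out[off..off + 8].copy_from_slice(&arg.to_ne_bytes());
    }
    out
}

// Copy 8 bytes starting at `off` and interpret them as a native-endian u64.
fn u64_at(b: &[u8; 64], off: usize) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&b[off..off + 8]);
    u64::from_ne_bytes(raw)
}

/// Decode a request on the parent side.
pub fn decode_request(b: &[u8; 64]) -> SeccompData {
    let mut word = [0u8; 4];
    word.copy_from_slice(&b[0..4]);
    let nr = i32::from_ne_bytes(word);
    word.copy_from_slice(&b[4..8]);
    let arch = u32::from_ne_bytes(word);
    let mut args = [0u64; 6];
    for (i, arg) in args.iter_mut().enumerate() {
        *arg = u64_at(b, 16 + i * 8);
    }
    SeccompData { nr, arch, instruction_pointer: u64_at(b, 8), args }
}

/// Split a raw reply into Ok(return value) or Err(errno).
pub fn decode_reply(raw: i64) -> Result<i64, i32> {
    if (-4095..=-1).contains(&raw) {
        Err((-raw) as i32)
    } else {
        Ok(raw)
    }
}
```

Reusing the seccomp_data shape means the parent-side handlers could, in principle, be shared between the trap-based path and the user-notify path.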

To return (or accept) file descriptors in this model, we'll need the sandboxed process to run a kind of ‘reverse broker’ that the parent process calls while handling the child process's other thread's syscall. Future kernels might let us do this directly, but we'll still need compatibility for pre-5.0 so it doesn't help much except as an optimization.

This is going to be a nontrivial project, and this bug will probably end up as a meta-bug: land a framework, then convert syscall handlers incrementally and pref them on for extended testing on nightly/beta, etc.

For implementation: in my opinion we should use Rust for this, because it's going to be new code that's mostly self-contained, and we're handling complex untrusted data in an unsandboxed process; see also the “Rule of 2”. (Yes, our existing file broker is in C++. It shouldn't be, and I would have argued to rewrite it when it gained complicated string parsing to deal with desktop, but oxidation wasn't quite ready yet.) The other alternative is to wait for Chromium to do something and hope that we can integrate it without too much pain, but they're less affected by this class of problem than we are so that may not be a good option.

Priority: -- → P5

User notify has an interesting deficiency, reported last year on LKML: if the target's syscall is interrupted by a signal, the usual EINTR things happen and the SECCOMP_IOCTL_NOTIF_SEND will fail but any other actions taken by the agent will still have happened. In particular, if the agent has given a file descriptor to the target, either with SECCOMP_IOCTL_NOTIF_ADDFD or in userspace with SCM_RIGHTS, it will be leaked unless the two work together to close it. (I'm using agent/target here instead of broker/client to match the terms used in the LKML post.)
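The bookkeeping this implies on the agent side can be sketched as follows. This is only a model of the decision, under the assumption that the agent remembers which fd number (if any) it installed in the target before attempting to send the response; the names FollowUp and after_send are hypothetical.

```rust
/// What the agent still owes the target once the send attempt has finished.
#[derive(Debug, PartialEq, Eq)]
pub enum FollowUp {
    /// The response was delivered, or no fd was ever handed over.
    Nothing,
    /// The syscall never returned to its caller, so the fd with this number
    /// in the target is orphaned and the target must be told to close it.
    AskTargetToClose(i32),
}

/// `fd_in_target` is the fd number installed in the target, if any.
/// `send_result` is Ok(()) if the response was accepted, or Err(errno) if
/// sending failed (for example because a signal interrupted the syscall).
pub fn after_send(fd_in_target: Option<i32>, send_result: Result<(), i32>) -> FollowUp {
    match (fd_in_target, send_result) {
        (Some(fd), Err(_)) => FollowUp::AskTargetToClose(fd),
        _ => FollowUp::Nothing,
    }
}
```

The close request does not have to be synchronous; it only has to reach the target eventually, before the leaked descriptors accumulate.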

Kernel 5.14 will add a feature to send a fd and return from the syscall atomically, but it hasn't been released yet; contrast 5.9 for basic addfd (2020-10-11), or 5.0 for just user notify without addfd (2019-03-03).

So, if we target 5.0 we'll need the full “reverse broker” strategy to handle calls like open, whereas if we target 5.9 we'll still need a way to tell the child process to close fds but won't need SCM_RIGHTS handling — and it can be done async, so we could use regular IPC instead of needing a custom mechanism and worrying about deadlock.
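The version thresholds discussed in these comments can be summarized as a small decision table. The sketch below is illustrative only: the enum and function names are made up, and a real implementation would more likely probe for each feature by trying it, since distributions backport features and a version number is not a reliable capability check.

```rust
/// Which mechanism the kernel version suggests, per the thresholds above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    /// Before 5.0: no user notify, fall back to SECCOMP_RET_TRAP.
    TrapCompat,
    /// 5.0 to 5.8: user notify, fds passed via the "reverse broker".
    NotifyWithReverseBroker,
    /// 5.9 to 5.13: addfd available, leaked fds closed asynchronously.
    NotifyWithAddfd,
    /// 5.14 and later: fd can be sent and the syscall completed atomically.
    NotifyWithAtomicAddfd,
}

/// Pick a strategy from a (major, minor) kernel version.
pub fn pick_strategy(major: u32, minor: u32) -> Strategy {
    let v = (major, minor); // tuples compare lexicographically
    if v >= (5, 14) {
        Strategy::NotifyWithAtomicAddfd
    } else if v >= (5, 9) {
        Strategy::NotifyWithAddfd
    } else if v >= (5, 0) {
        Strategy::NotifyWithReverseBroker
    } else {
        Strategy::TrapCompat
    }
}

/// Parse the leading "major.minor" of a release string such as
/// "5.10.0-8-amd64". Returns None if either number is missing.
pub fn parse_release(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor_digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = minor_digits.parse().ok()?;
    Some((major, minor))
}
```

For example, a release string of "5.10.0-8-amd64" parses to (5, 10) and maps to the addfd strategy with asynchronous cleanup.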

(As for compatibility, because my original story assumed the full reverse-broker approach: instead of trying to migrate the entire policy and maintain a universal compatibility mode, a lower-effort possibility is to convert the places where we have known problems with system libraries blocking signals, and live with that smaller amount of code duplication.)

Severity: normal → S3