Open Bug 986397 Opened 10 years ago Updated 2 years ago

[meta] defense in depth for the Linux sandbox (namespaces, chroot)

Categories: Core :: Security: Process Sandboxing, defect, P3 (x86_64, Linux)

Status: ASSIGNED

People: Reporter: danielmicay, Assigned: jld

References: Depends on 1 open bug, Blocks 1 open bug

Details: Keywords: meta

The ongoing work to implement a sandbox on Linux is using seccomp-bpf. This could be improved by adding a chroot and namespaces as an extra layer of protection. Imperfections could plausibly slip into the system call whitelist (for example through new kernel features or bugs), or a trusted file descriptor could be leaked into the sandbox.

Chromium's `chrome-sandbox` binary enters an empty chroot, a new process (pid) namespace and a new network namespace. This prevents filesystem access, sending signals to other processes, and networking, even without a system call whitelist in place.

The `chroot` call can be done by making a tiny binary calling `chroot("/usr/lib/firefox/empty_root_controlled_directory"); chdir("/")` with `CAP_SYS_CHROOT` set. It should not do anything fancy, as security vulnerabilities in the binary would allow processes to escape from a `chroot`.
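
A minimal sketch of the core of such a helper, in C. The function name `enter_empty_root` is hypothetical, and a real helper would presumably also have to drop `CAP_SYS_CHROOT` and exec the sandboxed program afterwards, which is not shown here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical helper core: confine the calling process to `empty_dir`.
 * Needs CAP_SYS_CHROOT. Returns 0 on success, -1 with errno set on failure.
 */
int enter_empty_root(const char *empty_dir)
{
    assert(empty_dir != NULL);

    if (chroot(empty_dir) != 0)
        return -1;

    /* Without this chdir, the old working directory would still point
     * outside the new root. */
    if (chdir("/") != 0)
        return -1;

    return 0;
}
```

Keeping the privileged part down to these two calls is the point: the less code runs with the capability, the less there is to get wrong.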

When full support for `CLONE_NEWUSER` becomes available in Linux distributions (`CONFIG_USER_NS`), an `unshare(CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET)` call will become possible without privileges. This could be treated as an optional feature: make the call, but do not treat failure as fatal. The other option would be adding the `CAP_SYS_ADMIN` capability (or `setuid`, much like `chrome-sandbox`) to call `unshare(CLONE_NEWPID|CLONE_NEWNET)`.
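
The "optional feature" variant could look roughly like the following C sketch. The function name `try_unshare_namespaces` and the log message are made up for illustration; only the `unshare` call and its flags come from the text above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: try to enter new user, pid and network namespaces.
 * Returns 1 if the call succeeded and 0 if it did not. Failure is reported
 * but deliberately not treated as fatal.
 */
int try_unshare_namespaces(void)
{
    if (unshare(CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET) == 0)
        return 1;

    fprintf(stderr, "namespace isolation unavailable: %s\n", strerror(errno));
    return 0;
}
```

Note that after a successful `CLONE_NEWPID` unshare the caller itself stays in its old pid namespace; only children it forks afterwards land in the new one.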
The Chromium suid sandbox code could be reused, much like the current shared usage of the seccomp wrangling code:

https://code.google.com/p/setuid-sandbox/

(The first few comments on bug 1041885 can mostly be disregarded; we can and will require seccomp-bpf for sandboxing, which means that there's no need to, e.g., work around glibc's lack of fdlopen(3) without being able to intercept open(2). Also, my apologies for not seeing this bug when it was filed; I didn't follow Core::Security then, and the category doesn't seem to be actively triaged.)

Depending only on seccomp-bpf is… not ideal — see bug 1066750 for a worked example of why — but the policy for media plugins is restrictive enough that it should work.  But, yes, using the OS's actual resource access isolation features would be great.

Interestingly, even if we have unprivileged user namespaces, we can't just replace the fork in process_util_linux with a clone: not only does this leave the child with the wrong idea of its own pid/tid, but it doesn't run pthread_atfork hooks… like the one our malloc (and probably most others, I'd guess) uses to allow itself to be called in forked children without deadlocking (strictly speaking not allowed by POSIX, but I'm guessing this is one of those pave-the-cowpath things that implementers can't get away with actually enforcing).  We also can't use unshare(CLONE_NEWUSER|CLONE_NEWPID) on a temporary thread to make a fork()ed child enter a new namespace, because only a single-threaded process can unshare the user namespace.  So we basically have to double-fork, and the intermediate process should exec to not retain lots of unwanted CoW memory from the parent: and now this sounds a lot like what the setuid sandbox already does.
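
A rough C sketch of the double-fork shape described above, under several assumptions: the names `run_in_new_pid_namespace`, `payload_is_pid1`, `NS_UNAVAILABLE` and `CHILD_FAILED` are hypothetical, the payload is a plain function rather than an exec of the real child, and the intermediate process does not exec either (so it still carries the parent's CoW memory, unlike what the paragraph recommends).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NS_UNAVAILABLE 127 /* hypothetical: namespaces could not be set up */
#define CHILD_FAILED   126 /* hypothetical: inner child died abnormally */

/* Wait for `pid`; return its exit code, or -1 if it did not exit normally. */
static int wait_exit_code(pid_t pid)
{
    int status;
    for (;;) {
        if (waitpid(pid, &status, 0) == pid)
            break;
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Example payload: exits 0 if it is pid 1 in its own namespace. */
int payload_is_pid1(void)
{
    return getpid() == 1 ? 0 : 1;
}

/* Fork an intermediate process, unshare there, then fork the real child. */
int run_in_new_pid_namespace(int (*payload)(void))
{
    assert(payload != NULL);

    pid_t intermediate = fork(); /* ordinary fork: atfork hooks still run */
    if (intermediate < 0)
        return -1;

    if (intermediate == 0) {
        /* Only the forking thread exists here, so the single-threaded
         * requirement for CLONE_NEWUSER is met. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) != 0)
            _exit(NS_UNAVAILABLE);

        pid_t leader = fork(); /* first child becomes pid 1 of the new ns */
        if (leader < 0)
            _exit(NS_UNAVAILABLE);
        if (leader == 0)
            _exit(payload());

        int rc = wait_exit_code(leader);
        _exit(rc < 0 ? CHILD_FAILED : rc);
    }

    return wait_exit_code(intermediate);
}
```

Calling `run_in_new_pid_namespace(payload_is_pid1)` should return 0 where unprivileged user namespaces are available and `NS_UNAVAILABLE` where they are not.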

(In the interest of completeness: the intermediate process could also exit, but that would reparent the actual child to init(8)… unless the parent uses PR_SET_CHILD_SUBREAPER, but then it becomes responsible for waiting on all orphaned descendants, including from NPAPI plugins or other code I'm not thinking of yet.  (Of course, if anything uses NSPR to start a child process, then bug 227246 comes into play and the zombie orphans will be collected, but I'd like to fix that bug some day.)  So this is probably not a good idea.)

So this raises the possibility of modifying the setuid sandbox to be a userns sandbox if run without root, and using it more or less as-is.  This has the nice advantage that distributions supporting kernels with no userns (Ubuntu 12.04's 3.2 series, maybe RHEL 5/6 and/or Debian wheezy if they pick up the ESR 38 branch) or which disallow unprivileged userns by default (Debian jessie/sid, Arch, …) could opt into namespace/chroot sandboxing that way.
Assignee: nobody → jld
Blocks: 1021232
I have user/pid/net namespaces for media plugins semi-working (it tends to deadlock in actual usage; see above re malloc), enough to break a bunch of things:

* GeckoChildProcessHost contains a reimplementation of the bad assert seen in https://crbug.com/21112; in general, the pid from the IPC hello message is now meaningless and should be ignored.

* The BroadcastSetThreadSandbox hack I wrote to start seccomp in a multithreaded process is reading a procfs from the outer namespace and tgkill()ing in the inner namespace.  I think I can get rid of it entirely, which would also help for integrating more of the Chromium support code than just CodeGen.  This needs its own bug.

* Debugging, because gdb tries to use thread ids it reads from the process's libpthread structures.  See https://lkml.org/lkml/2009/2/20/212, but it's actually not hard to launch a process into the namespace; my proof-of-concept is https://gist.github.com/jld/e2cdae1296d39e57523c, and it turns out that gdb has no problems attaching to pid 1, at least as long as it's not a descendant of pid 1.

Also, in the presence of double-forking as mentioned above:

* The parent process should waitid/waitpid on the pid it gets back from fork(), but if it tries to SIGKILL that process, it will leave the actual plugin-container running.  The right pid to kill is the one from SCM_CREDENTIALS on the IPC socket (except on Linux older than 2.6.35, where that's buggy, but we can decline to support those; Chromium does, now, for exactly this reason).
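
To make the kill/wait split concrete, here is a self-contained C sketch. Everything in it is a stand-in: `spawn_wrapped_child` plays the role of the launcher (a wrapper process that forks the real child and waits for it), the real child's pid is passed back over a pipe purely for the demo rather than via SCM_CREDENTIALS, and it assumes the wrapper exits once its child dies.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Demo launcher: returns the wrapper's pid and stores the real child's pid
 * in *real_pid. Returns -1 on failure. */
pid_t spawn_wrapped_child(pid_t *real_pid)
{
    int p[2];
    assert(real_pid != NULL);
    if (pipe(p) != 0)
        return -1;

    pid_t wrapper = fork();
    if (wrapper < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (wrapper == 0) {
        close(p[0]);
        pid_t inner = fork();
        if (inner < 0)
            _exit(1);
        if (inner == 0) {
            close(p[1]);
            for (;;)
                pause(); /* the "real" child: runs until killed */
        }
        if (write(p[1], &inner, sizeof inner) != (ssize_t)sizeof inner)
            _exit(1);
        close(p[1]);
        while (waitpid(inner, NULL, 0) < 0 && errno == EINTR) {
        }
        _exit(0);
    }

    close(p[1]);
    ssize_t n = read(p[0], real_pid, sizeof *real_pid);
    close(p[0]);
    if (n != (ssize_t)sizeof *real_pid) {
        waitpid(wrapper, NULL, 0); /* do not leave the wrapper as a zombie */
        return -1;
    }
    return wrapper;
}

/* Kill the real child, but reap the pid that fork() gave the parent. */
int kill_and_reap(pid_t wrapper_pid, pid_t real_pid)
{
    assert(wrapper_pid > 0 && real_pid > 0);

    if (kill(real_pid, SIGKILL) != 0 && errno != ESRCH)
        return -1;

    for (;;) {
        if (waitpid(wrapper_pid, NULL, 0) == wrapper_pid)
            return 0;
        if (errno != EINTR)
            return -1;
    }
}
```

Killing `wrapper_pid` instead would leave the inner process running, which is the failure mode described in the bullet above.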
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
A process namespace's pid 1 will act as an init process (it's expected to reap orphans in the namespace), so killing that will reliably wipe out all of the processes in the namespace. It's possible to stick to clone + exec without unshare as long as care is taken to avoid hitting stuff like the global allocator in the window before exec, but it's probably not going to have much impact on performance.
Depends on: 1088387
(In reply to Daniel Micay from comment #5)
> A process namespace's pid 1 will act as an init process (it's expected to
> reap orphans in the namespace), so killing that will reliably wipe out all
> of the processes in the namespace.

Indeed.  The problem is if we wind up with this:

  [firefox]--[useless process]--[plugin-container]--(stuff)

where the plugin-container is the pid namespace leader, and the IPC code thinks it should kill the useless process.

(The plugin-container actually can't reap orphans, because it's not permitted to use any of the wait* syscalls, but it also can't create new thread groups, so the only orphans would be from injecting processes with setns(2).  gdb and gdbserver, for reasons unknown to me, leave a zombie orphan in that case; this is irritating but seems harmless.)

> It's possible to stick to clone + exec
> without unshare as long as care is taken to avoid hitting stuff like the
> global allocator in the window before exec, but it's probably not going to
> have much impact on performance.

I took a closer look at LaunchApp.  Apparently bug 622992 mostly fixed this… except for setenv; and then bug 772734 fixed the setenv… but only on B2G.  The problem, if I understand correctly, is that we can't safely read `environ` if another thread could be changing (and freeing!) it concurrently.  And there's a fix (bug 773414), but it needs to be upstreamed to NSPR.  Which isn't impossible if we actually need it.

On the other hand, being able to offer the option of the setuid-root sandbox without making (and maintaining) two sets of annoying changes to IPC could be valuable, if there were demand for it.

Also, if anyone was wondering how hard it is to use the Chromium setuid sandbox for something that isn't actually Chromium:

$ env SBX_CHROME_API_RQ=1 /usr/lib/chromium/chrome-sandbox /bin/sh -c \
  'echo Before:; ls -ld /; printf C >&$SBX_D; read x <&$SBX_D; echo After:; ls -ld /' 7</dev/null
Before:
drwxr-xr-x 27 root root 4096 Oct 20 14:05 /
After:
/bin/sh: 1: ls: Permission denied
Another consideration is that Mozilla's continuous integration infrastructure runs Linux tests on Ubuntu 12.04 with the 3.2 kernel series.  Those won't do unprivileged userns, but the setuid sandbox would work, assuming it can be added to the VM images.
I have a proof-of-concept working along similar lines to the bottom of comment #6, but this is where I have questions about IPC.  It will wind up like this:

firefox(15789)─┬─chrome-sandbox(15874)───plugin-containe(15875)─⋯
               ⋮

Currently, if the parent tries to kill the child, it will kill the chrome-sandbox and the plugin-container will continue whatever it was doing.  This is bad.  The parent can get the plugin-container's actual pid[1], but it still needs to waitpid the chrome-sandbox or we'll accumulate zombies.  (I think Chromium itself collects dead children in a SIGCHLD handler, so it may not be the most useful example here.)


[1] Preferably by doing the SO_PASSCRED/SCM_CREDENTIALS dance.  The child should be able to find its own pid relative to the parent's namespace by readlinking /proc/self before it chroots, and if the parent can absolutely trust the pid in the IPC hello message then it could be sent that way, but if possible I'd rather not.
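
For reference, the SO_PASSCRED/SCM_CREDENTIALS dance from the footnote could be sketched in C roughly as below. The function names are hypothetical and this is not the actual IPC code; it only shows the receiver asking the kernel for sender credentials and pulling the pid out of the control message. As far as I understand it, the kernel reports that pid relative to the receiver's pid namespace, which is what makes it usable here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Ask the kernel to attach sender credentials to messages received on fd. */
int enable_passcred(int fd)
{
    int on = 1;
    assert(fd >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &on, sizeof on);
}

/* Child side: send a one-byte hello; the kernel fills in the credentials. */
int send_hello(int fd)
{
    char byte = 'H';
    return write(fd, &byte, 1) == 1 ? 0 : -1;
}

/* Parent side: receive one message and return the sender's pid as reported
 * by the kernel, or -1 if no credentials were attached. */
pid_t recv_peer_pid(int fd)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(struct ucred))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(fd, &msg, 0) < 0)
        return -1;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS) {
            struct ucred cred;
            memcpy(&cred, CMSG_DATA(c), sizeof cred);
            return cred.pid;
        }
    }
    return -1;
}
```

`enable_passcred` has to be called on the receiving socket before the child sends its hello, otherwise the message arrives without credentials.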
Flags: needinfo?(bent.mozilla)
But I also have this working, including with a makeshift chroot helper, by clone(2)ing directly from process_util_linux.cc and using user namespaces.  No changes to child process killing/waiting, but needs some hooks into process_util's forking / environment handling, and of course there's a patch to upstream to NSPR.  (And this will have to #error if it's a --with-system-nspr that's too old to have that patch, I guess.)  Probably easier to land, all told... but it leaves the Linuxes without unprivileged user namespaces (including, as mentioned, Mozilla's CI until the unspecified point in the future when it upgrades) out in the cold.
Flags: needinfo?(bent.mozilla)
Move process sandboxing bugs to the new Bugzilla component.

(Sorry for the bugspam; filter on 3c21328c-8cfb-4819-9d88-f6e965067350.)
Component: Security → Security: Process Sandboxing
Depends on: 1129492
No longer depends on: 1132760
No longer depends on: 1133073
This is now a meta bug, because it covers lots of things that aren't all going to be done at once.

Not yet properly broken out: desktop content processes; that needs more thought about how to deal with bug 1129492 and ordering graphics initialization with the various parts of sandbox initialization.
Keywords: meta
Summary: defence in depth for the Linux sandbox (namespaces, chroot) → defense in depth for the Linux sandbox (namespaces, chroot)
Priority: -- → P3
Summary: defense in depth for the Linux sandbox (namespaces, chroot) → [meta] defense in depth for the Linux sandbox (namespaces, chroot)
Severity: normal → S3