Open Bug 1151624 (pid-namespaces) Opened 9 years ago Updated 2 years ago

pid namespace isolation for sandboxed Linux child processes


(Core :: Security: Process Sandboxing, defect, P3)




Tracking Status
firefox40 --- affected


(Reporter: jld, Assigned: jld)


(Blocks 2 open bugs)


(Whiteboard: sblc5)


(3 obsolete files)

The next thing to do with the unprivileged user namespace support from bug 1151607 is to start child processes in their own pid namespaces.

This will be a little more involved:

1. The child has to be cloned with CLONE_NEWPID, not just unshare it (that applies to the unsharer's next child instead, and prevents it from creating more children than that one), so this needs to change how child process startup happens — and will bypass pthread_atfork, which means avoiding allocation between fork and exec.

2. This also breaks the code that iterates procfs to find the process's threads to signal them all to start seccomp-bpf, because the pids are in the wrong namespace.  We could mount a local procfs, but since we're requiring unprivileged user namespaces it's not much more of an imposition to require seccomp thread synchronization.

3. getpid() will always return 1 in the child process.  This will cause some weirdnesses that don't have bugs yet; e.g., in everything that uses getpid() for a log message or log file name, or otherwises causes the pid to be seen outside the process itself.

3.1. Also, the little-known limitations on sending signals to init(8) apply; there's a one-line patch I need to send Breakpad, in case someone tries to kill -11 one of these processes.

4. gdb will fail to find threads of such a process other than the one it was attached to, because the subjective thread ids from pthread's internal structures aren't the objective ones it needs for procfs and ptrace and similar, and there is (amazingly) no interface for translating them.  It's a simple matter of scripting to run gdb (or, alternately, just gdbserver) inside the pid namespace with a locally mounted procfs referencing it.

(Most of this has been mentioned in bug comments at some point, but it seems useful to gather it here.)
Assignee: nobody → jld
Attached patch Work In Progress (2015-08-14) (obsolete) — Splinter Review
I have this sort of working for desktop GeckoMediaPlugins.  (This diff also includes a patch for bug 1133073.)  I haven't yet fixed the log messages that identify the process as [GMP 1], or the log file names I ran into last time (leak checking, I think?), but there's a routine to get the outer process ID so that should be relatively simple.

Probably the biggest problem is that my earlier fix for multithreaded debugging doesn't work: the child process re-unshares the user namespace to gain the capabilities it needs to use chroot and unshare the net/ipc namespaces, but in order to setns() the process's pid namespace, a process has to setns() into the user namespace that owns it, which is the now-inaccessible intermediate user namespace.  (Alternately, it needs to have CAP_SYS_ADMIN in the root namespace, which means being actually root.)

This will break --with-system-nspr until there's a release and is updated for it.

* Letting the child process keep an open file descriptor to that namespace (so that the debugger helper can open it via /proc/N/fd, assuming that works) is worrying but not an immediately obvious security hole… but why risk it when there's no benefit when not debugging?

* Running the child process as subjective uid 0, so it doesn't lose capabilities on exec, would avoid the second unshare but is also worrying re security, even though that would be followed by dropping capabilities and using securebits (and PR_SET_NO_NEW_PRIVS) to keep it that way.

* Having the parent keep that fd (and remember to close it when the child exits, and somehow export info on which one goes with which child, etc.) means a bunch of code that does fancy things that are useless when not debugging.

* Making any of the above debug-build-only is possible, but functional divergence between opt/debug builds usually turns out to be a bad idea.

* Rewriting everything that landed in bug 1151607 so that we do all the unsharing in the initial clone(), and do the chroot in a separate process like the chromium setuid sandbox (but not setuid).  This would also make chrooting on B2G a little easier, because I wouldn't need to defer the uid change until after child process startup.  (It's also not doable until we can clone() instead of fork()ing, which is the part that's been blocked on getting an NSPR release, so it's not like bug 1151607 was "wrong".)

* Requiring that this sandbox feature be disabled to do multithreaded debugging is not an ideal user experience for the developer, since they might not know about this until it's too late and they need to re-reproduce their bug.  But it's what Chromium has done since the beginning, and what Servo will be doing with their e10s-equivalent.
Assignee: jld → nobody
Whiteboard: sblc3
Moving to sblc5 which is about getting user namespaces working.
Whiteboard: sblc3 → sblc5
See Also: → 1322526
Priority: -- → P3
Alias: pid-namespaces
I've gotten this working again, using clone() in place of fork(), with some changes/cleanups in IPC which I'll be filing other bugs for.  I'm also going to split off a bug for the clone-vs-fork part of this, because getting rid of SandboxEarlyInit() and friends is useful for other reasons.

PID namespaces seem to work as-is for media plugins, but in content they cause problems with PulseAudio — the client sees every other process as nonexistent, so it deletes all of PulseAudio's shared memory files and breaks all audio until the daemon is manually restarted, even if WebAudio/WebRTC isn't explicitly used.  So I'll also be breaking off a bug for using this in content.

For the pid as used in log messages and log file names and IPC and so on, I've successfully used ELF symbol interposition to replace getpid() with a function that does readlink() on /proc/self if the actual pid is 1 (and caches it, so it continues to work after filesystem access is restricted).  So far I've found only two problems with this approach: the profiler, which we can fix, and the crash reporter, which doesn't do anything useful anyway because the process is pid 1 and will ignore the signal.  Once sandboxing is started, seccomp-bpf can be used to redirect any misaddressed signals.  This area needs a little more investigation — the alternative would be some kind of opt-in GetGlobalPid(), but we have a lot of ad-hoc use of getpid() scattered around, to say nothing of third-party libraries that we may not control, and they're far more likely to just want a unique (relative to host and point in time) identifier than to be using tgkill() or similar.

As for debugging, good news: GDB 7.9 uses /proc/%d/task instead of reading the process's libpthread internal structures to determine the thread list[*], so it now shows all threads and seems to actually work.

gdb does give a warning message about the namespace mismatch, and there may be issues I haven't run into yet, but the wrapper script I wrote years ago to inject gdb into an appropriate collection of namespaces (and which I'd check into tools/ somewhere alongside this) still works.  (Or any other command — you could start a shell inside this mini-container, run `ps -e`, and be entertained by “plugin-container” or “Web Content” as pid 1.)

Assignee: nobody → jld
Depends on: 1401062
Attached patch Work In Progress (2018-10-09) (obsolete) — Splinter Review
Current version of the patch; the 2015 version was extremely bit-rotted.  The profiler and crash reporter changes would remain separate for actual landing.
Attachment #8649015 - Attachment is obsolete: true
Updated so it doesn't explode when SECCOMP_FILTER_FLAG_TSYNC isn't available, which I guess I somehow forgot about when redoing this after bug 1401062.
Attachment #9015642 - Attachment is obsolete: true

Interposing getpid may not be the best idea — we've recently encountered code in Mesa that uses kcmp and needs the real getpid, and there may be other uses of similar APIs by system libraries that we aren't aware of or that may be added in the future.

I think this should instead add a function in MFBT or somewhere central like that, for use in cases that need the globally meaningful pid (IPC, log messages, etc.)

Attachment #9015685 - Attachment is obsolete: true

In addition to auditing users of getpid() (or base::GetCurrentProcId()) and fixing them as previously described, there's also the crash reporter: when it determines which thread is the crashing one, that thread ID is given by the crashing process and in the wrong namespace, so it won't match the thread IDs that the reporter gets from procfs. (I think; I haven't tried this recently.) One possible fix is to read the NSpid fields from /proc/{pid}/task/{tid}/status when iterating the threads, but that was added in 4.1 and it would be nice to have something more backwards compatible if possible. Chromium's use of Breakpad with pid namespaces predates that, so it would have had to have a solution for this (maybe it matches threads by stack pointer value?) which could maybe be adapted; I haven't investigated yet.

It would be nice if the crashing process could use an explicitly sent SCM_CREDENTIALS with the tid, because that will map them into the recipient's namespaces, but that doesn't work: if the sending process has CAP_SYS_ADMIN within the namespace then it can send any pid/tid in that namespace, but otherwise it can send only its own pid and we don't want the process to have capabilities (even within an unprivileged user namespace) while sandboxed in order to reduce kernel attack surface. (Also, capabilities are automatically dropped on exec.) This limitation is intentional.

Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.