Closed Bug 1401062 Opened 4 years ago Closed 3 years ago

Use clone() instead of fork() for sandboxed Linux processes and remove SandboxEarlyInit etc.


(Core :: Security: Process Sandboxing, enhancement, P2)




Tracking Status
firefox60 --- fixed


(Reporter: jld, Assigned: jld)


(Depends on 1 open bug, Blocks 1 open bug)


(Whiteboard: sb+)


(3 files)

Instead of starting child processes with fork() and then using unshare() early in startup while single-threaded to deal with namespace isolation (and gaining local capabilities for chroot()), we should create the child process and unshare all necessary namespaces in one step using clone().


* This enables pid namespace isolation (bug 1151624), while still having the child process as a direct child of the parent as Gecko IPC expects.

* Removing the kUnexpectedThreads workaround from bug 1222500, as well as any other IsSingleThreaded-related problems (e.g., bug 1291553).

* We can decide what namespace-related actions to take in a process that has all the XPCOM/Gecko services running, rather than a limited context that needs to be fed information via environment variables.  (For example, bug 1376910 will need an exception for a specific graphics driver, and as we enable more features we'll usually want to be able to control them with prefs rather than env vars.)


* We're bypassing fork() and therefore pthread_atfork(), so the post-clone path needs to be async signal safe and not cheat by using malloc.  However, we're already doing this — it's required by POSIX, and apparently we've had run-ins with broken pthread_atfork()s that forced the issue.

* If we just use syscall(__NR_clone, ...), we're bypassing the part of glibc that updates its cached copy of the pid and tid.  In principle, this shouldn't be a problem as long as we don't call anything like raise() — which means blocking signals across clone() and resetting signal handlers in the child before unblocking.  Alternately, Chromium uses glibc's clone() wrapper, which insists on calling a function on an alternate stack instead of behaving like fork(), by longjmp'ing out of the callback; we could borrow their code if we run into problems using the syscall directly.

Non-Features / Alternatives:

* This isn't useful in the absence of unprivileged user namespace support.  To handle that case, we'd need to run the child process indirectly via a setuid root wrapper.  In this case, namespaces could be also be handled by a wrapper executable (and avoid the multithreaded-fork issues).

* We're still doing a copy-on-write of the parent's entire address space; i.e., making the entire thing read-only and causing page faults on write accesses.  This is a fundamentally expensive operation, but as far as I know we don't have data on how significant it might be for Firefox.  If we wanted to avoid it, we'd need to run the child process indirectly via a small server process (the Chromium developers use the term “zygote”, and seem to consider direct use of fork/clone too expensive to even consider, for their use case).

For both of these, we'd need IPC's base::LaunchApp, or a replacement for it, to do blocking I/O to get the process's actual pid.  This can't wait until it gets the pid in the IPC Hello message; the launch-time pid will have already escaped into Endpoint<> objects by then.  Basically, that pid needs to match what the child gets from base::GetCurrentProcessId to keep the IPC framework happy, and both of those have to match the child's actual pid (in the parent's pid namespace, as observed via SCM_CREDENTIALS) for the crash reporter to work.  This isn't insurmountable, but we'd like to do *less* blocking in this path: see bug 1348361.  And there needs to be a story for waitpid() when the “child process” isn't an actual child process.

So my plan is to set things up to use clone() directly, and most of the related work (see “Advantages”, above) will carry over even if the core of the process creation winds up being rewritten in the future.  I already have this working; it mostly just needs to be cleaned up and reviewed.
Priority: -- → P1
Whiteboard: sb+
Priority: P1 → P2
A few notes:

We can't completely get rid of SandboxEarlyInit: for non-tsync kernels we need to know the signal number we're going to use for the signal “broadcast” before libraries like glib have a chance to block it.  (That could also be done in a static initializer, but we're trying to have fewer of those where possible.)  However, it's no longer necessary to be strictly single-threaded there.

It's possible to retain the old SandboxEarlyInit and dynamically control whether to use it or clone() with a pref, but this doesn't provide enough benefit to be worth it.  For troubleshooting, setting MOZ_ASSUME_USER_NS=0 will force fork(); that will affect content the same way as the hypothetical pref, although for GMP it will remove the existing namespace/chroot layer.  If we encounter regressions then we should fix them and uplift, and if it's too close to release for that (I don't think this will happen) we can just back out the patches on the branch.

And I'm going to use the longjmp() trick to use glibc's clone() wrapper, instead of doing the syscall directly: it's not much code, the approach seems to be sound given that it's been used for years by Chromium, and it avoids a class of footguns that would probably be difficult to debug.
Comment on attachment 8941332 [details]
Bug 1401062 - Delete the old namespace/chroot code and reorganize sandbox init.

::: security/sandbox/linux/Sandbox.cpp:301
(Diff revision 1)
>    syscall(__NR_futex, reinterpret_cast<int*>(&gSetSandboxDone),
>            FUTEX_WAKE, 1);
>  }
>  static void
>  EnterChroot()

I guess this can go now?
Attachment #8941332 - Flags: review?(gpascutto) → review+
Comment on attachment 8941333 [details]
Bug 1401062 - Create Linux child processes with clone() for namespace/chroot sandboxing.
Attachment #8941333 - Flags: review?(gpascutto) → review+
Comment on attachment 8941334 [details]
Bug 1401062 - Avoid doing sandbox-related things to unsandboxed child processes.
Attachment #8941334 - Flags: review?(gpascutto) → review+
(In reply to Gian-Carlo Pascutto [:gcp] from comment #5)
> Comment on attachment 8941332 [details]
> >  static void
> >  EnterChroot()
> I guess this can go now?

It could, but the function becomes non-empty again in the second patch.
Updated the commit message in the main patch to mention the force-namespace pref.

We're in the 59 soft code freeze, so I think I'm going to defer landing this until it's branched and the 60 cycle starts next Monday.
Pushed by
Delete the old namespace/chroot code and reorganize sandbox init. r=gcp
Create Linux child processes with clone() for namespace/chroot sandboxing. r=gcp
Avoid doing sandbox-related things to unsandboxed child processes. r=gcp
Depends on: 1480401
No longer depends on: 1480401
You need to log in before you can comment on or make changes to this bug.