As we spin up more content processes we'd like to increase the amount of data shared across processes. On Linux, read-only portions of the binary such as .text and .rodata can be shared, but portions that must be relocated, such as .data.rel.ro, cannot. .data.rel.ro accounts for ~4MB of unsharable data. vtables account for a fair amount of this data, and while we do have bugs on file for reducing the number of vtables, that's a rather tedious process with diminishing returns.

Instead I propose implementing a system that loads a minimal content process (essentially just a main loop) that is then used to fork real content processes. This should give us a sizeable memory win for the relocations, as well as other possibilities for sharing memory pages marked as copy-on-write. Prior art can be found in Chrome's zygote process [1] as well as our previous attempt at this, Nuwa, for B2G [2]. I'm proposing a less aggressive version of Nuwa in that we would perform the fork before initializing XPCOM, avoiding the need to deal with threading, mutexes, polling, etc. We might be able to get larger wins by initializing some of our core libraries such as ICU, NSS, libav, and portions of SpiderMonkey prior to forking. Additionally, if we can implement something that works for Mac as well, we'd see at least a 15MB improvement.

I'm filing this in IPC, but it clearly has implications for sandboxing and XPCOM as well.

[1] https://chromium.googlesource.com/chromium/src/+/master/docs/linux_zygote.md
[2] https://wiki.mozilla.org/NuwaTemplateProcess
Jed, when you get a chance can you sketch out some of your thoughts on this?
In theory this should be a perf win as there's less initialization required. The Chrome folks measured ~56ms/GHz [1].

[1] https://chromium.googlesource.com/chromium/src/+/master/docs/linux_zygote.md#appendix-a_runtime-impact-of-relocations
Whiteboard: [overhead:>4MB] → [overhead:>4MB][qf]
For a basic proof-of-concept, we should be able to hook in early in main() to check for a command line flag or env var and, without starting threads (or using XPCOM, probably), run a little server that receives packets containing:

1. a list of fds (as SCM_RIGHTS) and a list of destination fds to map them to
2. environment variable settings
3. argv
4. [reserved for future expansion]

I think the IPC Pickle / ParamTraits stuff can be safely used to deserialize the data, but the fd passing would have to be hand-written. At the risk of stating the obvious: it then forks, and the child applies the fd mapping (see [1], although the CloseSuperfluousFds is a little unnecessary here) and sets the env vars (setenv is safe, because single-threaded), and continues with the provided argv; the server would send back the pid or error.

This server would be launched normally with LaunchApp (maybe lazily the first time it's needed?) and GeckoChildProcessHost::PerformAsyncLaunchInternal would use it instead. On IRC I suggested adding options to LaunchApp, but on further thought I think it makes more sense just to write something specialized.

Things that are broken with this:

* Sandboxing as it currently exists can work, but at the moment it's factored kind of badly for this — we just want to send down the SandboxFork constructor params, but that's all abstracted inside SandboxLaunch and hidden behind the ForkDelegate abstract class. (Those params are computed by poking at a lot of XPCOM stuff in the parent process; that part needs to stay where it is.)
* Sandboxing in the future was (at some point) going to allow launching processes via a setuid helper for distributions like Debian, Arch, and RHEL7 that don't allow unprivileged user namespaces by default. Chrome appears to handle this by sandbox-launching the entire zygote, which also means the renderers *start* without filesystem access if I'm reading the code correctly (among other quirks). Not insurmountable, but definitely makes this harder. Alternately, those setups could take the memory overhead of per-process ASLR.
* Waiting for processes to exit. On Linux the server could use CLONE_PARENT to create a sibling instead of a child; portably, it could handle it as a second RPC message. (I wouldn't mind throwing out and rewriting the child process watcher code.)
* Thread creation at initializer time. This can happen if people follow NVIDIA's advice about multithreaded GL, which isn't needed for Firefox; we could detect that and scrub LD_PRELOAD. In general we'd want to be able to detect this and fall back; I don't know if there's anything more portable than interposing pthread_create. (On Linux there's a trick with the link count of /proc/self/task, but the Tor Browser people want to run with /proc unmounted. On the other hand they might also want to sacrifice memory for per-process ASLR.) TSan also creates extra threads, but we can just turn this all off.
* Not exactly broken, but doing a blocking read on the I/O thread to wait for the pid isn't ideal. Making that async or moving it to a dedicated thread would be nice; this is entangled with making the main thread not sync wait to get the pid from the I/O thread.
* Mac, maybe. I've heard that fork-without-exec can cause problems involving Mach ports, but I don't understand the details or whether it applies to us / if there's some initialization we could defer to prevent it. (Mac sandboxing doesn't need any magic at launch time.)

A thing that is good:

* This also means that we're not forking the parent process, which imposes time costs proportional to how much writable private memory it has, which is usually a lot. I wanted to do something about this anyway. (Corollary: that blocking read to get the pid might actually be less jank than forking directly.)

The other idea I mentioned on IRC was using mozglue/linker to do the loading and modifying it to use shared memory (or MADV_MERGEABLE?) for the relocated things. That would be ELF-only (and Linux-only with KSM), but it avoids some of the fork-related problems. Also there might be reasons we can't or shouldn't do our own loading on desktop.

[1] https://searchfox.org/mozilla-central/rev/93d2b9860b3d341258c7c5dcd4e278dea544432b/ipc/chromium/src/base/process_util_linux.cc#34-54
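The fd-passing half of the protocol described above can be sketched roughly as follows. This is a minimal Python stand-in for what would really be hand-written C++ in Gecko; the message format (a comma-separated list of destination fd numbers, with the real fds attached via SCM_RIGHTS) is invented for illustration and is not Gecko's Pickle/ParamTraits encoding. Requires Python ≥ 3.9 for socket.send_fds/recv_fds.

```python
import os
import socket

def fork_server(sock):
    # Receive one launch request: fds via SCM_RIGHTS, plus the fd
    # numbers the child should map them to. Then fork, remap fds in
    # the child, and report the child's pid back to the requester.
    msg, fds, _, _ = socket.recv_fds(sock, 4096, 16)
    dest = [int(n) for n in msg.decode().split(",")]
    pid = os.fork()
    if pid == 0:
        for src, dst in zip(fds, dest):
            os.dup2(src, dst)          # child applies the fd mapping
        os.write(100, b"hello from child\n")
        os._exit(0)                    # real server would continue into main(argv)
    for fd in fds:
        os.close(fd)                   # server drops its copies of the fds
    sock.sendall(str(pid).encode())    # send back the pid (or an error)
    return pid

server_end, launcher_end = socket.socketpair()
r, w = os.pipe()
# "Parent process" side: pass our pipe's write end, asking the server
# to install it as fd 100 in the new child.
socket.send_fds(launcher_end, [b"100"], [w])
pid = fork_server(server_end)
os.close(w)
reported_pid = int(launcher_end.recv(32))
child_msg = os.read(r, 64).decode()
os.waitpid(pid, 0)
```

In the real thing the server would be a separate process reached over a socket rather than a function call, and the child would exec or continue into the content-process main loop instead of exiting.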
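The /proc/self/task trick for detecting initializer-time thread creation, mentioned in the bullets above, looks roughly like this (Linux-only; the fork server would refuse to fork, and the caller would fall back to a plain launch, if more than one task shows up):

```python
import os
import threading

def thread_count():
    # Linux-only: every thread of this process appears as an entry
    # under /proc/self/task. (The "link count" variant of the trick is
    # that on most kernels os.stat("/proc/self/task").st_nlink is the
    # thread count plus 2, avoiding a readdir.)
    return len(os.listdir("/proc/self/task"))

before = thread_count()
# Simulate a library starting a thread behind our back.
done = threading.Event()
t = threading.Thread(target=done.wait)
t.start()
during = thread_count()
done.set()
t.join()
```

As noted above, this fails if /proc is unmounted, which is why a fallback (or interposing pthread_create) is still needed.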
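The MADV_MERGEABLE idea at the end of the comment would amount to something like this: opt the relocated pages into KSM so the kernel can merge identical pages across processes. A minimal sketch (Linux-only; whether anything actually merges depends on the kernel being built with CONFIG_KSM and /sys/kernel/mm/ksm/run being enabled):

```python
import mmap

# An anonymous writable region standing in for pages the linker has
# just relocated; identical content across processes could be merged
# by KSM once marked mergeable.
length = 16 * mmap.PAGESIZE
buf = mmap.mmap(-1, length)
buf.write(b"\x2a" * length)
buf.seek(0)
first = buf.read_byte()
buf.madvise(mmap.MADV_MERGEABLE)    # opt these pages into KSM scanning
buf.madvise(mmap.MADV_UNMERGEABLE)  # and it can be undone
buf.close()
```

The fork-based approach gets the sharing for free via copy-on-write; KSM trades that for ongoing kernel-side scanning cost, which is part of why it's the fallback idea here.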
I'm hearing a lot of talk about Linux (and maybe Mac), but none about Windows...and our platform priorities run in roughly the opposite direction. I guess we would win on...Android?
(In reply to Nathan Froyd [:froydnj] from comment #4)
> I'm hearing a lot of talk about Linux (and maybe Mac), but none about
> Windows...and our platform priorities run in roughly the opposite direction.
> I guess we would win on...Android?

The main issue we're trying to solve here is relocated data not being sharable across processes. That isn't a problem on Windows, because relocations are shared across separate processes. It is a big problem on Linux and OS X, though, and we can't really ignore it there. Same goes for Android.
Android/GeckoView is… interesting. We're currently launching child processes as Android services, which means that we already have Android Runtime stuff when we're started (so, probably threads), and if we want N content processes we'd have to declare ≥N services in an XML file. At present we support only one content process. It's apparently also possible to use fork/exec, but there's concern that this isn't really supported and whatever we do with that could be arbitrarily broken by OS updates. Also, exec'ing means no Android Runtime, which means no way to get a GL context, which means we'd have to do WebGL remoting, which Chrome (last I heard) does on desktop but *not* on mobile because of the overhead. (This is all secondhand from :snorp; I hope I haven't mangled it too much.)
(At this point, this doesn't sound like it's in the [qf] umbrella, but feel free to renominate with more details if needed. Knee-jerk triage decision: there will be lots of work around fission to avoid incurring perf regressions as we increase the number of content processes, and that's all worthwhile work, but we also don't want the [qf] project to scope-creep to encompass all of that work.)
Whiteboard: [overhead:>4MB][qf] → [overhead:>4MB][qf-]
(In reply to Daniel Holbert [:dholbert] from comment #7)
> (At this point, this doesn't sound like it's in the [qf] umbrella, but feel
> free to renominate with more details if needed. Knee-jerk triage decision:
> there will be lots of work around fission to avoid incurring perf regressions
> as we increase the number of content processes, and that's all worthwhile
> work, but we also don't want the [qf] project to scope-creep to encompass
> all of that work.)

I don't think this is scope creep. This is a project that benefits both memshrink and qf in unrelated ways. It benefits memshrink by allowing us to share relocated data (and some data touched by static initializers) between child processes. It benefits qf by making it much cheaper/faster to spawn new content processes and, importantly, by moving the janky fork() step from the parent process (where it's user-visible) to the fork server (where it's not).
For what it's worth, moving the fork() to a dedicated server with a minimal amount of private writable data should greatly decrease the amount of jank (and CPU usage), in addition to moving it out of the parent process.

There are plans (bug 1348361, bug 1461459, bug 1446161) to stop making the main thread block waiting for the I/O thread to finish the launch operation; it may also be possible to move that off the I/O thread so it doesn't block IPC message passing either, but that's a little more complicated.

Profiling on Linux, I'm seeing a gap in samples from the parent process main thread in LaunchSubprocess, flanked by pthread_cond_wait blocking on the I/O thread. I'd understand that if I were profiling the I/O thread as well, because it blocks SIGPROF in order to ensure it can make progress on forking, and I believe that will hang the entire profiler for the duration… but I'm not doing that. So this suggests that the entire process gets suspended (either explicitly or as a side effect of blocking on page faults) in order to remove write permissions and do TLB shootdown. In any case, I'm seeing 11ms of jank there in a test profile, and it would probably be more in a heavily used browser; offloading the fork() to another process is the only real solution.

Also, the parent process is going to take an ongoing performance hit as it incurs page faults to flip the momentarily copy-on-write memory back to writable. I've observed this with perf(1) but I don't have numbers at the moment; I remember the total time was on the same order of magnitude as the fork itself.

On Mac the situation is different: we're using posix_spawn, which in theory doesn't need to do anything like fork() and can just create the new process /de novo/, but I haven't tried profiling it yet.

tl;dr: this is a jank problem on Linux (and async launch probably won't help), it may not be one on Mac but there's no data yet, and Windows is out of scope for this bug (see comment #5).
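The claim above that fork() cost scales with the parent's writable private memory can be illustrated with a crude micro-benchmark. This is a sketch, not a faithful reproduction of the profiles discussed here: Python stands in for Gecko, the 64MB ballast stands in for the parent process's dirty heap, and the absolute numbers are machine-dependent.

```python
import os
import time

def time_fork():
    # Time a bare fork(); the child exits immediately, so what's
    # measured is mostly the parent-side work (page-table copying,
    # write-protecting pages for copy-on-write).
    t0 = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    elapsed = time.perf_counter() - t0
    os.waitpid(pid, 0)
    return elapsed

small = time_fork()

# Dirty ~64MB of private writable memory -- the kind of thing a busy
# parent process accumulates and a minimal fork server would not.
ballast = bytearray(64 * 1024 * 1024)
for i in range(0, len(ballast), 4096):
    ballast[i] = 1  # touch each page so it's actually dirty

large = time_fork()
print(f"fork with little dirty memory: {small * 1e3:.2f} ms")
print(f"fork with ~64MB dirty memory:  {large * 1e3:.2f} ms")
```

The second fork is typically noticeably slower, and the effect grows with the amount of dirty memory, which is the argument for doing the fork in a process that owns almost none.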
glandium points out in bug 1480401 that we may need SandboxFork to call pthread_atfork handlers to use it like this. The fork server will definitely be single-threaded (unless we're using TSan, but in that case sandboxing is disabled and the real fork() will always be used), so the usual problems with multithreaded fork don't apply, but there might be something.
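For reference, the shape of the pthread_atfork contract being discussed: prepare handlers run before the fork, and parent/child handlers run after it on each side. CPython's os.register_at_fork is the analogous hook and makes for a self-contained illustration (this is not Gecko code; the real concern is about C-level handlers that SandboxFork's clone() path may bypass):

```python
import os

calls = []

# Analogous to pthread_atfork(prepare, parent, child): these run
# around every subsequent fork() made through CPython.
os.register_at_fork(
    before=lambda: calls.append("prepare"),
    after_in_parent=lambda: calls.append("parent"),
    after_in_child=lambda: calls.append("child"),
)

pid = os.fork()
if pid == 0:
    # In the child, "prepare" ran pre-fork and "child" ran post-fork;
    # a real handler here might reinitialize locks or per-process state.
    os._exit(0 if calls == ["prepare", "child"] else 1)
status = os.waitpid(pid, 0)[1]
child_ok = (os.WEXITSTATUS(status) == 0)
```

A fork path that skips these handlers leaves any library relying on them (to reacquire or reset locks, for instance) in an inconsistent state in the child, which is glandium's point even for a mostly single-threaded server.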
Assignee: nobody → jld