[meta] Improve process launch speed (time spent in GeckoChildProcessHost)

NEW
Assigned to

Status

()

enhancement
11 months ago
3 months ago

People

(Reporter: jld, Assigned: jld)

Tracking

(Depends on 3 bugs, Blocks 2 bugs, {meta})

unspecified
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Fission Milestone:Future)

Details

In bug 1446161 and related bugs we've been trying to make process launch more asynchronous and defer dependencies on having a pid/handle for it, but there are some other ways to save time without needing deep changes to IPC.

On Mac, we should be able to remove almost all of the launch time by fixing bug 1473156.  It's similar to the general async launch issue but it's specific to Mach IPC and it could be fixed relatively easily if we can change to shm_open for shared memory.

On Linux, bug 1470591 (the fork server) is *not* easy, but given the per-process memory impact we'll be doing it eventually, and it should remove essentially all of the time overhead that async launch would make async.

Also on Linux, bug 1439412 (remote fontconfig) imposes significant time costs that async launch *wouldn't* help, also has significant per-process memory impact, and allows some lower-priority sandbox improvements.  It's probably not easy, and needs graphics/layout expertise.

That leaves Windows, where I did some profiling: CreateProcessAsUser itself is distressingly slow (several milliseconds) but a majority of the time is spent in the sandboxing code.  (Warning: the total times I see locally are a lot lower than Telemetry, so my results may not be representative.)  Specifically, about half the total launch time (including the main-thread part) is building the sandbox broker policy; this work can probably be cached, like what I did on B2G.  I'll file some bugs.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #0)

> That leaves Windows, where I did some profiling: CreateProcessAsUser itself
> is distressingly slow (several milliseconds) but a majority of the time is
> spent in the sandboxing code.  (Warning: the total times I see locally are a
> lot lower than Telemetry, so my results may not be representative.) 

On Windows I've already seen profiles where starting a content process was taking more than a second. I attributed this extreme slowness to the presence of antiviruses that (I guessed) scanned all new processes. That's what convinced me content process creation should be async.
Depends on: 1474991
(In reply to Florian Quèze [:florian] from comment #1)
> On Windows I've already seen profiles where starting a content process was
> taking more than a second. I attributed this extreme slowness to the
> presence of antiviruses that (I guessed) scanned all new processes. That's
> what convinced me content process creation should be async.

It would help to have a profile of one of those launches to see where we're spending that time.  According to telemetry, 1% of launch times on Windows Beta[*] are >1s, but we don't have data on what part of the launch process might be responsible for that.  I filed bug 1474991 for getting more data so we can at least get a statistical idea of which general part(s) of launch the long tail is.


[*] We also have data for release, but it's not available through the user-friendly public histogram viewer yet.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #2)
> (In reply to Florian Quèze [:florian] from comment #1)
> > On Windows I've already seen profiles where starting a content process was
> > taking more than a second. I attributed this extreme slowness to the
> > presence of antiviruses that (I guessed) scanned all new processes. That's
> > what convinced me content process creation should be async.
> 
> It would help to have a profile of one of those launches to see where we're
> spending that time.

I had a profile in bug 1348361 comment 52 but unfortunately without libxul symbols.
(In reply to Florian Quèze [:florian] from comment #3) 
> I had a profile in bug 1348361 comment 52 but unfortunately without libxul
> symbols.

I remember that profile now.  Looking at it again: that time is definitely spent waiting on PerformAsyncLaunchInternal, but the I/O thread ("Gecko_IOThread") isn't profiled by default so we can't tell if it's the CreateProcess itself or something that the sandbox is doing.  If we had a profile for that thread, even with just system symbols and not libxul, that could tell us things.

Also, my concern in bug 1348361 comment #54 that there might be other important threads for the sandbox seems to be unjustified, at least going by the profiling I've done locally.

(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #2)
> [*] We also have data for release, but it's not available through the
> user-friendly public histogram viewer yet.

I did some quick hacky things with sql.t.m.o and CROSS JOIN UNNEST and postprocessing with awk and copying numbers from TelemetryHistogramData.inc.  Assuming I got this right, it looks like on release it's about 3% of launch times that are over 1s, vs. 1% on beta.  (More data: the 99th percentile is 3s vs. 1s, and the 95th is 0.6s vs 0.3s, but the quartiles are about the same; so, the long tail is wider.)

I also tried measuring the part of the Windows userbase that's reported a >1s launch time at least once; that's more like 5%.  Put another way: more people than the entire Linux userbase (as known to Telemetry).
There are some profiles in bug 1438385 which might be interesting.

I have access to a couple of computers where cold startup is really slow (especially if you start Firefox right after booting), I'll try to get profiles there. One of them has AVG, the other only the Windows built-in antivirus.
(In reply to Marco Castelluccio [:marco] from comment #5)
> There are some profiles in bug 1438385 which might be interesting.

I notice a few places where we spend noticeable time (a few milliseconds) starting threads in the parent process; in that case the time is spent waiting for the new thread to make enough progress to initialize an event loop or something like that.  Those instances are all part of initializing DOM/graphics/etc. things, not the core process launching.  Also there seem to be a lot of missing samples, which I think means that the OS isn't scheduling the sampler thread as soon as it requested because of contention for CPU.

The longest process launch wait I saw was only 36ms, from the first profile in bug 1438385 comment #33: https://perf-html.io/public/d4c632e074c9f34f18fe9cc456d11dd28cfe4d40/calltree/?range=33.3008_33.3366&thread=0

And, like the profile in bug 1348361 comment 52, it's not very informative because Gecko_IOThread isn't included.
Summary: [meta] Improve process launch speed → [meta] Improve process launch speed (time spent in GeckoChildProcessHost)
Depends on: 1476101
https://perfht.ml/2m9E9aM comes from SUMO via :philipp on IRC #perf.  We can see the multi-second delays in LaunchSubprocess, but there's also a place (starting around 51s) where the main thread calls CreateProcess directly, from a JS function named “runMinidumpAnalyzer” in the crash reporter.  And the ~entire time is spent in NtCreateUserProcess.

I was hoping this would turn out to be either not CreateProcess at all, or at least some AV doing userspace hooking that would be visible in the profile so we could maybe see what it is and what it's doing, but no.

There's also another interesting item there in content process #4, around 48s: it's trying to connect to the crash reporting service in the parent process (CreateFileW on a pipe), but because of sandboxing that has to be brokered through the parent process… which is completely absent from profiling for 1.8s.  So the profiler sampler thread and the sandbox broker thread pool both fail to make progress, it seems.  Before the gap (and after, for one sample, but are we sure that wasn't captured beforehand?), the main thread can be seen in WriteMinidumpOnHandlerThread, doing what looks like the equivalent of sending a sync runnable.  I don't know enough to speculate what in all of that, if anything, might point to the problem.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #7)
> https://perfht.ml/2m9E9aM comes from SUMO via :philipp on IRC #perf.  We can
> see the multi-second delays in LaunchSubprocess, but there's also a place
> (starting around 51s) where the main thread calls CreateProcess directly,
> from a JS function named “runMinidumpAnalyzer” in the crash reporter.  And
> the ~entire time is spent in NtCreateUserProcess.

The main thread is blocked when the JS code calls nsIProcess::runAsync. Given the 'async' in the name of the method, I think it's pretty reasonable for the JS code to expect this to not block. Is there a specific reason why creating a process can't be done off main thread in that case (which seems pretty different from the common case of launching content processes)?
(In reply to Florian Quèze [:florian] from comment #8) 
> The main thread is blocked when the JS code calls nsIProcess::runAsync.
> Given the 'async' in the name of the method, I think it's pretty reasonable
> for the JS code to expect this to not block.

“Async” refers to how the process's termination is reported; contrast nsIProcess::run and its `blocking` parameter, and see bug 480427.  I don't think anyone expected that people would break the OS enough that just *creating* a process would block the caller for multiple seconds.

> Is there a specific reason why
> creating a process can't be done off main thread in that case (which seems
> pretty different from the common case of launching content processes)?

Someone would have to write the code.  nsProcess is already creating a helper thread (per process!) for runAsync, so moving the launch there wouldn't be too much extra work.  Accesses to the `pid` property would need to block, but CrashService.js doesn't appear to be accessing it immediately after launch (so we hopefully wouldn't wind up in the same situation as content processes).
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #9)
> (In reply to Florian Quèze [:florian] from comment #8) 

> > Is there a specific reason why
> > creating a process can't be done off main thread in that case (which seems
> > pretty different from the common case of launching content processes)?
> 
> Someone would have to write the code.

Ok, that was my guess. Sounds like a good [qf] bug to file then!
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #9)
> (In reply to Florian Quèze [:florian] from comment #8) 
> > The main thread is blocked when the JS code calls nsIProcess::runAsync.
> > Given the 'async' in the name of the method, I think it's pretty reasonable
> > for the JS code to expect this to not block.
> 
> “Async” refers to how the process's termination is reported; contrast
> nsIProcess::run and its `blocking` parameter, and see bug 480427.  I don't
> think anyone expected that people would break the OS enough that just
> *creating* a process would block the caller for multiple seconds.

I just encountered a Mac profile where nsIProcess::run(blocking = false) blocks the main thread for 14ms (on a developer machine, so probably a fast one) https://perfht.ml/2myFOqK
So it's not just on Windows that creating a process is expensive.

Updated

3 months ago
Fission Milestone: --- → Future
You need to log in before you can comment on or make changes to this bug.