Closed Bug 1446161 Opened 6 years ago Closed 6 years ago

Make ContentParent::PreallocateProcess asynchronous

Categories

(Core :: DOM: Content Processes, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla65
Performance Impact high
Tracking Status
firefox65 --- fixed

People

(Reporter: Alex_Gaynor, Assigned: jld)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [fxperf:p1])

Attachments

(3 files, 1 obsolete file)

Once bug 1348361 lands, we'll be spawning content processes from a different thread, but calls to OtherPid() can block. Right after we kickoff the process spawn, we call InitInternal(), which calls OtherPid().

We should look into deferring the calls to OtherPid() for as long as possible (ideally until an OnProcessLaunched callback :-)).

As a component of this, we'll probably need to move around the telemetry for "process spawn time" and make sure it's measuring what we really want it to measure.
I think this should inherit the qf status of bug 1348361.
Whiteboard: [qf]
Blocks: 1357242
Assignee: nobody → agaynor
Depends on: 1448125
Whiteboard: [qf] → [qf:p1][qf:f64]
Priority: -- → P2
Whiteboard: [qf:p1][qf:f64] → [qf:p1:f64]
Assignee: agaynor → felipc
Status: NEW → ASSIGNED
Whiteboard: [qf:p1:f64] → [qf:p1:f64][fxperf:p1]
Assignee: felipc → jld
I tried looking at this.  I wondered if we could maybe apply stack-ripping to the call graph leading up to the process launch, and turn the continuations that want the pid into closure chains, but it winds up going into nsFrameLoader and there's a lot of code there and something would probably go wrong.

But on the other hand: Why do we need the pid?  On Unix we actually don't — a serial number assigned by the parent process would work just as well (except for the Linux sandbox policy, but that could be made async).  On Windows we need either the pid or handle to send handles, either explicitly in messages or to set up new endpoints, but those messages don't need to actually be sent until the process is launched.  If the pid/handle isn't immediately available it would be necessary to duplicate the handle to be passed a second time, to get an owned handle that can be enqueued; and, if/when we move process launch off the I/O thread, we'd need to not try to service I/O readiness on a channel until its process is fully launched.

But that's still not easy; it means changing the notion of process ID that's used in at least some of the IPC code.  (Also, it's not completely impossible that there's a use of pids that I missed.)

--

And then there's the way people probably thought this would go, which is to go and pick out call chains to OtherPid one at a time and find ad-hoc ways to make them asynchronous and defer any dependencies on (e.g.) instance variables that aren't set because they're about actors that can't be created yet.  At first that seems as if it ought to be easier than a grand unified solution, but it could also be a disaster given all the code that's involved and everything it could reach indirectly.

--

Right now I'm leaning towards trying to change the nature of pids.
See Also: → 1473156
See Also: → 1474681
I thought about this a little more while filing bug 1475382, and the “virtual pid” approach is actually worse than I thought: when serializing a HANDLE on Windows, we'd need to write *something* into the buffer, and then keep the offsets and have a hook for the message to go back and patch itself when it's actually sent.  That's possible but it's kind of messy, and that's on top of the complexity of managing multiple types of process identifiers.

But also, I notice that preallocating a content process would be relatively easy to do asynchronously.  That would still leave the fallback case, where the frame loader needs a process *now*, but maybe people who know more about that area of the DOM could help with that.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #2)
> I wondered if we could maybe apply stack-ripping
> to the call graph leading up to the process launch, and turn the
> continuations that want the pid into closure chains, but it winds up going
> into nsFrameLoader and there's a lot of code there and something would
> probably go wrong.

It turns out this was actually a good idea.  In the current e10s model it can be applied relatively easily to the preallocated case, which is hopefully what we hit most of the time.  And the plan for Fission is, as I understand it, to redo the nsFrameLoader / GetNewOrUsedBrowserProcess path: throw out the “or used” part, and make loading the initial location act like an asynchronous navigation away from an in-process about:blank.

So I implemented async preallocation using MozPromise.

There are still some bugs to be worked out; my first try run failed a lot because content processes were marked as alive on construction, but now there's a preemption point between that and when the channel is opened.  Fixing that turned up some problems on shutdown.  But it looks tractable.

As for OtherPid, this *does* defer all calls to it until after the launch promise is resolved, like what comment #0 wanted.  Because otherwise they'd crash, because I've removed the existing async launch prototype and OtherPid is once again nonblocking.
Priority: P2 → P1
Summary: Defer calling ContentParent::OtherPid after async launching → Make ContentParent::PreallocateProcess asynchronous
Currently my prototype mostly seems to be working, but there are a lot of debug-build failures on Try: either because of massive memory leaks including at least one ContentParent, or asserting because NSS failed to shut down (possibly something holding it open), and "ASSERTION: Component Manager being held past XPCOM shutdown" is also seen.  So this points to a process being leaked mid-launch somehow.

I thought it might have been a race between shutdown and an in-progress launch, so I tried using an async shutdown blocker, but that caused more shutdown crashes because the blocker never un-blocked.  That implies that either the RunPerformAsyncLaunch runnable sent to the I/O thread was never run (which shouldn't be possible without an assertion failure, unless there's an edge case that's not covered) or the main thread failed to receive the promise `Then` runnable (which shouldn't be possible because the shutdown blocker should have been applied early enough that event loops still work).
Oh.  MozPromiseHolder::Re{solve,ject} clear the promise pointer, so that if your Ensure()->Then() happens before them then it works, but afterwards it chains onto a new promise that will never resolve.  (In this case, the main thread was preempted in ContentParent::LaunchSubprocessAsync until after the I/O thread was already finished launching.)  This seems like a footgun, so I filed a bug.

This comment brought to you by rr chaos mode.
Attached patch WIP (2018-08-29) (obsolete) — Splinter Review
Work in progress.  This finally seems to pass Try, but it needs some cleanup (and to be assembled into a number of patches greater than 1 but less than what's on my local branch).
Attachment #9005092 - Flags: review-
See Also: → 1126681
Are you still working on this, jld?
Flags: needinfo?(jld)
The CONTENT_PROCESS_LAUNCH_TIME_MS histogram is currently gathering times
from two different spans of the launch process and mixing them together;
it's at best a rough approximation of "launch time".

In addition, with async launch we'll want to gather different metrics
than for sync launch (see comments on bug 1474991); therefore, I've
taken the opportunity to create a new histogram for sync launch time.

Depends on D8940
The first attempt at async launch tried to hide the asynchrony inside
IPC, by making the process seem to be launched enough to construct new
channels and send it messages, and lazily blocking on the pid/handle.
Unfortunately, in practice we wind up needing the pid/handle immediately,
and this requirement is too deeply embedded in IPC for that to be viable.

Depends on D8941
There are several layers to this patch:

1. GeckoChildProcessHost now exposes a promise that's resolved when
the process handle is available (or rejected if launch failed), as a
nonblocking alternative to LaunchAndWaitForProcessHandle.

2. ContentParent builds on this with the private method
LaunchSubprocessAsync and the public method PreallocateProcessAsync;
synchronous launch continues to exist for the regular on-demand launch
path, for the time being.

3. PreallocatedProcessManager now uses async launch, and handles the new
"launch in progress" state appropriately.

Depends on D8942
Attachment #9005092 - Attachment is obsolete: true
Flags: needinfo?(jld)
Attachment #9017747 - Attachment description: Bug 1446161 - Replace CONTENT_PROCESS_LAUNCH_TIME_MS telemetry with CONTENT_PROCESS_SYNC_LAUNCH_MS. → Bug 1446161 - Remove CONTENT_PROCESS_LAUNCH_TIME_MS telemetry.
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/37bf52f0e9cf
Remove CONTENT_PROCESS_LAUNCH_TIME_MS telemetry. r=mconley,chutten
https://hg.mozilla.org/integration/autoland/rev/1bdf64b29121
Remove the earlier attempt at async launch. r=spohl,mccr8
https://hg.mozilla.org/integration/autoland/rev/80116391b7fe
Asynchronously launch preallocated content processes using MozPromise. r=mccr8
Backout by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b0e69b136883
Backed out 8 changesets (bug 1446161, bug 1487287, bug 1488993, bug 1474991, bug 1496608) for very frequent automation.py crashes on a CLOSED TREE
The backout is because of the assertions I mentioned in https://phabricator.services.mozilla.com/D8943#inline-53846, but I think the case where they fail was introduced in bug 1487287; it looks like the I/O thread can get woken up and do OnChannelConnected before the launch thread returns from launching the process.  (See also bug 1496608 comment #3 which links to a failure.)

I did push the entire stack to Try after adding those assertions, but maybe I needed to run more tests to catch this.

It's possible that the rest of this could be relanded safely, without bug 1487287, but I'll want to do some more testing to verify that.
I did some Try runs.  Win10 ASan seems to be the most reliable way to reproduce the assertion failure, but 3x mochitests got me only two of those:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e24fe4cc9bc7fd3af10f4e8fe16198cea7379229&selectedJob=214407005

So I did 3x and then another 5x: 

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6eff48a0a73bc2379ab372e3935928e01226b522
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e3fd17b6d744cf06fc01470481c83b535ef155d4

And all clear so far; the failures all look unrelated.  I'm going to try to reland.
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/25b4fcd4382d
Remove CONTENT_PROCESS_LAUNCH_TIME_MS telemetry. r=mconley,chutten
https://hg.mozilla.org/integration/autoland/rev/3cbd2cf23c86
Remove the earlier attempt at async launch. r=spohl,mccr8
https://hg.mozilla.org/integration/autoland/rev/9f5e62b65302
Asynchronously launch preallocated content processes using MozPromise. r=mccr8
https://hg.mozilla.org/mozilla-central/rev/25b4fcd4382d
https://hg.mozilla.org/mozilla-central/rev/3cbd2cf23c86
https://hg.mozilla.org/mozilla-central/rev/9f5e62b65302
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Performance Impact: --- → P1
Whiteboard: [qf:p1:f64][fxperf:p1] → [fxperf:p1]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: