Closed Bug 1446161 Opened 7 years ago Closed 6 years ago

Make ContentParent::PreallocateProcess asynchronous

Categories

(Core :: DOM: Content Processes, enhancement, P1)

Product:

Component:

Type:

enhancement

Priority:

P1

Severity:

normal

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla65

Performance Impact

high

Tracking Flags:

Tracking

Status

firefox65

---

fixed

People

(Reporter: Alex_Gaynor, Assigned: jld)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [fxperf:p1])

Attachments

(3 files, 1 obsolete file)

WIP (2018-08-29) 7 years ago Jed Davis [:jld] ⟨⏰\|UTC-8⟩ ⟦he/him⟧ 54.00 KB, patch	jld : review-	Details \| Diff \| Splinter Review
Bug 1446161 - Remove CONTENT_PROCESS_LAUNCH_TIME_MS telemetry. 6 years ago Jed Davis [:jld] ⟨⏰\|UTC-8⟩ ⟦he/him⟧ 46 bytes, text/x-phabricator-request		Details \| Review
Bug 1446161 - Remove the earlier attempt at async launch. 6 years ago Jed Davis [:jld] ⟨⏰\|UTC-8⟩ ⟦he/him⟧ 46 bytes, text/x-phabricator-request		Details \| Review
Bug 1446161 - Asynchronously launch preallocated content processes using MozPromise. 6 years ago Jed Davis [:jld] ⟨⏰\|UTC-8⟩ ⟦he/him⟧ 46 bytes, text/x-phabricator-request		Details \| Review

Alex Gaynor [:Alex_Gaynor]

Reporter

Description

•

7 years ago

Once bug 1348361 lands, we'll be spawning content processes from a different thread, but calls to OtherPid() can block. Right after we kickoff the process spawn, we call InitInternal(), which calls OtherPid(). We should look into deferring the calls to OtherPid() for as long as possible (ideally until an OnProcessLaunched callback :-)). As a component of this, we'll probably need to move around the telemetry for "process spawn time" and make sure it's measuring what we really want it to measure.

Florian Quèze [:florian]

Comment 1

•

7 years ago

I think this should inherit the qf status of bug 1348361.

Whiteboard: [qf]

Updated

•

7 years ago

Blocks: 1357242

Alex Gaynor [:Alex_Gaynor]

Reporter

Updated

•

7 years ago

Assignee: nobody → agaynor

Alex Gaynor [:Alex_Gaynor]

Reporter

Updated

•

7 years ago

Depends on: 1448125

Mike Conley (:mconley) (:⚙️)

Updated

•

7 years ago

Whiteboard: [qf] → [qf:p1][qf:f64]

Jim Mathies [:jimm]

Updated

•

7 years ago

Priority: -- → P2

Jean Gong :jgong

Updated

•

7 years ago

Whiteboard: [qf:p1][qf:f64] → [qf:p1:f64]

:Felipe Gomes (needinfo for replies!)

Updated

•

7 years ago

Assignee: agaynor → felipc

Status: NEW → ASSIGNED

Whiteboard: [qf:p1:f64] → [qf:p1:f64][fxperf:p1]

:Felipe Gomes (needinfo for replies!)

Updated

•

7 years ago

Assignee: felipc → jld

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 2

•

7 years ago

I tried looking at this. I wondered if we could maybe apply stack-ripping to the call graph leading up to the process launch, and turn the continuations that want the pid into closure chains, but it winds up going into nsFrameLoader and there's a lot of code there and something would probably go wrong. But on the other hand: Why do we need the pid? On Unix we actually don't — a serial number assigned by the parent process would work just as well (except for the Linux sandbox policy, but that could be made async). On Windows we need either the pid or handle to send handles, either explicitly in messages or to set up new endpoints, but those messages don't need to actually be sent until the process is launched. If the pid/handle isn't immediately available it would be necessary to duplicate the handle to be passed a second time, to get an owned handle that can be enqueued; and, if/when we move process launch off the I/O thread, we'd need to not try to service I/O readiness on a channel until its process is fully launched. But that's still not easy; it means changing the notion of process ID that's used in at least some of the IPC code. (Also, it's not completely impossible that there's a use of pids that I missed.) -- And then there's the way people probably thought this would go, which is to go and pick out call chains to OtherPid one at a time and find ad-hoc ways to make them asynchronous and defer any dependencies on (e.g.) instance variables that aren't set because they're about actors that can't be created yet. At first that seems as if it ought to be easier than a grand unified solution, but it could also be a disaster given all the code that's involved and everything it could reach indirectly. -- Right now I'm leaning towards trying to change the nature of pids.

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

7 years ago

See Also: → 1473156

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

7 years ago

See Also: → 1474681

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 3

•

7 years ago

I thought about this a little more while filing bug 1475382, and the “virtual pid” approach is actually worse than I thought: when serializing a HANDLE on Windows, we'd need to write *something* into the buffer, and then keep the offsets and have a hook for the message to go back and patch itself when it's actually sent. That's possible but it's kind of messy, and that's on top of the complexity of managing multiple types of process identifiers. But also, I notice that preallocating a content process would be relatively easy to do asynchronously. That would still leave the fallback case, where the frame loader needs a process *now*, but maybe people who know more about that area of the DOM could help with that.

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 4

•

7 years ago

(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #2) > I wondered if we could maybe apply stack-ripping > to the call graph leading up to the process launch, and turn the > continuations that want the pid into closure chains, but it winds up going > into nsFrameLoader and there's a lot of code there and something would > probably go wrong. It turns out this was actually a good idea. In the current e10s model it can be applied relatively easily to the preallocated case, which is hopefully what we hit most of the time. And the plan for Fission is, as I understand it, to redo the nsFrameLoader / GetNewOrUsedBrowserProcess path: throw out the “or used” part, and make loading the initial location act like an asynchronous navigation away from an in-process about:blank. So I implemented async preallocation using MozPromise. There are still some bugs to be worked out; my first try run failed a lot because content processes were marked as alive on construction, but now there's a preemption point between that and when the channel is opened. Fixing that turned up some problems on shutdown. But it looks tractable. As for OtherPid, this *does* defer all calls to it until after the launch promise is resolved, like what comment #0 wanted. Because otherwise they'd crash, because I've removed the existing async launch prototype and OtherPid is once again nonblocking.

Priority: P2 → P1

Summary: Defer calling ContentParent::OtherPid after async launching → Make ContentParent::PreallocateProcess asynchronous

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 5

•

7 years ago

Currently my prototype mostly seems to be working, but there are a lot of debug-build failures on Try: either because of massive memory leaks including at least one ContentParent, or asserting because NSS failed to shut down (possibly something holding it open), and "ASSERTION: Component Manager being held past XPCOM shutdown" is also seen. So this points to a process being leaked mid-launch somehow. I thought it might have been a race between shutdown and an in-progress launch, so I tried using an async shutdown blocker, but that caused more shutdown crashes because the blocker never un-blocked. That implies that either the RunPerformAsyncLaunch runnable sent to the I/O thread was never run (which shouldn't be possible without an assertion failure, unless there's an edge case that's not covered) or the main thread failed to receive the promise `Then` runnable (which shouldn't be possible because the shutdown blocker should have been applied early enough that event loops still work).

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 6

•

7 years ago

Oh. MozPromiseHolder::Re{solve,ject} clear the promise pointer, so that if your Ensure()->Then() happens before them then it works, but afterwards it chains onto a new promise that will never resolve. (In this case, the main thread was preempted in ContentParent::LaunchSubprocessAsync until after the I/O thread was already finished launching.) This seems like a footgun, so I filed a bug. This comment brought to you by rr chaos mode.

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 7

•

7 years ago

Attached patch WIP (2018-08-29) (obsolete) — Details — Splinter Review

Work in progress. This finally seems to pass Try, but it needs some cleanup (and to be assembled into a number of patches greater than 1 but less than what's on my local branch).

Attachment #9005092 - Flags: review-

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

7 years ago

Blocks: 1487287

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

6 years ago

Depends on: 1478145

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

6 years ago

See Also: → 1126681

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

6 years ago

Depends on: 1488993

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

6 years ago

Depends on: 1488994

Mike Conley (:mconley) (:⚙️)

Comment 8

•

6 years ago

Are you still working on this, jld?

Flags: needinfo?(jld)

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 9

•

6 years ago

Attached file Bug 1446161 - Remove CONTENT_PROCESS_LAUNCH_TIME_MS telemetry. — Details

The CONTENT_PROCESS_LAUNCH_TIME_MS histogram is currently gathering times from two different spans of the launch process and mixing them together; it's at best a rough approximation of "launch time". In addition, with async launch we'll want to gather different metrics than for sync launch (see comments on bug 1474991); therefore, I've taken the opportunity to create a new histogram for sync launch time. Depends on D8940

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 10

•

6 years ago

Attached file Bug 1446161 - Remove the earlier attempt at async launch. — Details

The first attempt at async launch tried to hide the asynchrony inside IPC, by making the process seem to be launched enough to construct new channels and send it messages, and lazily blocking on the pid/handle. Unfortunately, in practice we wind up needing the pid/handle immediately, and this requirement is too deeply embedded in IPC for that to be viable. Depends on D8941

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 11

•

6 years ago

Attached file Bug 1446161 - Asynchronously launch preallocated content processes using MozPromise. — Details

There are several layers to this patch: 1. GeckoChildProcessHost now exposes a promise that's resolved when the process handle is available (or rejected if launch failed), as a nonblocking alternative to LaunchAndWaitForProcessHandle. 2. ContentParent builds on this with the private method LaunchSubprocessAsync and the public method PreallocateProcessAsync; synchronous launch continues to exist for the regular on-demand launch path, for the time being. 3. PreallocatedProcessManager now uses async launch, and handles the new "launch in progress" state appropriately. Depends on D8942

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

6 years ago

Attachment #9005092 - Attachment is obsolete: true

Flags: needinfo?(jld)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Updated

•

6 years ago

Blocks: 1500150

Phabricator Automation

Updated

•

6 years ago

Attachment #9017747 - Attachment description: Bug 1446161 - Replace CONTENT_PROCESS_LAUNCH_TIME_MS telemetry with CONTENT_PROCESS_SYNC_LAUNCH_MS. → Bug 1446161 - Remove CONTENT_PROCESS_LAUNCH_TIME_MS telemetry.

Comment 12

•

6 years ago

Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/37bf52f0e9cf Remove CONTENT_PROCESS_LAUNCH_TIME_MS telemetry. r=mconley,chutten https://hg.mozilla.org/integration/autoland/rev/1bdf64b29121 Remove the earlier attempt at async launch. r=spohl,mccr8 https://hg.mozilla.org/integration/autoland/rev/80116391b7fe Asynchronously launch preallocated content processes using MozPromise. r=mccr8

Comment 13

•

6 years ago

Backout by apavel@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b0e69b136883 Backed out 8 changesets (bug 1446161, bug 1487287, bug 1488993, bug 1474991, bug 1496608) for very frequent automation.py crashes on a CLOSED TREE

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 14

•

6 years ago

The backout is because of the assertions I mentioned in https://phabricator.services.mozilla.com/D8943#inline-53846, but I think the case where they fail was introduced in bug 1487287; it looks like the I/O thread can get woken up and do OnChannelConnected before the launch thread returns from launching the process. (See also bug 1496608 comment #3 which links to a failure.) I did push the entire stack to Try after adding those assertions, but maybe I needed to run more tests to catch this. It's possible that the rest of this could be relanded safely, without bug 1487287, but I'll want to do some more testing to verify that.

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 15

•

6 years ago

I did some Try runs. Win10 ASan seems to be the most reliable way to reproduce the assertion failure, but 3x mochitests got me only two of those: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e24fe4cc9bc7fd3af10f4e8fe16198cea7379229&selectedJob=214407005 So I did 3x and then another 5x: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6eff48a0a73bc2379ab372e3935928e01226b522 https://treeherder.mozilla.org/#/jobs?repo=try&revision=e3fd17b6d744cf06fc01470481c83b535ef155d4 And all clear so far; the failures all look unrelated. I'm going to try to reland.

Comment 16

•

6 years ago

Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/25b4fcd4382d Remove CONTENT_PROCESS_LAUNCH_TIME_MS telemetry. r=mconley,chutten https://hg.mozilla.org/integration/autoland/rev/3cbd2cf23c86 Remove the earlier attempt at async launch. r=spohl,mccr8 https://hg.mozilla.org/integration/autoland/rev/9f5e62b65302 Asynchronously launch preallocated content processes using MozPromise. r=mccr8

Cristina Coroiu [:ccoroiu]

Comment 17

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/25b4fcd4382d https://hg.mozilla.org/mozilla-central/rev/3cbd2cf23c86 https://hg.mozilla.org/mozilla-central/rev/9f5e62b65302

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox65: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla65

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

6 years ago

Depends on: 1510934

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

status-firefox61: affected → ---

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

6 years ago

Blocks: 1517781

Mike Conley (:mconley) (:⚙️)

Updated

•

6 years ago

Blocks: 1521839

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

3 years ago

Performance Impact: --- → P1

Whiteboard: [qf:p1:f64][fxperf:p1] → [fxperf:p1]

You need to log in before you can comment on or make changes to this bug.