Closed Bug 1562763 Opened 6 months ago Closed 5 months ago

Isolate child process launching into a helper class

Categories

(Core :: IPC, task, P1)

task

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox70 --- fixed

People

(Reporter: bholley, Assigned: bholley)

References

Details

Attachments

(6 files)

At present, all the launching takes place in methods on GeckoChildProcessHost. This means sharing state across the main thread, the IO thread, and the launcher pool, and no clear separation between state needed for launching versus steady-state operation.

Cleaning this up is the first step for bug 1562762.

Having to namespace these into GeckoChildProcessHost is annoying. The
|using| declarations help to some extent, but it's easier to just put
them in mozilla::ipc.

This is intended to facilitate future work to break up
PerformAsyncLaunch into asynchronous pieces. It also aims to fix some
raciness around access to state in the GeckoChildProcessHost instance.
There appears to be at least one data race in the existing code, in which
OpenPrivilegedHandle are run on both the launcher thread and the IO
thread without synchronization.

There's one failure on mac, which I finally managed to reproduce locally via an opt build, only when running xpcshell tests in parallel, e.g. ./mach xpcshell-test browser/extensions/formautofill/

Thankfully, this one has a few more useful symbols than what I saw on try:

Thread 0 (crashed)
0 XUL!InvalidArrayIndex_CRASH(unsigned long, unsigned long) [nsTArray.cpp : 27 + 0x1e]
rax = 0x0000000103d3e340 rdx = 0x0000000000000000
rcx = 0x0000000103d3dd48 rbx = 0x00007ffeec268bc0
rsi = 0x0000000000000022 rdi = 0x0000000000000000
rbp = 0x00007ffeec268ae0 rsp = 0x00007ffeec268ae0
r8 = 0x00007ffeec268838 r9 = 0x000000000000000a
r10 = 0x00007ffeec268ab0 r11 = 0xfffffffff5184701
r12 = 0x0000000103f35dd0 r13 = 0x0000000103f828c0
r14 = 0x0000000103f828c8 r15 = 0x0000000103f72020
rip = 0x000000010981638e
Found by: given as instruction pointer in context
1 XUL!mozilla::dom::ContentChild::RecvInitRendering(mozilla::ipc::Endpoint<mozilla::layers::PCompositorManagerChild>&&, mozilla::ipc::Endpoint<mozilla::layers::PImageBridgeChild>&&, mozilla::ipc::Endpoint<mozilla::gfx::PVRManagerChild>&&, mozilla::ipc::Endpoint<mozilla::PRemoteDecoderManagerChild>&&, nsTArray<unsigned int>&&) [ContentChild.cpp : 1483 + 0x9]
rbp = 0x00007ffeec268b20 rsp = 0x00007ffeec268af0
rip = 0x000000010bd642c5

I'll attach the full minidump output.

(In reply to Bobby Holley (:bholley) from comment #3)

1 XUL!mozilla::dom::ContentChild::RecvInitRendering(mozilla::ipc::Endpoint<mozilla::layers::PCompositorManagerChild>&&, mozilla::ipc::Endpoint<mozilla::layers::PImageBridgeChild>&&, mozilla::ipc::Endpoint<mozilla::gfx::PVRManagerChild>&&, mozilla::ipc::Endpoint<mozilla::PRemoteDecoderManagerChild>&&, nsTArray<unsigned int>&&) [ContentChild.cpp : 1483 + 0x9]

So that's here:

  if (!CompositorManagerChild::Init(std::move(aCompositor), namespaces[0])) {

where namespaces comes from GPUProcessManager::CreateContentBridges in the parent process, and that would yield an empty array if any of it fails, and that failure is checked only by debug assertions.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #5)

where namespaces comes from [GPUProcessManager::CreateContentBridges][loc1] in the parent process, and that would yield an empty array if any of it fails, and that failure is checked only by debug assertions.

Yeah. So the opt-only nature if this is either (a) there's raciness around when CreateContentBridges is called and whether or not it succeeds, (b) the MOZ_ASSERT in the child process happens early enough that it triggers a relaunch when it's hit, or both.

In any case, I have enough to debug this now.

We end up bailing out here because the compositor thread isn't set up yet: https://searchfox.org/mozilla-central/rev/8a990595ce6d5ed07ace2d4d5d86cc69aec90bde/gfx/layers/ipc/CompositorManagerParent.cpp#62

So presumably there's a latent race here. I'll finish debugging tomorrow.

So this is the stack where CreateContentBridges fails. The key point is that we're in CompositorThreadHolder::Shutdown, having nulled out the compositor thread and spinning the event loop until shutdown is complete. ContentParent::InitInternal is being called slightly later with my patches (more promises -> more round-trips through the event loop), and therefore it happens after we've already started shutting the compositor thread down.

The Right Solution, I think, is to make ContentParent initialization fallible.

Depending on the timing of events, we can invoke InitInternal after
we've already torn down the compositor thread, which can make
CreateContentBridges fail, which in turn makes us unable to make a
correct InitRendering call. Implement the infrastructure to bail out in
this case.

Priority: -- → P1
Blocks: 1563832

This can happen if the parent starts shutting down before a child has
finished starting up.

The main thread case in the child has always just warned, and this MOZ_CRASH was
added recently to the dispatch case in [1]. This can happen if the
parent starts shutting down while the child is still initializing.

[1] https://hg.mozilla.org/mozilla-central/rev/bd8fd9a22dc08d63cec68de475182a1b8eeeca4b

And here's a try run showing mac xpcshell green across retriggers: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d8e10b55fb38de49d70e92e607fad3b92ff9f192

No longer blocks: 1562762
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/073b4d39b0f3
Don't crash child or parent on failed InitBackground. r=nika
https://hg.mozilla.org/integration/autoland/rev/ae890583cd76
Make ContentParent::InitInternal fallible. r=jld
https://hg.mozilla.org/integration/autoland/rev/6035908e789b
Move some types around. r=jld
https://hg.mozilla.org/integration/autoland/rev/11d1d46d11fb
Factor process launching into a helper class, and use more promises. r=jld
You need to log in before you can comment on or make changes to this bug.