Closed Bug 1363601 Opened 2 years ago Closed 2 years ago

Crash in mozilla::ipc::MessageChannel::Clear from ~PContentParent()

Categories

(Core :: DOM: Content Processes, defect, critical)

x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 + fixed
firefox56 --- fixed

People

(Reporter: ting, Assigned: gkrizsanits)

References

Details

(Keywords: crash, regression, Whiteboard: [e10s-multi:?])

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is 
report bp-c3a28f8d-7da4-4940-9288-f42a90170510.
=============================================================
Top #12 of Nightly 20170507030205 on Windows, 14 reports from 11 installations. The moz crash reason is:

  MOZ_CRASH(MessageChannel destroyed without being closed)
I think this is the set of changesets added in the 5-4 Nightly, though nothing obviously related jumps out at me:
  https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=82c2d17e74ef9cdf38a5d5ac4eb3ae846ec30ba4&tochange=33b92d9c40562dab3d7b602368c75619f1d793f7

92 of these crashes look more or less like the crash in comment 0, from ~ContentParent, getting run in the FreeSnowWhite call during CC shutdown.

There are another 40 from the GPU process. I'll file another bug for that.
Crash Signature: [@ mozilla::ipc::MessageChannel::Clear] → [@ mozilla::ipc::MessageChannel::Clear] [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent ]
In the last 7 days there have been 89 crashes with the signature in Firefox 55.0a1, across 70 installations, making it #15 top crasher.
[Tracking Requested - why for this release]: this seems to have risen to the #2 browser crash in 55.0b9
Keywords: regression
Hi Andrew and Bill,
Could either of you help take a look at this? The rising crash volume is concerning for 55. Thanks!
Flags: needinfo?(continuation)
tracking as topcrash in beta55
I'm not entirely sure what this assertion is, but 100% of the beta crashes have ShutdownProgress equal to xpcom-shutdown
(The first signature is no longer present, due to skip list changes.)
Crash Signature: [@ mozilla::ipc::MessageChannel::Clear] [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent ] → [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent ]
This assertion seems like it would be racy for PContentParent. I think when we're shutting down, at some point we call PContentParent::ShutDownProcess, which sends a message to the child and starts a force kill timer. Is there anything that is going to prevent shutdown from going along further until either we get a reply from the child or the force kill timer runs? Though somehow the refcount of ContentParent is hitting zero.

This comment in ContentParent::ActorDestroy seems like it might be an issue:
    // If we shut down normally but haven't called Close, assume somebody
    // else called Close on us. In that case, we still need to call
    // ShutDownProcess below to perform other necessary clean up.

Anyways, I don't think I know enough about this to do more, so hopefully Bill will have a chance to look at it.
Flags: needinfo?(continuation)
the increase on beta seems to coincide with fixing the rollout of e10s-multi in bug 1380725.
this is the crashing graph on beta for the e10s cohort "multibucket4": http://bit.ly/2u9EuPO
Crash Signature: [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent ] → [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent ] [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent…
Whiteboard: [e10s-multi:?]
Crash Signature: [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent ] [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent… → [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent ]
Since I remember hitting this in a very edge case scenario, with the preallocated process manager (which was fixed before landing it) I'm considering turning it off completely on beta if we don't find another solution to see if that helps here.
Based on the difference between nightly and beta, I wonder if this was caused by an uplift.

Since all the crashes are from Windows I wonder if this is somehow related to the GPU process. Dave, you uplifted a modified version of the patch from bug 1374258 and I was wondering if that could be related.
Flags: needinfo?(dvander)
FWIW, I don't think this can be the GPU process. This is pretty clearly a ContentParent (and PContentParent) that is dying and the GPU process doesn't use either. I've been staring at the code for a while now and I don't understand how this is possible. All ContentParents that have Init called on them listen for xpcom-shutdown and when they receive the notification, they synchronously wait for the process to shut down [1]. I spent a bunch of time trying to find a path through the ContentParent shutdown code that didn't end either by crashing the process or by calling Close and couldn't find anything (though I don't understand how the code to force a crash actually works).

I also looked at the differences between Nightly and Beta in the preallocated process manager and didn't see anything that jumps out. We added some checks that we don't try to reuse a shutdown content process, but the problem here is that we have a process that was never shut down. I also tried verifying that if we get into ContentParent::ActorDestroy (where we unregister from watching xpcom-shutdown) that it must mean that the MessageChannel was closed (either in error or not) and it appeared to be.

One Hail Mary idea I had was to try the sync wait in xpcom-shutdown in the process manager, but I really don't understand how it could be needed.

[1] http://searchfox.org/mozilla-central/rev/3a3af33f513071ea829debdfbc628caebcdf6996/dom/ipc/ContentParent.cpp#2760,2768
(In reply to Blake Kaplan (:mrbkap) from comment #14)
> FWIW, I don't think this can be the GPU process. This is pretty clearly a
> ContentParent (and PContentParent) that is dying and the GPU process doesn't
> use either. I've been staring at the code for a while now and I don't

You're right, I set that flag before we validated the protocol name for the new crashes.
So, I'm unchecking the needinfo flag for David.

> understand how this is possible. All ContentParents that have Init called on
> them listen for xpcom-shutdown and when they receive the notification, they
> synchronously wait for the process to shut down [1]. I spent a bunch of time
> trying to find a path through the ContentParent shutdown code that didn't
> end either by crashing the process or by calling Close and couldn't find
> anything (though I don't understand how the code to force a crash actually
> works).
> 
> I also looked at the differences between Nightly and Beta in the
> preallocated process manager and didn't see anything that jumps out. We
> added some checks that we don't try to reuse a shutdown content process, but
> the problem here is that we have a process that was never shut down. I also
> tried verifying that if we get into ContentParent::ActorDestroy (where we
> unregister from watching xpcom-shutdown) that it must mean that the
> MessageChannel was closed (either in error or not) and it appeared to be.
> 
> One Hail Mary idea I had was to try the sync wait in xpcom-shutdown in the
> process manager, but I really don't understand how it could be needed.
> 
> [1]
> http://searchfox.org/mozilla-central/rev/
> 3a3af33f513071ea829debdfbc628caebcdf6996/dom/ipc/ContentParent.cpp#2760,2768

Actually there are some crashes on nightly as well just less frequent. I think
you were checking the right thing and what occurred to me is that the shutdown
listener for the preallocated process is triggered first because it listens to
"profile-change-teardown" [1] which comes before the "profile-before-change" a
regular content parent listens to. And for the preallocated process we don't
do the sync waiting for process shutdown, just this: [2]. I'll write a patch.

[1]: http://searchfox.org/mozilla-central/rev/a83a4b68974aecaaacdf25145420e0fe97b7aa22/dom/ipc/PreallocatedProcessManager.cpp#134
[2]: http://searchfox.org/mozilla-central/rev/a83a4b68974aecaaacdf25145420e0fe97b7aa22/dom/ipc/PreallocatedProcessManager.cpp#284
Flags: needinfo?(dvander)
Assignee: nobody → gkrizsanits
Attachment #8888374 - Flags: review?(mrbkap)
Attachment #8888374 - Flags: review?(mrbkap) → review+
Approval Request Comment
[Feature/Bug causing the regression]: preallocated process manager
[User impact if declined]: frequent crashes
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: it's already turned off on nightly for activity stream
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: it's an undertested feature I'm just turning it off
[String changes made/needed]: none
Attachment #8888388 - Flags: approval-mozilla-beta?
Comment on attachment 8888388 [details] [diff] [review]
beta uplift version

Turning this off as it causes crashes with e10s-multi.
Attachment #8888388 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #17)
> [Is this code covered by automated tests?]: yes
> [Has the fix been verified in Nightly?]: it's already turned off on nightly
> for activity stream
> [Needs manual test from QE? If yes, steps to reproduce]: no

Setting qe-verify- based on Gabor's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
Pushed by gkrizsanits@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1044cf0518e9
Disabling the preallocated process manager on all channels. r=mrbkap
https://hg.mozilla.org/mozilla-central/rev/1044cf0518e9
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
On nightly, I had already turned off the preallocated process manager in bug 1381804. There were a couple talos performance regression/improvements associated to it.

tp5 memory improvement: bug 1381804 comment 21
damp regression: bug 1382608
Depends on: 1382608
No longer depends on: 1383830
I think this will do the trick. I will have to fix bug 1376895 before I can re-enable the ppm on nightly though... but on try it looks OK with the ppm on, so I would like to land it.

(also removing the needinfo flag from Bill)
Flags: needinfo?(wmccloskey)
Attachment #8890877 - Flags: review?(mrbkap)
Comment on attachment 8890877 [details] [diff] [review]
ppm shutdownfix. v1

Review of attachment 8890877 [details] [diff] [review]:
-----------------------------------------------------------------

Either we should re-open this bug or track this patch in a new bug. It's odd to me to work in a FIXED bug like this.

I have a question I'd like the answer to before stamping this.

::: dom/ipc/PreallocatedProcessManager.cpp
@@ +134,5 @@
>        os->RemoveObserver(this, "profile-change-teardown");
>      }
> +    // Let's prevent any new preallocated processes to start, but also,
> +    // we should let the ConentParent to handle the shutdown of the existing
> +    // process at this point.

"Let's prevent any new preallocated processes from starting." "ContentParent".

Do we have to worry about the strong ref to mPreallocatedProcess not getting cleared any more and holding the ContentParent alive through shutdown?
Attachment #8890877 - Flags: review?(mrbkap)
(In reply to Blake Kaplan (:mrbkap) from comment #26)
> 
> Either we should re-open this bug or track this patch in a new bug. It's odd
> to me to work in a FIXED bug like this.

Oh, I thought I reopened it... anyway, even if this patch is the real fix probably for this crash I think we should do this in a new bug. I'll filed: bug 1385252.
Pushed by gkrizsanits@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c91659401a6d
Let the ContentParent destroy the preallocated process during shutdown. r=mrbkap
My intention was to land this patch under bug 1385252 :( sorry for the confusion I caused here.
See Also: → 1423261
See Also: → 1385252
No longer blocks: 1349699
Depends on: 1349699
You need to log in before you can comment on or make changes to this bug.