Crash in mozilla::ipc::MessageChannel::WillDestroyCurrentMessageLoop

RESOLVED FIXED in Firefox 57

Status

()

defect
P1
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: calixte, Assigned: billm)

Tracking

(Depends on 1 bug, Blocks 1 bug, 4 keywords)

57 Branch
mozilla58
Unspecified
All
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox55 unaffected, firefox56+ unaffected, firefox57+ fixed, firefox58 fixed)

Details

(crash signature)

Attachments

(1 attachment)

This bug was filed from the Socorro interface and is 
report bp-1765336c-d113-4cb1-b2ef-6fafc0170908.
=============================================================

There are 17 crashes in nightly 57 with buildid 20170907220212.
The pushlog for this nightly:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d8e238b811d3dc74515065ae8cab6c74baf0295f&tochange=b4c1ad9565ee9d00d96501c4a83083daf25c1413

For information, the ProtocolName field is "PProcessHangMonitorParent".
:billm, could you investigate please ?
Flags: needinfo?(wmccloskey)
See Also: → 1356365
This is the #5 Windows topcrash in Nightly 20170908220146.
Keywords: topcrash
It looks like this is also happening frequently in automation.
See Also: → 1388764
Any chance you could take a look at this, Blake? I'm on PTO this week.

If necessary, we can back out bug 1395330 (for the second time). But it would be nice to know what's happening with ProcessHangMonitor. The code looks correct to me on quick inspection. I wonder if this is a side effect of a content process being shut down too late somehow.
Flags: needinfo?(wmccloskey)
Er, see comment 3.
Flags: needinfo?(mrbkap)
Assignee: nobody → mrbkap
Flags: needinfo?(mrbkap)
Pretty much all of the reports have MOZ_CRASH(MessageLoop destroyed before MessageChannel that's bound to it) as crash reason.

Tracking as topcrash.
Feels like a P1.
Priority: -- → P1
This is the #4 Windows topcrash in Nightly 20170915100121.
I wonder if bug 1397214 and family could be related to this?
as a note, I have tried to narrow down the failure in reftests in bug 1388764 comment 20.
billm theorized that this has to do with a child process lagging on shutdown. From my reading of the code, that seems to be the only likely cause. However, I don't really understand the ContentParent::KillHard path, so it's taking me some time to dig into what happens in that code. I hope to figure something out by tomorrow.
I'm not sure if the two issues are related, but at least some of the crashes are for the Compositor thread. The offending protocol for these appears to be PImageBridgeParent. We don't do any special shutdown of the protocol from the parent side, only release a static reference to the protocol object, we rely upon the child to call Close(), which clearly didn't happen. I do not know for certain, but I fear this may be related to bug 1389021.
(In reply to Andrew Osmond [:aosmond] from comment #11)
> ... but I fear this may be related to bug 1389021.

In which case the next 56 build starts showing this problem.  Heads up Liz.
Flags: needinfo?(lhenry)
Tracking a potential fix for the Compositor thread crash reports in bug 1401668.
See Also: → 1401668
Flags: needinfo?(lhenry)
(In reply to Andrew Osmond [:aosmond] from comment #13)
> Tracking a potential fix for the Compositor thread crash reports in bug
> 1401668.

This appears to have solved the compositor thread crashes, as we have gotten a good number of reports on 20170921100141, but none for PImageBridgeParent. For what it was worth, this appears to be related to a quick cycle between startup and shutdown, as the solution could only have fixed the crash if we shutdown in the small window between the MessageChannel opening and OnChannelConnected (it does a message post). It is possible the other crashes related to this.
OS: Windows 10 → All
This is the #1 top crasher for Nightly.

Top Crashers for Firefox 57.0a1
Top 50 Crashing Signatures. 7 days ago through 34 minutes ago. 

#1 	
15.42% 	
-7.29% 	
mozilla::ipc::MessageChannel::WillDestroyCurrentMessageLoop
application
	1452 	1437 	15 	0 	1251 	0 	2017-04-13
This patch covers up the problem. Rather than crash, it just causes us to avoid ever posting to the MessageLoop that's been shut down. It's safe in some sense, although it's really just covering up bugs elsewhere. Given the large volume of crashes, I think we should probably do something like this.
Attachment #8911416 - Flags: review?(dvander)
Comment on attachment 8911416 [details] [diff] [review]
wallpaper patch

Review of attachment 8911416 [details] [diff] [review]:
-----------------------------------------------------------------

Agree. Andrew and I discussed another alternative, requiring MessageChannels to be created with a refcounted object, to keep the associated thread alive.
Attachment #8911416 - Flags: review?(dvander) → review+
Duplicate of this bug: 1356365
For information, there are 1248 crashes for 943 installations in 57.0b0 (dev edition).
¡Hola!

This one is in 58 as well.

It bite me twice while trying to help :bobowen pin down https://bugzilla.mozilla.org/show_bug.cgi?id=1316665

bp-7013c932-c6e3-491c-acd2-b35060170925
	9/25/2017	9:10 AM
bp-ce8d1c52-d587-4787-ae0c-3efd70170925
	9/25/2017	8:16 AM

¡Gracias!
Alex
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/05e921246981
Convert WillDestroyCurrentMessageLoop assertion to a safe no-op (r=dvander)
Assignee: mrbkap → wmccloskey
https://hg.mozilla.org/mozilla-central/rev/05e921246981
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Can we flag this for uplift to 57? It's showing up there as well.
Flags: needinfo?(wmccloskey)
how do we know this is fixed?  we still see test automation failures in bug 1388764 :(
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #24)
> how do we know this is fixed?  we still see test automation failures in bug
> 1388764 :(

The patch in this bug simply disabled the crash in non-debug builds, so it is expect that we'll still see it in automation for debug builds.
Depends on: 1403310
To whomever gets to it first, we're building beta4 tomorrow (Thursday) and this crash is one of our top crashes on 57 at the moment. Can we please get an approval request ASAP?
Flags: needinfo?(dvander)
Comment on attachment 8911416 [details] [diff] [review]
wallpaper patch

Normally I would wait a few days to see the effect that this patch has on a number of related crashes. If we need to rush this, though, it's probably pretty safe. At worst we'll reduce overall crash rates but re-introduce a sec-critical crash. If that happens, we could always back the patch out, though.

Approval Request Comment
[Feature/Bug causing the regression]:bug 1395330
[User impact if declined]:crashes, mostly at shutdown
[Is this code covered by automated tests?]:a little bit
[Has the fix been verified in Nightly?]:no
[Needs manual test from QE? If yes, steps to reproduce]: none available
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]:if the patch fails to work, we could re-introduce crashes in PostTask_Helper, which are sec-critical
[Why is the change risky/not risky?]:see above
[String changes made/needed]:none
Flags: needinfo?(wmccloskey)
Attachment #8911416 - Flags: approval-mozilla-beta?
I will note that the crashes on Nightly with this signature stopped after the second 09-25 Nightly build, which was the final one without this patch.
Blocks: 1395330
Has Regression Range: --- → yes
Has STR: --- → no
Keywords: testcase-wanted
Comment on attachment 8911416 [details] [diff] [review]
wallpaper patch

I had a quick chat with Dan and the stability impact here is severe enough to uplift this in beta57 asap.
Attachment #8911416 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(dvander)
Duplicate of this bug: 1395211
You need to log in before you can comment on or make changes to this bug.