Closed Bug 1356365 Opened 3 years ago Closed 2 years ago

Crash in mozilla::ipc::MessageChannel::WillDestroyCurrentMessageLoop

Categories

(Core :: IPC, defect, P1, critical)

55 Branch
x86
All
defect

Tracking

()

RESOLVED DUPLICATE of bug 1398070
Tracking Status
firefox55 --- affected
firefox56 --- unaffected
firefox57 blocking fixed
firefox58 --- affected

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, topcrash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-bce93404-ea89-4166-a0b5-53f842170413.
=============================================================

Seen while looking at nightly crash data - crashes started using 20170413030227 build: http://bit.ly/2pbBhxo

Possible regression window based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f40e24f40b4c4556944c762d4764eace261297f5&tochange=819a666afddc804b6099ee1b3cff3a0fdf35ec15

Bug 1349699? ni on billm
Flags: needinfo?(wmccloskey)
Any chance you could look at this David? It's an assertion where the compositor thread shuts down before the PCompositorBridgeParent protocol has closed (I had to click on the raw JSON file to find the ProtocolName field). I guess this user doesn't have a GPU process.
Component: IPC → Graphics
Flags: needinfo?(wmccloskey) → needinfo?(dvander)
This is now the top browser crash on Nightly.
Keywords: topcrash
Attached patch make DEBUG-onlySplinter Review
For now let's make this DEBUG-only. There's no reason to crash everybody.
Assignee: nobody → wmccloskey
Attachment #8859373 - Flags: review?(continuation)
Attachment #8859373 - Flags: review?(continuation) → review+
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/edda32a96d46
Make MessageChannel::WillDestroyCurrentMessageLoop assertion DEBUG-only (r=mccr8)
Attached patch fixSplinter Review
Milan managed to accidentally reproduce this on try by enabling the GPU process in more places, and I think this is the likely cause. VideoBridgeParent and VsyncBridgeParent both run on the compositor thread, but don't retain a reference to CompositorThreadHolder. As long the holder is alive, the compositor thread will spin an event loop to wait for other threads to receive ActorDestroys.

The fix is to just hold a ref in both of these actors.
Assignee: wmccloskey → dvander
Status: NEW → ASSIGNED
Flags: needinfo?(dvander)
Attachment #8859919 - Flags: review?(matt.woodrow)
If this is the problem, why would the protocol be reported as PCompositorBridgeParent?
Probably we're seeing two different bugs, then. By xpcom-shutdown time, it's too late to report crash metadata in the GPU process, so I had to guess the protocol by looking at the stack.

This does appear to fix the issue Milan saw on try.
Attachment #8859919 - Flags: review?(matt.woodrow) → review+
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9fd7a385411f
Fix a race condition between VideoBridge and Compositor thread shutdown. (bug 1356365, r=mattwoodrow)
No crashes after 20170418030220 so far, which doesn't really line up to when this landed. But the crash rate dropped quite a bit after 4/21.
I disabled the assertion on the 18th, so we need to re-enable it to see what was fixed.
See Also: → 1398070
Assignee: dvander → nobody
Status: ASSIGNED → NEW
[Tracking Requested - why for this release]:

This bug is a top crasher.
OS: Windows 7 → All
This is the #1 top crasher for Nightly.

Top Crashers for Firefox 57.0a1
Top 50 Crashing Signatures. 7 days ago through 34 minutes ago. 

#1 	
15.42% 	
-7.29% 	
mozilla::ipc::MessageChannel::WillDestroyCurrentMessageLoop
application
	1452 	1437 	15 	0 	1251 	0 	2017-04-13
Huge volume of crash, marking it as blocking
Milan, do you know who could help on this?
Thanks
Flags: needinfo?(milan)
It's the MessageLoop destroyed before MessageChannel ... MOZ_CRASH firing.  Bill, David, this seems to have started recently (I think the oldest 57 build I see with this is 20170906100107 from https://crash-stats.mozilla.com/report/index/b76c1a69-a685-40c4-976d-d56530170909
Assignee: nobody → dvander
Flags: needinfo?(wmccloskey)
Flags: needinfo?(milan)
Flags: needinfo?(dvander)
Priority: -- → P1
Whiteboard: [gfx-noted]
I'm betting on the sandboxing change in bug 1229829.
Flags: needinfo?(jmathies)
Assignee: dvander → nobody
Component: Graphics → Security: Process Sandboxing
(In reply to Milan Sreckovic [:milan] from comment #17)
> I'm betting on the sandboxing change in bug 1229829.

Is this based on anything other than it was close?

It's certainly possible that the change would cause potentially odd regressions, but it landed in Nightly 20170905220108 and the number ramped up in 20170907220212 and has stayed consistently high since then.

It seems odd that we didn't get a similar number for any of 20170905220108, 20170906100107, 20170906220306 or 20170907100318.
Flags: needinfo?(milan)
Depends on: 1402340
Flags: needinfo?(jmathies)
The new offending protocol in comment #19 is PGMPParent.
Flags: needinfo?(dvander)
(In reply to Bob Owen (:bobowen) from comment #18)
> ...
> 
> It seems odd that we didn't get a similar number for any of 20170905220108,
> 20170906100107, 20170906220306 or 20170907100318.

https://crash-stats.mozilla.com/report/index/b76c1a69-a685-40c4-976d-d56530170909 is 20170906100107

But, no, I don't have any data that points to sandboxing beyond the timing.
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #21)
> (In reply to Bob Owen (:bobowen) from comment #18)
> > ...
> > 
> > It seems odd that we didn't get a similar number for any of 20170905220108,
> > 20170906100107, 20170906220306 or 20170907100318.
> 
> https://crash-stats.mozilla.com/report/index/b76c1a69-a685-40c4-976d-
> d56530170909 is 20170906100107

Right that's only 1 report though, this also seemed to start on Mac in build 20170907220212.
So my guess is this isn't the alternate desktop, but we're going to back it out of Beta anyway for bug 1400637, so we'll know one way or the other soon.
Component: Security: Process Sandboxing → IPC
The recent spike was caused by the landing of bug 1395330. It's not related to sandboxing. There's more discussion in bug 1398070.
Flags: needinfo?(wmccloskey)
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1398070
No longer depends on: 1402340
Mirror the status of bug 1398070.
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.