Closed Bug 1408532 Opened 2 years ago Closed 2 years ago

Crash in shutdownhang | mozilla::layers::CompositorThreadHolder::Shutdown

Categories

(Core :: Graphics: Layers, defect, P2, critical)

57 Branch
All
Windows 7
defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- fixed

People

(Reporter: philipp, Assigned: aosmond)

References

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(1 file, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #1402592 +++

the patch for bug 1402592 landing in 57.0b4 has greatly reduced the volume of these crashes but they are still occurring at a rate of ~50 crashes per day on current beta versions which makes it the #5 top shutdownhang.

all the crashes are from windows 7 installations.

Correlations for Firefox Beta
(42.20% in signature vs 00.98% overall) GFX_ERROR "Killing GPU process due to IPC reply timeout" = true [99.34% vs 03.09% if process_type = null]
(38.72% in signature vs 02.71% overall) GFX_ERROR "Receive IPC close with reason=AbnormalShutdown" = true
giving the bug a better title...
Summary: Crash in shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS | mozilla::detail::ConditionVariableImpl::wait | mozilla::CondVar::Wait | mozilla::ThreadEventQueue<T>::GetEvent → Crash in shutdownhang | mozilla::layers::CompositorThreadHolder::Shutdown
Not sure we have a solution that's appropriate for uplift, but maybe David is up to speed with Andrew's work (Andrew is back on 10/23)
Flags: needinfo?(dvander)
Flags: needinfo?(aosmond)
Priority: -- → P2
Whiteboard: [gfx-noted]
A CompositorThreadHolder is still being leaked. Without Andrew's earlier instrumentation patch it's hard to say who is responsible. The typical pattern is we scope the thread's addref/release to the lifetime of an actor, but maybe that's not restrictive enough. Perhaps it should be acquired after the bind (when IPDL is connected), and released in ActorDestroy (when IPDL is disconnected). Then even if an actor is leaked there won't be a shutdown hang.

That would be an easy patch to write... alternately, having MessageChannel do the AddRef/Release would be best but I'm not too comfortable uplifting that change quickly.
Flags: needinfo?(dvander)
Attached patch patch (obsolete) — Splinter Review
This is a short in the dark, but assuming that we're leaking something, it would force us to release the compositor thread ref when IPDL disconnects, instead of when the actor is destroyed. I think that should be safe.

This won't fix anything if the bug is we hold a compositor thread ref, but don't attach to IPDL. I did an audit of the code and couldn't immediately see that that's possible, but I think having the MessageChannel directly hold a CompositorThread ref would make that problem easier.

If the problem is still that IPDL is not seeing a disconnect *at all*, then none of this will do anything.
Attachment #8919158 - Flags: review?(wmccloskey)
Flags: needinfo?(aosmond)
See Also: → 1399453
https://crash-stats.mozilla.com/report/index/34506a35-9117-4a7a-9bc1-66dec0171026#tab-metadata

We only have one in thus far (low reproduction rate on nightly), but this crash report blames ImageBridgeParent.
Comment on attachment 8919158 [details] [diff] [review]
patch

Review of attachment 8919158 [details] [diff] [review]:
-----------------------------------------------------------------

I don't really understand why this would help, but I don't see how it could hurt either.
Attachment #8919158 - Flags: review?(wmccloskey) → review+
This is my stab at fixing it. As described in the patch comments, I suspect the process ID is getting reused, we create a new ImageBridgeParent for that process, and the IPDL never called ImageBridgeParent::ActorDestroy to release the old one. This would mean the compositor thread never gets released and causes a hang. We know for whatever reason the content process shutdown doesn't always release the IPDL objects (hence why we've needed to start adding Close calls in the parent -- which is where we want to go anyways in truth). We also know that the GPU process can get the same process ID when restarted as I observed this in the wild with the CompositorManager* changes. No reason why it couldn't happen with the content processes.
Assignee: nobody → aosmond
Status: NEW → ASSIGNED
Attachment #8924539 - Flags: review?(dvander)
Attachment #8924539 - Flags: review?(dvander) → review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/05676eb12f34
Ensure ImageBridgeParent cleans up old actors if the process ID is reused. r=dvander
https://hg.mozilla.org/mozilla-central/rev/05676eb12f34
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Looks like this was indeed the reason for the hangs.
Doesn't match the esr52 criteria
You need to log in before you can comment on or make changes to this bug.