Closed Bug 1408532 Opened 3 years ago Closed 3 years ago
Crash in shutdownhang | mozilla::layers::Compositor
+++ This bug was initially created as a clone of Bug #1402592 +++ the patch for bug 1402592 landing in 57.0b4 has greatly reduced the volume of these crashes but they are still occurring at a rate of ~50 crashes per day on current beta versions which makes it the #5 top shutdownhang. all the crashes are from windows 7 installations. Correlations for Firefox Beta (42.20% in signature vs 00.98% overall) GFX_ERROR "Killing GPU process due to IPC reply timeout" = true [99.34% vs 03.09% if process_type = null] (38.72% in signature vs 02.71% overall) GFX_ERROR "Receive IPC close with reason=AbnormalShutdown" = true
giving the bug a better title...
Summary: Crash in shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS | mozilla::detail::ConditionVariableImpl::wait | mozilla::CondVar::Wait | mozilla::ThreadEventQueue<T>::GetEvent → Crash in shutdownhang | mozilla::layers::CompositorThreadHolder::Shutdown
Not sure we have a solution that's appropriate for uplift, but maybe David is up to speed with Andrew's work (Andrew is back on 10/23)
A CompositorThreadHolder is still being leaked. Without Andrew's earlier instrumentation patch it's hard to say who is responsible. The typical pattern is we scope the thread's addref/release to the lifetime of an actor, but maybe that's not restrictive enough. Perhaps it should be acquired after the bind (when IPDL is connected), and released in ActorDestroy (when IPDL is disconnected). Then even if an actor is leaked there won't be a shutdown hang. That would be an easy patch to write... alternately, having MessageChannel do the AddRef/Release would be best but I'm not too comfortable uplifting that change quickly.
This is a short in the dark, but assuming that we're leaking something, it would force us to release the compositor thread ref when IPDL disconnects, instead of when the actor is destroyed. I think that should be safe. This won't fix anything if the bug is we hold a compositor thread ref, but don't attach to IPDL. I did an audit of the code and couldn't immediately see that that's possible, but I think having the MessageChannel directly hold a CompositorThread ref would make that problem easier. If the problem is still that IPDL is not seeing a disconnect *at all*, then none of this will do anything.
Attachment #8919158 - Flags: review?(wmccloskey)
See Also: → 1399453
https://crash-stats.mozilla.com/report/index/34506a35-9117-4a7a-9bc1-66dec0171026#tab-metadata We only have one in thus far (low reproduction rate on nightly), but this crash report blames ImageBridgeParent.
Comment on attachment 8919158 [details] [diff] [review] patch Review of attachment 8919158 [details] [diff] [review]: ----------------------------------------------------------------- I don't really understand why this would help, but I don't see how it could hurt either.
Attachment #8919158 - Flags: review?(wmccloskey) → review+
This is my stab at fixing it. As described in the patch comments, I suspect the process ID is getting reused, we create a new ImageBridgeParent for that process, and the IPDL never called ImageBridgeParent::ActorDestroy to release the old one. This would mean the compositor thread never gets released and causes a hang. We know for whatever reason the content process shutdown doesn't always release the IPDL objects (hence why we've needed to start adding Close calls in the parent -- which is where we want to go anyways in truth). We also know that the GPU process can get the same process ID when restarted as I observed this in the wild with the CompositorManager* changes. No reason why it couldn't happen with the content processes.
Assignee: nobody → aosmond
Status: NEW → ASSIGNED
Attachment #8924539 - Flags: review?(dvander)
Attachment #8919158 - Attachment is obsolete: true
Attachment #8924539 - Flags: review?(dvander) → review+
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/05676eb12f34 Ensure ImageBridgeParent cleans up old actors if the process ID is reused. r=dvander
Looks like this was indeed the reason for the hangs.
You need to log in before you can comment on or make changes to this bug.