Closed Bug 1268559 Opened 4 years ago Closed 3 years ago

[e10s] Shutdown hang in mozilla::layers::CompositorBridgeParent::ShutDown()

Categories

(Core :: Graphics: Layers, defect, P1, critical)

48 Branch
x86_64
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
e10s + ---
firefox46 --- unaffected
firefox47 --- unaffected
firefox48 + fixed
firefox49 + fixed
firefox-esr45 --- unaffected
firefox50 --- fixed

People

(Reporter: rick3162, Assigned: billm)

References

(Depends on 1 open bug)

Details

(4 keywords, Whiteboard: [gfx-noted])

Crash Data

Attachments

(8 files, 1 obsolete file)

Using Nightly x64 in win10 x64, with e10s on.
In my current profile I only have these addons installed:
Greasemonkey 3.8 beta 3,
Adblock Plus 2.7.3 latest beta (and EHH for ABP 1.3.7 latest beta)
and RES 4.6.1.


In the last days, whenever I close Nightly
most of the times the firefox.exe process doesn't exit immediately (as I see in Task Manager),
but after ~10-20 sec it crashes (I get the Mozilla Crash Reporter).

Here is the latest crash report: 
https://crash-stats.mozilla.com/report/index/bp-7d08f9da-7c33-48d0-b6fa-b484b2160428
All such crash reports are entitled: 
"shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent"

The above report doesn't show any relevant bug link in: More Reports|'Bugzilla' tab 
and I couldn't find any relevant bug in bugzilla neither.
It only shows that there have been already ~100 crash reports about this crash so far, all in win 10.




PS. With my current profile having e10s off,
I couldn't recreate it.
Component: General → Untriaged
[Tracking Requested - why for this release]: Regression
Severity: normal → critical
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ]
tracking-e10s: --- → ?
Component: Untriaged → General
Product: Firefox → Core
Summary: firefox.exe doesn't immediately exit when I close Nightly. Instead, It crashes after a few seconds. The crash reports are entitled" "shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent" → [e10s] Firefox crashes at [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ]
Whiteboard: ShutDownKill
https://crash-stats.mozilla.com/report/list?signature=shutdownhang+%7C+WaitForSingleObjectEx+%7C+PR_WaitCondVar+%7C+mozilla%3A%3ACondVar%3A%3AWait+%7C+nsEventQueue%3A%3AGetEvent&range_value=7&range_unit=days&date=2016-04-28


https://crash-stats.mozilla.com/report/index/7d08f9da-7c33-48d0-b6fa-b484b2160428
Crashing Thread (56)
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog
 						toolkit/components/terminator/nsTerminator.cpp
1 	nss3.dll 	PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c
2 	nss3.dll 	pr_root 		nsprpub/pr/src/md/windows/w95thred.c
3 	ucrtbase.dll 	crt_at_quick_exit 	
4 	kernel32.dll 	BaseThreadInitThunk 	
5 	ntdll.dll 	RtlUserThreadStart
Component: General → XUL
Version: Trunk → 48 Branch
Component: XUL → Async Tooling
Product: Core → Toolkit
I looked at about a dozen of these crashes. In almost all of them, the main thread was in mozilla::layers::CompositorBridgeParent::ShutDown(), spinning the event loop or something. This looks like it might be the top crash for the 4/27 Nightly, at least so far.
Component: Async Tooling → Graphics: Layers
Product: Toolkit → Core
This other signature is a minor variation, and is also common.
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] → [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ]
Keywords: topcrash
Summary: [e10s] Firefox crashes at [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] → [e10s] Shutdown hang in mozilla::layers::CompositorBridgeParent::ShutDown()
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] → [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang…
This is the #1 *and* #2 topcrash in Nightly 20160429030215 and Nightly 20160430030223.
Bas, can you please take a look?
Flags: needinfo?(bas)
(In reply to blinky from comment #6)
> Regression range:
> https://hg.mozilla.org/integration/mozilla-inbound/
> pushloghtml?fromchange=fa784f0160f0f5c26714eef795a11fe41454e53f&tochange=4c3b
> 5f9bd44a64ff81a49653881c5415b088a504
> 
> Regressed by: bug 1262898

Ah, presumably this is another IPDL issue where under some circumstances the actor destructions aren't properly called. I'll have a look at this, in the worst case scenario we can decide to leak the Compositor Thread for now, it's better than leaking everything like we used to do. I will have a look at this. Although we'll probably need a reproducible scenario in order to find out what's going wrong in IPDL, I wonder if this happens when a content process crashes.
Assignee: nobody → bas
Flags: needinfo?(bas)
Whiteboard: ShutDownKill → ShutDownKill [gfx-noted]
Removing the "regressionwindow-wanted" keyword based on comment 6.
Hey Milan, this sound serious. Should we block?
Blocks: e10s-gfx
Flags: needinfo?(milan)
Sure, but we should confirm whether this is in 48 or just in 49.  It landed right on 4/25, I don't know if the branch was made before or after it.
Flags: needinfo?(milan)
Bug 1268559 is in 48, per the target milestone flag set by the merge script. Just to confirm, it is in the Aurora repo: https://hg.mozilla.org/releases/mozilla-aurora/rev/4c3b5f9bd44a
> Bug 1268559 is in 48
Sorry, I meant bug 1262898.
This is the #1, #2 and #3 top crash in Nightly 20160501030217, with 75 crashes, which is a *lot*.
Duplicate of this bug: 1267688
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownha… → [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ]
Duplicate of this bug: 1267688
(In reply to blinky from comment #6)
> Regression range:
> https://hg.mozilla.org/integration/mozilla-inbound/
> pushloghtml?fromchange=fa784f0160f0f5c26714eef795a11fe41454e53f&tochange=4c3b
> 5f9bd44a64ff81a49653881c5415b088a504
> 
> Regressed by: bug 1262898

For what it's worth, if I look at the history of these crashes on mozilla-central nightlies:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2015-10-01&signature=shutdownhang+%7C+WaitForSingleObjectEx+%7C+PR_WaitCondVar+%7C+mozilla%3A%3ACondVar%3A%3AWait+%7C+nsEventQueue%3A%3AGetEvent#aggregations
I see two regression points.  They started in build 20160325085113 (although there wasn't quite a report every day, so it's not quite a reliable one-day range), and then got a lot worse around a month later (the above quoted regression range).


The one-day window for the first regression would be:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6202ade0e6d688ffb67932398e56cfc6fa04ceb3&tochange=d5f3da0cfe7ccf846c354014c9b059fad6ba0de5
but a larger 2 day window would be:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=efe7d026ac641759838dd3897c37892e37e5b244&tochange=d5f3da0cfe7ccf846c354014c9b059fad6ba0de5

(I wonder if bug 1258479 is possible?)
Milan, your comment 37 here indicates we're waiting on debugging features before we fix this. This is currently an M9 though and is blocking e10s. It's also the #1 top crash on aurora right now. Do we need to prioritize bug 1270172? It's currently tracking plus P3.
Flags: needinfo?(milan)
We anticipate that the work for this is in bug 1264694, and we will try to get somewhere without bug 1270172, but we really could use that information.  Right now it's been more than a week of actively looking at this problem with limited success.
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #40)
> We anticipate that the work for this is in bug 1264694, and we will try to
> get somewhere without bug 1270172, but we really could use that information.
> Right now it's been more than a week of actively looking at this problem
> with limited success.

For the record, at this point we've verified bug 1264694 is a media issue which most likely -not- the issue we're seeing here (for a number of reasons).

The only way I've been able to reproduce this issue locally is simply create a load of work for the child process that simply takes longer than the shutdown kill timer (i.e. just load 40 complicated to load tabs and close the browser). This is perfectly possible, but it's hard to know if it's in any way related to what these users are having their child processes hung in. For what it's worth, if this -is- the cause, we should see bug 1262898 causing this -signature- to regress, but not overall crashiness, as previously these users would also have been hanging, just with a different stack trace. (I'll leave it to people more adapt at analyzing crash stats to determine if this is true this).
(In reply to Bas Schouten (:bas.schouten) from comment #41)
> (In reply to Milan Sreckovic [:milan] from comment #40)
> > We anticipate that the work for this is in bug 1264694, and we will try to
> > get somewhere without bug 1270172, but we really could use that information.
> > Right now it's been more than a week of actively looking at this problem
> > with limited success.
> 
> For the record, at this point we've verified bug 1264694 is a media issue
> which most likely -not- the issue we're seeing here (for a number of
> reasons).
> 
> The only way I've been able to reproduce this issue locally is simply create
> a load of work for the child process that simply takes longer than the
> shutdown kill timer (i.e. just load 40 complicated to load tabs and close
> the browser). This is perfectly possible, but it's hard to know if it's in
> any way related to what these users are having their child processes hung
> in. For what it's worth, if this -is- the cause, we should see bug 1262898
> causing this -signature- to regress, but not overall crashiness, as
> previously these users would also have been hanging, just with a different
> stack trace. (I'll leave it to people more adapt at analyzing crash stats to
> determine if this is true this).

Which shutdown kill timer? The main process hang monitor or the ContentParent force kill timer?

Looking at the signatures here it looks like the hang is in CompositorBridgeParent::ShutDown() during XPCOM destruction. This happen long after the last ContentParent is destroyed and the associated ContentChild process is gone. Or are there other signatures lurking in crashreports that I'm missing?
Whiteboard: ShutDownKill [gfx-noted] → [gfx-noted]
Depends on: 1272911
Flags: needinfo?(bas)
Duplicate of this bug: 1273514
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] → [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownha…
(In reply to Jim Mathies [:jimm] from comment #42)
> (In reply to Bas Schouten (:bas.schouten) from comment #41)
> > (In reply to Milan Sreckovic [:milan] from comment #40)
> > > We anticipate that the work for this is in bug 1264694, and we will try to
> > > get somewhere without bug 1270172, but we really could use that information.
> > > Right now it's been more than a week of actively looking at this problem
> > > with limited success.
> > 
> > For the record, at this point we've verified bug 1264694 is a media issue
> > which most likely -not- the issue we're seeing here (for a number of
> > reasons).
> > 
> > The only way I've been able to reproduce this issue locally is simply create
> > a load of work for the child process that simply takes longer than the
> > shutdown kill timer (i.e. just load 40 complicated to load tabs and close
> > the browser). This is perfectly possible, but it's hard to know if it's in
> > any way related to what these users are having their child processes hung
> > in. For what it's worth, if this -is- the cause, we should see bug 1262898
> > causing this -signature- to regress, but not overall crashiness, as
> > previously these users would also have been hanging, just with a different
> > stack trace. (I'll leave it to people more adapt at analyzing crash stats to
> > determine if this is true this).
> 
> Which shutdown kill timer? The main process hang monitor or the
> ContentParent force kill timer?
> 
> Looking at the signatures here it looks like the hang is in
> CompositorBridgeParent::ShutDown() during XPCOM destruction. This happen
> long after the last ContentParent is destroyed and the associated
> ContentChild process is gone. Or are there other signatures lurking in
> crashreports that I'm missing?

This happens after the ContentParent is destroyed, but -not- after the child process is gone. Since the CompositorBridge and TextureBridge are still alive and we'll leave the child alive until those are shutdown.

So no, the child process is -not- shutdown yet when we get to this, we are waiting for it to shutdown.
Flags: needinfo?(bas)
(In reply to Jonathan Howard from comment #45)
> From what I can tell (if code below is run); The child(ren) should be killed
> if this lasts more than 5 seconds, if not exited normally in time.
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749
> https://dxr.mozilla.org/mozilla-central/rev/
> a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108
> https://dxr.mozilla.org/mozilla-central/rev/
> a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864

I don't think this happens when the ContentParent is successfully shutdown, which in this case I believe it has. But last I checked was two weeks ago, I will check again (two weeks ago I could reproduce trivially by just putting an if (XRE_IsContentProcess()) { Sleep(70000); } in XPCOM shutdown.
(In reply to Jonathan Howard from comment #45)
> From what I can tell (if code below is run); The child(ren) should be killed
> if this lasts more than 5 seconds, if not exited normally in time.
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749
> https://dxr.mozilla.org/mozilla-central/rev/
> a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108
> https://dxr.mozilla.org/mozilla-central/rev/
> a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864

So, currently on trunk, yes, it seems like the children are killed very hard and fast indeed. I don't see this crash occurring after the 12th anymore either. Presumably something landed that solidly kills the child now, doesn't appear like it's sending crash reports for those child processes though.
(In reply to Bas Schouten (:bas.schouten) from comment #47)
> (In reply to Jonathan Howard from comment #45)
> > From what I can tell (if code below is run); The child(ren) should be killed
> > if this lasts more than 5 seconds, if not exited normally in time.
> > 
> > https://dxr.mozilla.org/mozilla-central/rev/
> > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749
> > https://dxr.mozilla.org/mozilla-central/rev/
> > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108
> > https://dxr.mozilla.org/mozilla-central/rev/
> > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864
> 
> So, currently on trunk, yes, it seems like the children are killed very hard
> and fast indeed. I don't see this crash occurring after the 12th anymore
> either. Presumably something landed that solidly kills the child now,
> doesn't appear like it's sending crash reports for those child processes
> though.

Not much here that looks suspect - 

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=674a552743785c28c75866969aad513bd8eaf6ae&tochange=c3f5e6079284a7b7053c41f05d0fe06ff031db03
Windows Nightly crash symbolification broke about a week ago. Maybe that's related to there not being any signatures any more...
Depends on: 1264694
(In reply to Bas Schouten (:bas.schouten) from comment #47)
> (In reply to Jonathan Howard from comment #45)
> > From what I can tell (if code below is run); The child(ren) should be killed
> > if this lasts more than 5 seconds, if not exited normally in time.
> > 
> > https://dxr.mozilla.org/mozilla-central/rev/
> > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749
> > https://dxr.mozilla.org/mozilla-central/rev/
> > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108
> > https://dxr.mozilla.org/mozilla-central/rev/
> > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864
> 
> So, currently on trunk, yes, it seems like the children are killed very hard
> and fast indeed. I don't see this crash occurring after the 12th anymore
> either. Presumably something landed that solidly kills the child now,
> doesn't appear like it's sending crash reports for those child processes
> though.

In my nightly build we currently shut the child process down using QuickExit here - 

http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2247

XPCOM shutdown won't get called. 

The series of events in each process look like:

mozilla::dom::ContentParent::ShutDownProcess
mozilla::dom::ContentChild::RecvShutdown
mozilla::dom::ContentParent::RecvFinishShutdown
mozilla::dom::ContentParent::ShutDownProcess
mozilla::dom::ContentParent::ActorDestroy
mozilla::dom::ContentChild::ActorDestroy
^ child process terminated
mozilla::dom::ContentParent::ShutDownProcess
mozilla::dom::ContentParent::ShutDownMessageManager
mozilla::dom::ContentParent::~ContentParent

However we're still seeing hangs in Nightly in CompositorBridgeParent::ShutDown().

https://crash-stats.mozilla.com/search/?product=Firefox&version=49.0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang+%7C+WaitForSingleObjectEx+%7C+WaitForSingleObject+%7C+PR_WaitCondVar+%7C+mozilla%3A%3ACondVar%3A%3AWait+%7C+nsEventQueue%3A%3AGetEvent&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

https://crash-stats.mozilla.com/report/index/87de9e85-93c0-4100-86bf-c3d332160517#allthreads

Bas, any thoughts here?
Flags: needinfo?(bas)
(In reply to Jim Mathies [:jimm] from comment #52)
> Here's a fixed crashstats link, not sure what's going wrong on crash stats
> with permalinks - 
> 
> https://crash-stats.mozilla.com/search/?product=Firefox&version=49.
> 0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang%20%7C%20WaitForSingl
> eObjectEx%20%7C%20WaitForSingleObject%20%7C%20PR_WaitCondVar%20%7C%20mozilla%
> 3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent&_facets=signature&_
> columns=date&_columns=signature&_columns=product&_columns=version&_columns=bu
> ild_id&_columns=platform#crash-reports

There's a mixture of main thread stacks here. Some of which are gfx shutdown related. I'll try to generate a breakdown, and go back and check the other signatures as well. It's possible something did land that improved this.
60% of these are gfx shutdown related, about 12% are necko related, and the rest are a mixed bag of weird shutdown hangs.

At 60% the gfx portion of this is still in the top 10 on nightly, so I think we still have some work to do here.

(I'm searching over the last three days for these stats.)
Crash Signature: shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang… → shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | NS_ProcessNextE…
(In reply to Jim Mathies [:jimm] from comment #55)
> Here's a breakdown, open the proto signature tab:
> 
> https://crash-stats.mozilla.com/search/?product=Firefox&version=49.
> 0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang%20%7C%20WaitForSingl
> eObjectEx%20%7C%20WaitForSingleObject%20%7C%20PR_WaitCondVar%20%7C%20mozilla%
> 3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent&_facets=signature&_
> facets=proto_signature&_columns=date&_columns=signature&_columns=product&_col
> umns=version&_columns=build_id&_columns=platform#facet-proto_signature

So theoretically somehow the child process is still being kept alive. So the one user I have that can reproduce the problem, did indeed still have his child process running when we got shutdown killed. I've asked him to create another video with the latest nightly to see if he can still reproduce the issue of the child process being frozen.
Flags: needinfo?(bas)
(In reply to Jim Mathies [:jimm] from comment #51)
> (In reply to Bas Schouten (:bas.schouten) from comment #47)
> > (In reply to Jonathan Howard from comment #45)
> > > From what I can tell (if code below is run); The child(ren) should be killed
> > > if this lasts more than 5 seconds, if not exited normally in time.
> > > 
> > > https://dxr.mozilla.org/mozilla-central/rev/
> > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749
> > > https://dxr.mozilla.org/mozilla-central/rev/
> > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108
> > > https://dxr.mozilla.org/mozilla-central/rev/
> > > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864
> > 
> > So, currently on trunk, yes, it seems like the children are killed very hard
> > and fast indeed. I don't see this crash occurring after the 12th anymore
> > either. Presumably something landed that solidly kills the child now,
> > doesn't appear like it's sending crash reports for those child processes
> > though.
> 
> In my nightly build we currently shut the child process down using QuickExit
> here - 
> 
> http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2247
> 
> XPCOM shutdown won't get called. 
> 
> The series of events in each process look like:
> 
> mozilla::dom::ContentParent::ShutDownProcess
> mozilla::dom::ContentChild::RecvShutdown
> mozilla::dom::ContentParent::RecvFinishShutdown
> mozilla::dom::ContentParent::ShutDownProcess
> mozilla::dom::ContentParent::ActorDestroy
> mozilla::dom::ContentChild::ActorDestroy
> ^ child process terminated
> mozilla::dom::ContentParent::ShutDownProcess
> mozilla::dom::ContentParent::ShutDownMessageManager
> mozilla::dom::ContentParent::~ContentParent
> 
> However we're still seeing hangs in Nightly in
> CompositorBridgeParent::ShutDown().
> 
> https://crash-stats.mozilla.com/search/?product=Firefox&version=49.
> 0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang+%7C+WaitForSingleObj
> ectEx+%7C+WaitForSingleObject+%7C+PR_WaitCondVar+%7C+mozilla%3A%3ACondVar%3A%
> 3AWait+%7C+nsEventQueue%3A%3AGetEvent&_facets=signature&_columns=date&_column
> s=signature&_columns=product&_columns=version&_columns=build_id&_columns=plat
> form#crash-reports
> 
> https://crash-stats.mozilla.com/report/index/87de9e85-93c0-4100-86bf-
> c3d332160517#allthreads
> 
> Bas, any thoughts here?

I must note, although it shouldn't be the cause in this bug.. there's some danger in not going through XPCOM shutdown.. as far as I know we don't do any automated leak testing for the parent process where the child process doesn't properly clean up after itself.

If this is what we're going to do it's probably best if we test this configuration for leaks in the parent, somehow.
(In reply to Bas Schouten (:bas.schouten) from comment #56)
> (In reply to Jim Mathies [:jimm] from comment #55)
> > Here's a breakdown, open the proto signature tab:
> > 
> > https://crash-stats.mozilla.com/search/?product=Firefox&version=49.
> > 0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang%20%7C%20WaitForSingl
> > eObjectEx%20%7C%20WaitForSingleObject%20%7C%20PR_WaitCondVar%20%7C%20mozilla%
> > 3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent&_facets=signature&_
> > facets=proto_signature&_columns=date&_columns=signature&_columns=product&_col
> > umns=version&_columns=build_id&_columns=platform#facet-proto_signature
> 
> So theoretically somehow the child process is still being kept alive. So the
> one user I have that can reproduce the problem, did indeed still have his
> child process running when we got shutdown killed. I've asked him to create
> another video with the latest nightly to see if he can still reproduce the
> issue of the child process being frozen.

I should add to that there's one other option, namely inside IPDL ActorDestroy not being properly called for an 'unexpectedly closing channel' which is what the QuickExit child termination will likely cause. In any case, we'll see if the child process stays alive for the user who had this issue.
(In reply to Jim Mathies [:jimm] from comment #51)
> (In reply to Bas Schouten (:bas.schouten) from comment #47)
> > (In reply to Jonathan Howard from comment #45)
> > > From what I can tell (if code below is run); The child(ren) should be killed
> > > if this lasts more than 5 seconds, if not exited normally in time.
> > > 
> > > https://dxr.mozilla.org/mozilla-central/rev/
> > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749
> > > https://dxr.mozilla.org/mozilla-central/rev/
> > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108
> > > https://dxr.mozilla.org/mozilla-central/rev/
> > > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864
> > 
> > So, currently on trunk, yes, it seems like the children are killed very hard
> > and fast indeed. I don't see this crash occurring after the 12th anymore
> > either. Presumably something landed that solidly kills the child now,
> > doesn't appear like it's sending crash reports for those child processes
> > though.
> 
> In my nightly build we currently shut the child process down using QuickExit
> here - 
> 
> http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2247
> 
> XPCOM shutdown won't get called. 

Interestingly another note on this, is that if XPCOM shutdown isn't run and we just roughly exit.. bug 1262898 basically becomes a no-op as far as CompositorBridgeParent::Shutdown() in the parent is concerned (I did change when GeckoChildProcessHost gets destroyed but I don't really see how that could change when CompositorBridgeParent::Shutdown() gets called).. which makes the regression somewhat odd.
I've been able to reproduce this with a sleep statement in ContentChild::ActorDestroy. This simulates a long, slow teardown in PContentChild::OnChannelClose(). ContentParent shuts down cleanly such that the parent side stack matches up to what we're seeing on crashstats. The browser main thread gets caught waiting on gfx ipc channels to shutdown but the core problem isn't related to gfx.

Bas has a contributor who captured a video of this happening - 

https://dl.dropboxusercontent.com/u/95157096/85f61cf7/hp9zrgzdlw.mp4

The interesting thing is is that in order to get into a situation where ContentParent is completely torn down ContentParent needs to receive an acknowledgment from the child that it knows shutdown is taking place. Here's the sequence:

mozilla::dom::ContentParent::ShutDownProcess
  ContentParent calls SendShutdown()
mozilla::dom::ContentChild::RecvShutdown
mozilla::dom::ContentParent::RecvFinishShutdown
mozilla::dom::ContentParent::ShutDownProcess
  ContentParent calls Close()
mozilla::dom::ContentParent::ActorDestroy

The ContentParent call to Close() on MessageChannel sends an async message to the child that kicks off ContentChild destruction. This is where the child gets hung up, but ContentParent is free to tear after this point.

I think we can fix this by creating a timeout in ContentChild::RecvShutdown that terminates the child after a set period. I'm wondering though if this should run on a non-main thread, if gecko events aren't getting delivered, a main thread timer may not fire.

We do not know yet what's is holding up child teardown. bug 1270172 might help.
Bill, any thought on the best way to work around this?
Flags: needinfo?(wmccloskey)
Flags: needinfo?(wmccloskey)
Attachment #8754942 - Flags: review?(wmccloskey)
https://crash-stats.mozilla.com/report/index/96b0184c-ee33-48fa-b2fe-5af732160522#allthreads
and 
https://crash-stats.mozilla.com/report/index/c7392e05-8479-4fc6-896d-00f2c2160522#allthreads

are two out of three shutdown crashes I had on update in the past three days.

Also happens on non-update shutdown, https://crash-stats.mozilla.com/report/index/bee53a5a-6810-4879-bc06-47e922160522#allthreads.

I guess something between the Nightly of the 18th and 19th made this shutdown hang show up at 100% for me.
(In reply to Axel Hecht [:Pike] from comment #64)
> https://crash-stats.mozilla.com/report/index/96b0184c-ee33-48fa-b2fe-
> 5af732160522#allthreads
> and 
> https://crash-stats.mozilla.com/report/index/c7392e05-8479-4fc6-896d-
> 00f2c2160522#allthreads
> 
> are two out of three shutdown crashes I had on update in the past three days.
> 
> Also happens on non-update shutdown,
> https://crash-stats.mozilla.com/report/index/bee53a5a-6810-4879-bc06-
> 47e922160522#allthreads.
> 
> I guess something between the Nightly of the 18th and 19th made this
> shutdown hang show up at 100% for me.

Could you try breaking the child process and getting us a stack for that?
As far as underlying causes are concerned, it should be noted bug 1264694 seems to have fixed a try server error similar to this (bug 1268332). Indeed it seems that since that landed this crash has gone down somewhat (although it's tricky to be sure, it's definitely still happened). But some more time should get us some more data.
Comment on attachment 8754942 [details] [diff] [review]
child side shutdown timer

Review of attachment 8754942 [details] [diff] [review]:
-----------------------------------------------------------------

Whatever we do here should be in the parent. As you said, the timer may never run if the child is really bogged down. A thread would be better, but doing it in the parent would be best.

Bug 1262898 appears to have regressed this. Before that patch, I believe that ProcessWatcher::EnsureProcessTerminated should have killed the child process after 2 seconds. That happened from the GeckoChildProcessHost destructor. With that patch, we don't run that destructor until after we wait for layers to shut down, which can take a long time.

Bas, in bug 1262898 comment 1, you said that "GeckoChildProcessHost deletion will block the IO thread". I can see that we'll block when shutting down the I/O thread, but it seems like that happens much later in shutdown than when we shutdown layers stuff. I must be misunderstanding something. Maybe we could meet to talk about this tomorrow?
Attachment #8754942 - Flags: review?(wmccloskey) → review-
Needinfo to Bas; see above comment.
Flags: needinfo?(bas)
(In reply to Bill McCloskey (:billm) from comment #67)
> Comment on attachment 8754942 [details] [diff] [review]
> child side shutdown timer
> 
> Review of attachment 8754942 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Whatever we do here should be in the parent. As you said, the timer may
> never run if the child is really bogged down. A thread would be better, but
> doing it in the parent would be best.
> 
> Bug 1262898 appears to have regressed this. Before that patch, I believe
> that ProcessWatcher::EnsureProcessTerminated should have killed the child
> process after 2 seconds. That happened from the GeckoChildProcessHost
> destructor. With that patch, we don't run that destructor until after we
> wait for layers to shut down, which can take a long time.
> 
> Bas, in bug 1262898 comment 1, you said that "GeckoChildProcessHost deletion
> will block the IO thread". I can see that we'll block when shutting down the
> I/O thread, but it seems like that happens much later in shutdown than when
> we shutdown layers stuff. I must be misunderstanding something. Maybe we
> could meet to talk about this tomorrow?

Yes, layers shutdown occurs before GeckoChildProcessHost destruction, it has to. If we need the timer in the parent we'll somehow have to initiate it before layers shutdown, it can then force kill it off the main thread in the parent if it takes too long.

What the patches in bug 1262898 do is wait with killing the GeckoChildProcessHost until all IPDL actors are destroyed (It used to be scheduled for destruction as soon as ContentParent::ActorDestroy ran). Before that we were just relying on killing the parent and never cleaning up IPDL actors that weren't on the main thread properly (they would just get destroyed on the child process terminating).

It might be hard to find a time tomorrow. But Wednesday I should be fine most of the day.
Flags: needinfo?(bas)
I did a shutdown, and sampled the chrome and the content process.

Note, neither process uses cpu significantly during the hang, and the child process never dies.

I actually realized last night that I've got 4 process from previous hangs still running, after OSX complained about the lack of memory.

This is the chrome process sample.
... and the sample of the content process.
(In reply to Axel Hecht [:Pike] from comment #70)
> Created attachment 8755802 [details]
> Sample of Nightly Chrome.txt
> 
> I did a shutdown, and sampled the chrome and the content process.
> 
> Note, neither process uses cpu significantly during the hang, and the child
> process never dies.
> 
> I actually realized last night that I've got 4 process from previous hangs
> still running, after OSX complained about the lack of memory.
> 
> This is the chrome process sample.

Can you break the child process so we just have call stacks for all the threads?
Attached file bt all in content
So this depends on profile, too. With the temporary debug profile, everything is fine.

This is `bt all` from my local build, with my usual profile.
See bug 1275524. Spiking of this signature may also be related to AMD driver updates.
(In reply to Axel Hecht [:Pike] from comment #74)
> Created attachment 8756368 [details]
> bt all in content
> 
> So this depends on profile, too. With the temporary debug profile,
> everything is fine.
> 
> This is `bt all` from my local build, with my usual profile.

Hmm... this child doesn't even appear to be in the process of shutting down.
Attached patch QuickExit sooner (obsolete) — Splinter Review
This patch tries to fix the bug that Jim identified in comment 60. We still don't know if that's the real issue here, but this patch should fix it. The bad sequence Jim found is as follows:

1. Parent sends Shutdown message.
2. Child receives this message, does some cleanup, and sends FinishShutdown.
3. Parent receives FinishShutdown, cancels the KillHard timer, and Closes the IPC channel.
4. Child gets an ActorDestroy notification when the channel is closed and does QuickExit.

In between steps 3 and 4, the child could decide to do more work in its event loop that could block for a long time. The parent's KillHard timer has already been canceled at that point.

This patch avoids that by moving the QuickExit to the end of RecvShutdown. The QuickExit happens on the I/O thread to ensure that it happens after any messages sent by RecvShutdown are actually sent.

I do think we should restore the way that EnsureProcessTerminated was called before bug 1262898. However, Bas pointed out to me today that this function is blocking on Windows. That's bad for a lot of reasons. I'll look into that issue and hopefully we can fix it separately.
Attachment #8756629 - Flags: review?(dvander)
Attachment #8756629 - Flags: review?(dvander) → review+
(In reply to Bill McCloskey (:billm) from comment #77)
> This patch avoids that by moving the QuickExit to the end of RecvShutdown.
> The QuickExit happens on the I/O thread to ensure that it happens after any
> messages sent by RecvShutdown are actually sent.

One concern I had with an earl(ier) termination was a loss of telemetry data. My understanding is that we send this data at content process shutdown, not sure when though.
Depends on: 1274725
Blocks: 1276330
Blocks: 1274725
No longer depends on: 1274725
No longer blocks: 1274725
Duplicate of this bug: 1274725
Here's an update on this bug:

The patch I posted won't work on Windows. I looked into how named pipes handle the termination of one side. It looks like any data remaining in the pipe won't necessarily be received by the other side. So I don't think that patch is safe.

I was able to get a better idea of what's going on, based on the STR in bug 1274725. One open question was how we could end up hanging for so long in the child. It turns out it can happen as follows:

1. Child has an "unload" handler that sends a sync XHR. This happens off of TabChild::RecvDestroy. It spins a nested event loop.
2. The child receives the PContent Shutdown message from within this nested event loop and responds to it.
3. The parent Close()s the PContent IPC channel, which normally would trigger ActorDestroy in the child and cause it to die.
4. However, we will never invoke ActorDestroy when there is IPDL code on the stack. In this case there is: the TabChild::RecvDestroy handler is still running. So we keep postponing running ActorDestroy.
5. Since the parent is no longer connected to the child, there's no chance for the sync XHR to finish, so the child spins forever. ActorDestroy never runs, and it never exits.
6. Consequently, the parent sits forever waiting for the compositor to shut down.

My preferred solution here is to make EnsureProcessTerminated be non-blocking on Windows and back out the patch for bug 1262898. However, I'm having trouble getting that through try server. If I can't make it work in the next day or so, I'll prepare some sort of patch that calls EnsureProcessTerminated only in opt builds. That should work around the testing issues I think.
(In reply to Bill McCloskey (:billm) from comment #80)
> My preferred solution here is to make EnsureProcessTerminated be
> non-blocking on Windows and back out the patch for bug 1262898. However, I'm
> having trouble getting that through try server. If I can't make it work in
> the next day or so, I'll prepare some sort of patch that calls
> EnsureProcessTerminated only in opt builds. That should work around the
> testing issues I think.

How does making EnsureProcessTerminated be non-blocking fix the case of the hung child process?

(If we're hung spinning an event loop in the child, we could use that shutdown kill timer patch.)
Would running the timer's callback (mForceKillTimer) in ContentParent destructor be a reasonably quick fix? (As opposed to alternative of having it live longer than ContentParent.)

(Assumes the child hang then is safe left to just join the list of other IPCError-browser | ShutDownKill )
Hrm, I'm still confused why the correct solution isn't to have a separate watchdog for the child started on its own thread. Which just gives the child <x> amount of time and then shuts it down. It seems like that would be the easiest way to guarantee the child always dies within a certain amount of time after a shutdown is sent. And in cases of a child that isn't hanging it allows all of shutdown to occur neatly.
Comment on attachment 8754942 [details] [diff] [review]
child side shutdown timer

OK, let's land this as a stopgap and backport it to Aurora. It looks like it will probably fix most of these problems. I'll continue to work on getting EnsureProcessTerminated to work the way I want.
Attachment #8754942 - Flags: review- → review+
The reason I like the EnsureProcessTerminated solution is that it works no matter what craziness is happening in the child: it could be in an infinite loop or exploited by hackers and EnsureProcessTerminated will still kill it. In addition, the parent needs to call EnsureProcessTerminated regardless of what we do here, so we might as well call it at the right time so that it kills the child when necessary.

I've found one bug so far that was causing some tryserver issues. We seem to be killing the child in DEBUG builds from the ChildProcessHost destructor, which is bad. Hopefully that will resolve all the issues I'm seeing.
(In reply to Bill McCloskey (:billm) from comment #84)
> Comment on attachment 8754942 [details] [diff] [review]
> child side shutdown timer
> 
> OK, let's land this as a stopgap and backport it to Aurora. It looks like it
> will probably fix most of these problems. I'll continue to work on getting
> EnsureProcessTerminated to work the way I want.

Sounds good, I'll land it.

Is there a content data loss risk here? Sounds like it - the data these sites are send in that sync xhr request might be lost. I think we've discussed issues with sync xhr on tab close in the past but I don't remember our decision on it.
Keywords: leave-open
Pushed by jmathies@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3723ba4d368b
Add a safety shutdown timer in ContentChild. r=billm
Attachment #8756629 - Attachment is obsolete: true
Here's a summary of this bug.

When we invoke the GeckoChildProcessHost destructor, it calls
EnsureProcessTerminated. In opt builds, that function kills the child
after 2 seconds. In debug builds, it waits forever.

Before bug 1262898, we deleted the GeckoChildProcessHost in
ContentParent::ActorDestroy. That caused an issue because:

1. On Windows, in debug builds, EnsureProcessTerminated synchronously
waits on the I/O thread for the child to die.
2. ContentParent::ActorDestroy happens before
ShutdownLayersIPC(). When the main thread runs ShutdownLayersIPC(),
the parent spins waiting for the child to tell it that layers have
shut down. However, it will never get that message because the I/O
thread is synchronously blocked waiting for the child to die.

Bug 1262898 fixed that problem by deleting GeckoChildProcessHost
later--after ShutdownLayersIPC. However, this means that the 2 second
kill timer in opt builds won't be started until after
ShutdownLayersIPC synchronously waits for the compositor to shut
down. If the child is stuck for some reason, it can wait forever and
we have no kill timer.

My approach is to go back to deleting the GeckoChildProcessHost from
ContentParent::ActorDestroy. However, I'm going to make
EnsureProcessTerminated be non-blocking on Windows. (It's already
non-blocking on POSIX platforms.) This patch mostly just copies the
POSIX EnsureProcessTerminated implementation to Windows but uses Win32
calls instead of POSIX.
Assignee: bas → wmccloskey
Status: NEW → ASSIGNED
Attachment #8759401 - Flags: review?(dvander)
It turns out that we actually have two EnsureProcessTerminated calls. One happens in ~GeckoChildProcessHost and the other in ~ChildProcessHost. Since GeckoChildProcessHost inherits from ChildProcessHost, they both run. Due to some other weirdness, the second one doesn't run on POSIX platforms because handle() is null. Additionally, the second call doesn't distinguish between DEBUG and opt builds.

This patch removes the second call.
Attachment #8759402 - Flags: review?(dvander)
This patch reverts things so we destroy the GeckoChildProcessHost in ContentParent::ActorDestroy.
Attachment #8759403 - Flags: review?(dvander)
In working on this I found a ton of dead code. I'll remove that in bug 1277705.
Comment on attachment 8754942 [details] [diff] [review]
child side shutdown timer

Approval Request Comment
[Feature/regressing bug #]:
bug 1264694
[User impact if declined]:
hanging child processes on shutdown
[Describe test coverage new/current, TreeHerder]:
on mc 6-3 build, crash stats results look good
[Risks and why]: 
low
[String/UUID change made/needed]:
none
Attachment #8754942 - Flags: approval-mozilla-aurora?
removing a duplicate signature
Crash Signature: shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | NS_ProcessNextE… → shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | NS_ProcessNextEvent | mozilla::layers::CompositorBridgeParent::ShutDown]
Besides all the issues with kill timers not working here, the way we get to these stuck content processes is via a nested event loop. See the numbered list in comment 80 for a description of how that happens.

I think a reasonable way to handle this is to postpone handling the Shutdown message until we're not in a nested event loop. This patch does that (although a bit hackily--perhaps there's a better way?).

I also considered making the guarantee that, as long as a protocol never uses RPC messages, we'll never run the handler for a normal priority IPDL message when another handler is already on the stack. This would be a bit tricky because MessageChannel doesn't know whether the protocol uses RPC. I also wonder if it would break things. The only message that I know of that we expect to spin a nested event loop is CreateWindow (so it can create a new XUL window), and I think it would be fine not to process any other messages during that time. But it's still a bit risky.

Another approach would be to take this patch and then make the larger change (from the previous paragraph) in a separate bug. I think we might want to backport this patch since it sorta affects web compat (since we're failing to do these sync XHRs). The larger change we probably shouldn't backport.
Attachment #8759907 - Flags: review?(benjamin)
Comment on attachment 8754942 [details] [diff] [review]
child side shutdown timer

Fix a shutdown hang, taking it
Attachment #8754942 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Moving remaining work over to tracking+ p1.
Priority: -- → P1
No longer blocks: 1276330
Duplicate of this bug: 1276330
Blocks: 1277582, 1277961
I started having shutdown problems on Nightly (OS X), about at the same time this fix landed on m-c, not sure if it's related.

It happens on both quitting and restarting for an update: the Nightly Web Content process goes up to 100% CPU, Nightly is marked as not responding. I have to force quite the Content process to actually quit Nightly. No request to send crash reports after I restart, no crashes in about:crashes

I tried with a clean profile and it doesn't happen. Any suggestion on how to debug this to file a useful bug?
Comment on attachment 8759907 [details] [diff] [review]
wait for nested event loops to finish at shutdown

This is really ugly, but it's a reasonable short-term bandaid. Although I think we should file/consider a followup world where we either manage to remove this method altogether, or force-break out of any nested event loops in this kind of situation.

My understanding is that this will be pretty common if there are tab-modal alert()s active during shutdown.
Attachment #8759907 - Flags: review?(benjamin) → review+
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9c34a1f781d9
Wait to finish nested event loops before shutdown (r=bsmedberg)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #101)
> Comment on attachment 8759907 [details] [diff] [review]
> wait for nested event loops to finish at shutdown
> 
> This is really ugly, but it's a reasonable short-term bandaid. Although I
> think we should file/consider a followup world where we either manage to
> remove this method altogether, or force-break out of any nested event loops
> in this kind of situation.

Well, it's a bit tricky. This seems to be appearing as a sync XHR from an unload event during shutdown. That's actually something we're supposed to support, as far as I know. It looks like people are supposed to move towards navigator.sendBeacon. I have no idea if that even works at all during shutdown (with or without e10s). It seems like it won't.

> My understanding is that this will be pretty common if there are tab-modal
> alert()s active during shutdown.

Not sure about alerts, but sync XHRs seem to be really common.
Comment on attachment 8759401 [details] [diff] [review]
make EnsureProcessTerminated non-blocking on Windows

Review of attachment 8759401 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/chromium/src/chrome/common/process_watcher_win.cc
@@ +42,5 @@
> +      base::CloseProcessHandle(process_);
> +      process_ = 0;
> +
> +      // XXX don't think this is necessary, since destruction can only
> +      // be observed once, but can't hurt

nit: Remove the XXX comment, other watchers seem to just call this so might as well follow suit or delete them all.
Attachment #8759401 - Flags: review?(dvander) → review+
Attachment #8759402 - Flags: review?(dvander) → review+
Comment on attachment 8759403 [details] [diff] [review]
back out bug 1262898

Review of attachment 8759403 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorBridgeParent.cpp
@@ +848,5 @@
>  void
>  CompositorBridgeParent::ActorDestroy(ActorDestroyReason why)
>  {
> +  if (why != NormalShutdown) {
> +    printf("ActorDestroy compositor abnormal\n");

nit: debug printf

::: ipc/chromium/src/chrome/common/process_watcher_win.cc
@@ +115,4 @@
>    if (force) {
>      RefPtr<mozilla::Runnable> task = new ChildReaper(process, force);
> +    loop->PostDelayedTask(task.forget(),
> +                          kWaitInterval);

nit: Looks like this can be one line now.
Attachment #8759403 - Flags: review?(dvander) → review+
That is a different issue. We're taking too long trying to shutdown a networking thread (thread #6). I don't see an existing bug for it, so please file one under Necko.
Comment on attachment 8759907 [details] [diff] [review]
wait for nested event loops to finish at shutdown

Approval Request Comment
[Feature/regressing bug #]: e10s
[User impact if declined]: data loss if a page tries to save content using a sync XHR, content process could take longer than expected to shut down
[Describe test coverage new/current, TreeHerder]: on m-c for over a week
[Risks and why]: There's some risk here, but the patch is intentionally as simple as possible so that it can be backported. At worst, I think we could end up with more content process shutdown hangs.
[String/UUID change made/needed]: None
Attachment #8759907 - Flags: approval-mozilla-beta?
Attachment #8759907 - Flags: approval-mozilla-aurora?
Comment on attachment 8759907 [details] [diff] [review]
wait for nested event loops to finish at shutdown

Review of attachment 8759907 [details] [diff] [review]:
-----------------------------------------------------------------

This might fix the shutdown hang issue. Take it in beta & aurora.
Attachment #8759907 - Flags: approval-mozilla-beta?
Attachment #8759907 - Flags: approval-mozilla-beta+
Attachment #8759907 - Flags: approval-mozilla-aurora?
Attachment #8759907 - Flags: approval-mozilla-aurora+
Hi Bill & Milan,
Can we follow up after uplifting to beta & aurora and see what will happen based on comment #109?
Flags: needinfo?(wmccloskey)
Flags: needinfo?(milan)
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9fb1b2d90690
Make EnsureProcessTerminated non-blocking on Windows (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/7f153687ee15
Remove extra EnsureProcessTerminated (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/a736ae904fda
Go back to ContentParent controlling process shutdown (i.e., backout bug 1262898) (r=dvander)
Flags: needinfo?(wmccloskey)
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/a736ae904fda
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Depends on: 1264073
Setting status-firefox48 back to affected so the new patch shows up on the uplift radar.

Bill, was only the one patch supposed to land on Aurora or were all 3?
Flags: needinfo?(wmccloskey)
The only patches that need backporting are "wait for nested event loops to finish at shutdown" and "child side shutdown timer". They should both be backported as far as possible. It looks like the "wait for nested event loops" patch hasn't landed on beta. Is there a reason for that?
Flags: needinfo?(wmccloskey)
Per comment 116, status-firefox48 was set to fixed from previous uplifts (comment 98), so it wouldn't have shown up on the needs-uplift radar. It should now that it's been set back to affected.
Sorry about that. This just needs a very simple change:

+      MessageLoop::current()->PostDelayedTask(
+        NewRunnableMethod(this, &ContentChild::RecvShutdown), 100);

should become

+      MessageLoop::current()->PostDelayedTask(FROM_HERE,
+        NewRunnableMethod(this, &ContentChild::RecvShutdown), 100);
Flags: needinfo?(wmccloskey) → needinfo?(wkocher)
After uplift to aurora, there are still a few hundred reports per week with these signatures. 
Do you want to follow up on those remaining crashes? None of the signatures are showing open bugs filed.
Flags: needinfo?(wmccloskey)
Can you link to them? I'm not seeing anything on crash-stats.
Flags: needinfo?(wmccloskey) → needinfo?(lhenry)
I just noticed that the "Crash Signatures" field includes some other random stuff. This bug is specifically about CompositorBridgeParent::ShutDown. The other stuff seems to be related to storage and networking shutdown as far as I can tell. We should have other bugs for those things.
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] [@ shutdownha… → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | NS_ProcessNextEvent | mozilla::layers::CompositorBridgeParent::ShutDown]
Thanks Bill!  
Benjamin you mentioned possibly filing followup bugs in comment 101, is there anyone specific who should do that?  I could also look over the other crash signatures to see if they have bugs already filed.
Flags: needinfo?(lhenry) → needinfo?(benjamin)
this patch seemed to have improved firefox for android performance on beta:
https://treeherder.mozilla.org/perf.html#/alerts?id=1699
Bug 1279293 is open and we're discussing followup plans there.
Flags: needinfo?(benjamin)
Flags: needinfo?(milan)
You need to log in before you can comment on or make changes to this bug.