<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 3

•

9 years ago

I looked at about a dozen of these crashes. In almost all of them, the main thread was in mozilla::layers::CompositorBridgeParent::ShutDown(), spinning the event loop or something. This looks like it might be the top crash for the 4/27 Nightly, at least so far.

Component: Async Tooling → Graphics: Layers

Product: Toolkit → Core

Comment 4

•

9 years ago

This other signature is a minor variation, and is also common.

Keywords: topcrash

Updated

•

9 years ago

Summary: [e10s] Firefox crashes at [@ shutdownhang | WaitForSingleObjectEx | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent ] → [e10s] Shutdown hang in mozilla::layers::CompositorBridgeParent::ShutDown()

Nicholas Nethercote [inactive]

Updated

•

9 years ago

Comment 5

•

9 years ago

This is the #1 *and* #2 topcrash in Nightly 20160429030215 and Nightly 20160430030223.

blinky

Comment 6

•

9 years ago

Regression range: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=fa784f0160f0f5c26714eef795a11fe41454e53f&tochange=4c3b5f9bd44a64ff81a49653881c5415b088a504 Regressed by: bug 1262898

Nicholas Nethercote [inactive]

Comment 7

•

9 years ago

Bas, can you please take a look?

Flags: needinfo?(bas)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 8

•

9 years ago

(In reply to blinky from comment #6) > Regression range: > https://hg.mozilla.org/integration/mozilla-inbound/ > pushloghtml?fromchange=fa784f0160f0f5c26714eef795a11fe41454e53f&tochange=4c3b > 5f9bd44a64ff81a49653881c5415b088a504 > > Regressed by: bug 1262898 Ah, presumably this is another IPDL issue where under some circumstances the actor destructions aren't properly called. I'll have a look at this, in the worst case scenario we can decide to leak the Compositor Thread for now, it's better than leaking everything like we used to do. I will have a look at this. Although we'll probably need a reproducible scenario in order to find out what's going wrong in IPDL, I wonder if this happens when a content process crashes.

Assignee: nobody → bas

Flags: needinfo?(bas)

Comment hidden (offtopic)

Updated

•

9 years ago

Whiteboard: ShutDownKill → ShutDownKill [gfx-noted]

Paul Pasca[Away. Please needinfo? Paul Oiegas [:pauloiegasSV]]

Comment 11

•

9 years ago

Removing the "regressionwindow-wanted" keyword based on comment 6.

Keywords: regressionwindow-wanted

Marcia Knous [:marcia]

Updated

•

9 years ago

tracking-firefox48: ? → +

tracking-firefox49: ? → +

Milan Sreckovic [:milan] (needinfo for best results)

Comment 12

•

9 years ago

Hey Milan, this sound serious. Should we block?

Blocks: e10s-gfx

Flags: needinfo?(milan)

Comment 13

•

9 years ago

Sure, but we should confirm whether this is in 48 or just in 49. It landed right on 4/25, I don't know if the branch was made before or after it.

Flags: needinfo?(milan)

Comment 14

•

9 years ago

Bug 1268559 is in 48, per the target milestone flag set by the merge script. Just to confirm, it is in the Aurora repo: https://hg.mozilla.org/releases/mozilla-aurora/rev/4c3b5f9bd44a

Nicholas Nethercote [inactive]

Comment 15

•

9 years ago

> Bug 1268559 is in 48 Sorry, I meant bug 1262898.

Comment 16

•

9 years ago

This is the #1, #2 and #3 top crash in Nightly 20160501030217, with 75 crashes, which is a *lot*.

Comment hidden (offtopic)

Updated

•

9 years ago

Comment hidden (offtopic)

Updated

•

9 years ago

status-firefox46: affected → unaffected

status-firefox47: affected → unaffected

Comment hidden (offtopic)

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Updated

•

9 years ago

tracking-e10s: ? → m9+

Updated

•

9 years ago

No longer blocks: shutdownkill

Milan Sreckovic [:milan] (needinfo for best results)

Comment 37

•

9 years ago

See bug 1270172 comment 5

Depends on: 1270172

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 38

•

9 years ago

(In reply to blinky from comment #6) > Regression range: > https://hg.mozilla.org/integration/mozilla-inbound/ > pushloghtml?fromchange=fa784f0160f0f5c26714eef795a11fe41454e53f&tochange=4c3b > 5f9bd44a64ff81a49653881c5415b088a504 > > Regressed by: bug 1262898 For what it's worth, if I look at the history of these crashes on mozilla-central nightlies: https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2015-10-01&signature=shutdownhang+%7C+WaitForSingleObjectEx+%7C+PR_WaitCondVar+%7C+mozilla%3A%3ACondVar%3A%3AWait+%7C+nsEventQueue%3A%3AGetEvent#aggregations I see two regression points. They started in build 20160325085113 (although there wasn't quite a report every day, so it's not quite a reliable one-day range), and then got a lot worse around a month later (the above quoted regression range). The one-day window for the first regression would be: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6202ade0e6d688ffb67932398e56cfc6fa04ceb3&tochange=d5f3da0cfe7ccf846c354014c9b059fad6ba0de5 but a larger 2 day window would be: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=efe7d026ac641759838dd3897c37892e37e5b244&tochange=d5f3da0cfe7ccf846c354014c9b059fad6ba0de5 (I wonder if bug 1258479 is possible?)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 39

•

9 years ago

Milan, your comment 37 here indicates we're waiting on debugging features before we fix this. This is currently an M9 though and is blocking e10s. It's also the #1 top crash on aurora right now. Do we need to prioritize bug 1270172? It's currently tracking plus P3.

Flags: needinfo?(milan)

Comment 40

•

9 years ago

We anticipate that the work for this is in bug 1264694, and we will try to get somewhere without bug 1270172, but we really could use that information. Right now it's been more than a week of actively looking at this problem with limited success.

Flags: needinfo?(milan)

Comment 41

•

9 years ago

(In reply to Milan Sreckovic [:milan] from comment #40) > We anticipate that the work for this is in bug 1264694, and we will try to > get somewhere without bug 1270172, but we really could use that information. > Right now it's been more than a week of actively looking at this problem > with limited success. For the record, at this point we've verified bug 1264694 is a media issue which most likely -not- the issue we're seeing here (for a number of reasons). The only way I've been able to reproduce this issue locally is simply create a load of work for the child process that simply takes longer than the shutdown kill timer (i.e. just load 40 complicated to load tabs and close the browser). This is perfectly possible, but it's hard to know if it's in any way related to what these users are having their child processes hung in. For what it's worth, if this -is- the cause, we should see bug 1262898 causing this -signature- to regress, but not overall crashiness, as previously these users would also have been hanging, just with a different stack trace. (I'll leave it to people more adapt at analyzing crash stats to determine if this is true this).

Comment 42

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #41) > (In reply to Milan Sreckovic [:milan] from comment #40) > > We anticipate that the work for this is in bug 1264694, and we will try to > > get somewhere without bug 1270172, but we really could use that information. > > Right now it's been more than a week of actively looking at this problem > > with limited success. > > For the record, at this point we've verified bug 1264694 is a media issue > which most likely -not- the issue we're seeing here (for a number of > reasons). > > The only way I've been able to reproduce this issue locally is simply create > a load of work for the child process that simply takes longer than the > shutdown kill timer (i.e. just load 40 complicated to load tabs and close > the browser). This is perfectly possible, but it's hard to know if it's in > any way related to what these users are having their child processes hung > in. For what it's worth, if this -is- the cause, we should see bug 1262898 > causing this -signature- to regress, but not overall crashiness, as > previously these users would also have been hanging, just with a different > stack trace. (I'll leave it to people more adapt at analyzing crash stats to > determine if this is true this). Which shutdown kill timer? The main process hang monitor or the ContentParent force kill timer? Looking at the signatures here it looks like the hang is in CompositorBridgeParent::ShutDown() during XPCOM destruction. This happen long after the last ContentParent is destroyed and the associated ContentChild process is gone. Or are there other signatures lurking in crashreports that I'm missing?

Updated

•

9 years ago

Whiteboard: ShutDownKill [gfx-noted] → [gfx-noted]

Jonathan Howard

Updated

•

9 years ago

Depends on: 1272911

Marco Castelluccio [:marco]

Updated

•

9 years ago

Flags: needinfo?(bas)

Updated

•

9 years ago

Comment 44

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #42) > (In reply to Bas Schouten (:bas.schouten) from comment #41) > > (In reply to Milan Sreckovic [:milan] from comment #40) > > > We anticipate that the work for this is in bug 1264694, and we will try to > > > get somewhere without bug 1270172, but we really could use that information. > > > Right now it's been more than a week of actively looking at this problem > > > with limited success. > > > > For the record, at this point we've verified bug 1264694 is a media issue > > which most likely -not- the issue we're seeing here (for a number of > > reasons). > > > > The only way I've been able to reproduce this issue locally is simply create > > a load of work for the child process that simply takes longer than the > > shutdown kill timer (i.e. just load 40 complicated to load tabs and close > > the browser). This is perfectly possible, but it's hard to know if it's in > > any way related to what these users are having their child processes hung > > in. For what it's worth, if this -is- the cause, we should see bug 1262898 > > causing this -signature- to regress, but not overall crashiness, as > > previously these users would also have been hanging, just with a different > > stack trace. (I'll leave it to people more adapt at analyzing crash stats to > > determine if this is true this). > > Which shutdown kill timer? The main process hang monitor or the > ContentParent force kill timer? > > Looking at the signatures here it looks like the hang is in > CompositorBridgeParent::ShutDown() during XPCOM destruction. This happen > long after the last ContentParent is destroyed and the associated > ContentChild process is gone. Or are there other signatures lurking in > crashreports that I'm missing? This happens after the ContentParent is destroyed, but -not- after the child process is gone. Since the CompositorBridge and TextureBridge are still alive and we'll leave the child alive until those are shutdown. So no, the child process is -not- shutdown yet when we get to this, we are waiting for it to shutdown.

Flags: needinfo?(bas)

Jonathan Howard

Comment 45

•

9 years ago

From what I can tell (if code below is run); The child(ren) should be killed if this lasts more than 5 seconds, if not exited normally in time. https://dxr.mozilla.org/mozilla-central/rev/a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749 https://dxr.mozilla.org/mozilla-central/rev/a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108 https://dxr.mozilla.org/mozilla-central/rev/a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864

Comment 46

•

9 years ago

(In reply to Jonathan Howard from comment #45) > From what I can tell (if code below is run); The child(ren) should be killed > if this lasts more than 5 seconds, if not exited normally in time. > > https://dxr.mozilla.org/mozilla-central/rev/ > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749 > https://dxr.mozilla.org/mozilla-central/rev/ > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108 > https://dxr.mozilla.org/mozilla-central/rev/ > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864 I don't think this happens when the ContentParent is successfully shutdown, which in this case I believe it has. But last I checked was two weeks ago, I will check again (two weeks ago I could reproduce trivially by just putting an if (XRE_IsContentProcess()) { Sleep(70000); } in XPCOM shutdown.

Comment 47

•

9 years ago

(In reply to Jonathan Howard from comment #45) > From what I can tell (if code below is run); The child(ren) should be killed > if this lasts more than 5 seconds, if not exited normally in time. > > https://dxr.mozilla.org/mozilla-central/rev/ > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749 > https://dxr.mozilla.org/mozilla-central/rev/ > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108 > https://dxr.mozilla.org/mozilla-central/rev/ > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864 So, currently on trunk, yes, it seems like the children are killed very hard and fast indeed. I don't see this crash occurring after the 12th anymore either. Presumably something landed that solidly kills the child now, doesn't appear like it's sending crash reports for those child processes though.

Comment 48

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #47) > (In reply to Jonathan Howard from comment #45) > > From what I can tell (if code below is run); The child(ren) should be killed > > if this lasts more than 5 seconds, if not exited normally in time. > > > > https://dxr.mozilla.org/mozilla-central/rev/ > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749 > > https://dxr.mozilla.org/mozilla-central/rev/ > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108 > > https://dxr.mozilla.org/mozilla-central/rev/ > > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864 > > So, currently on trunk, yes, it seems like the children are killed very hard > and fast indeed. I don't see this crash occurring after the 12th anymore > either. Presumably something landed that solidly kills the child now, > doesn't appear like it's sending crash reports for those child processes > though. Not much here that looks suspect - http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=674a552743785c28c75866969aad513bd8eaf6ae&tochange=c3f5e6079284a7b7053c41f05d0fe06ff031db03

Comment 49

•

9 years ago

Windows Nightly crash symbolification broke about a week ago. Maybe that's related to there not being any signatures any more...

eh, I still one of the signatures showing up past the 12th anyway - https://crash-stats.mozilla.com/search/?product=Firefox&signature=%3Dshutdownhang+%7C+WaitForSingleObjectEx+%7C+PR_WaitCondVar+%7C+mozilla%3A%3ACondVar%3A%3AWait+%7C+nsEventQueue%3A%3AGetEvent&signature=%3Dshutdownhang+%7C+WaitForSingleObjectEx+%7C+WaitForSingleObject+%7C+PR_WaitCondVar+%7C+mozilla%3A%3ACondVar%3A%3AWait+%7C+nsEventQueue%3A%3AGetEvent&signature=%3Dshutdownhang+%7C+WaitForSingleObjectEx+%7C+WaitForSingleObject+%7C+PR_WaitCondVar+%7C+nsThread%3A%3AProcessNextEvent+%7C+NS_ProcessNextEvent+%7C+mozilla%3A%3Alayers%3A%3ACompositorBridgeParent%3A%3AShutDown&version=49.0a1&build_id=%3E20160512000000&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

Comment 50

•

9 years ago

Updated

•

9 years ago

Depends on: 1264694

Comment 51

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #47) > (In reply to Jonathan Howard from comment #45) > > From what I can tell (if code below is run); The child(ren) should be killed > > if this lasts more than 5 seconds, if not exited normally in time. > > > > https://dxr.mozilla.org/mozilla-central/rev/ > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749 > > https://dxr.mozilla.org/mozilla-central/rev/ > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108 > > https://dxr.mozilla.org/mozilla-central/rev/ > > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864 > > So, currently on trunk, yes, it seems like the children are killed very hard > and fast indeed. I don't see this crash occurring after the 12th anymore > either. Presumably something landed that solidly kills the child now, > doesn't appear like it's sending crash reports for those child processes > though. In my nightly build we currently shut the child process down using QuickExit here - http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2247 XPCOM shutdown won't get called. The series of events in each process look like: mozilla::dom::ContentParent::ShutDownProcess mozilla::dom::ContentChild::RecvShutdown mozilla::dom::ContentParent::RecvFinishShutdown mozilla::dom::ContentParent::ShutDownProcess mozilla::dom::ContentParent::ActorDestroy mozilla::dom::ContentChild::ActorDestroy ^ child process terminated mozilla::dom::ContentParent::ShutDownProcess mozilla::dom::ContentParent::ShutDownMessageManager mozilla::dom::ContentParent::~ContentParent However we're still seeing hangs in Nightly in CompositorBridgeParent::ShutDown(). https://crash-stats.mozilla.com/search/?product=Firefox&version=49.0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang+%7C+WaitForSingleObjectEx+%7C+WaitForSingleObject+%7C+PR_WaitCondVar+%7C+mozilla%3A%3ACondVar%3A%3AWait+%7C+nsEventQueue%3A%3AGetEvent&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports https://crash-stats.mozilla.com/report/index/87de9e85-93c0-4100-86bf-c3d332160517#allthreads Bas, any thoughts here?

Flags: needinfo?(bas)

Comment 52

•

9 years ago

Here's a fixed crashstats link, not sure what's going wrong on crash stats with permalinks - https://crash-stats.mozilla.com/search/?product=Firefox&version=49.0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang%20%7C%20WaitForSingleObjectEx%20%7C%20WaitForSingleObject%20%7C%20PR_WaitCondVar%20%7C%20mozilla%3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

Comment 53

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #52) > Here's a fixed crashstats link, not sure what's going wrong on crash stats > with permalinks - > > https://crash-stats.mozilla.com/search/?product=Firefox&version=49. > 0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang%20%7C%20WaitForSingl > eObjectEx%20%7C%20WaitForSingleObject%20%7C%20PR_WaitCondVar%20%7C%20mozilla% > 3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent&_facets=signature&_ > columns=date&_columns=signature&_columns=product&_columns=version&_columns=bu > ild_id&_columns=platform#crash-reports There's a mixture of main thread stacks here. Some of which are gfx shutdown related. I'll try to generate a breakdown, and go back and check the other signatures as well. It's possible something did land that improved this.

Comment 54

•

9 years ago

60% of these are gfx shutdown related, about 12% are necko related, and the rest are a mixed bag of weird shutdown hangs. At 60% the gfx portion of this is still in the top 10 on nightly, so I think we still have some work to do here. (I'm searching over the last three days for these stats.)

Updated

•

9 years ago

Comment 55

•

9 years ago

Here's a breakdown, open the proto signature tab: https://crash-stats.mozilla.com/search/?product=Firefox&version=49.0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang%20%7C%20WaitForSingleObjectEx%20%7C%20WaitForSingleObject%20%7C%20PR_WaitCondVar%20%7C%20mozilla%3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent&_facets=signature&_facets=proto_signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature

Comment 56

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #55) > Here's a breakdown, open the proto signature tab: > > https://crash-stats.mozilla.com/search/?product=Firefox&version=49. > 0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang%20%7C%20WaitForSingl > eObjectEx%20%7C%20WaitForSingleObject%20%7C%20PR_WaitCondVar%20%7C%20mozilla% > 3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent&_facets=signature&_ > facets=proto_signature&_columns=date&_columns=signature&_columns=product&_col > umns=version&_columns=build_id&_columns=platform#facet-proto_signature So theoretically somehow the child process is still being kept alive. So the one user I have that can reproduce the problem, did indeed still have his child process running when we got shutdown killed. I've asked him to create another video with the latest nightly to see if he can still reproduce the issue of the child process being frozen.

Flags: needinfo?(bas)

Comment 57

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #51) > (In reply to Bas Schouten (:bas.schouten) from comment #47) > > (In reply to Jonathan Howard from comment #45) > > > From what I can tell (if code below is run); The child(ren) should be killed > > > if this lasts more than 5 seconds, if not exited normally in time. > > > > > > https://dxr.mozilla.org/mozilla-central/rev/ > > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749 > > > https://dxr.mozilla.org/mozilla-central/rev/ > > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108 > > > https://dxr.mozilla.org/mozilla-central/rev/ > > > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864 > > > > So, currently on trunk, yes, it seems like the children are killed very hard > > and fast indeed. I don't see this crash occurring after the 12th anymore > > either. Presumably something landed that solidly kills the child now, > > doesn't appear like it's sending crash reports for those child processes > > though. > > In my nightly build we currently shut the child process down using QuickExit > here - > > http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2247 > > XPCOM shutdown won't get called. > > The series of events in each process look like: > > mozilla::dom::ContentParent::ShutDownProcess > mozilla::dom::ContentChild::RecvShutdown > mozilla::dom::ContentParent::RecvFinishShutdown > mozilla::dom::ContentParent::ShutDownProcess > mozilla::dom::ContentParent::ActorDestroy > mozilla::dom::ContentChild::ActorDestroy > ^ child process terminated > mozilla::dom::ContentParent::ShutDownProcess > mozilla::dom::ContentParent::ShutDownMessageManager > mozilla::dom::ContentParent::~ContentParent > > However we're still seeing hangs in Nightly in > CompositorBridgeParent::ShutDown(). > > https://crash-stats.mozilla.com/search/?product=Firefox&version=49. > 0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang+%7C+WaitForSingleObj > ectEx+%7C+WaitForSingleObject+%7C+PR_WaitCondVar+%7C+mozilla%3A%3ACondVar%3A% > 3AWait+%7C+nsEventQueue%3A%3AGetEvent&_facets=signature&_columns=date&_column > s=signature&_columns=product&_columns=version&_columns=build_id&_columns=plat > form#crash-reports > > https://crash-stats.mozilla.com/report/index/87de9e85-93c0-4100-86bf- > c3d332160517#allthreads > > Bas, any thoughts here? I must note, although it shouldn't be the cause in this bug.. there's some danger in not going through XPCOM shutdown.. as far as I know we don't do any automated leak testing for the parent process where the child process doesn't properly clean up after itself. If this is what we're going to do it's probably best if we test this configuration for leaks in the parent, somehow.

Comment 58

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #56) > (In reply to Jim Mathies [:jimm] from comment #55) > > Here's a breakdown, open the proto signature tab: > > > > https://crash-stats.mozilla.com/search/?product=Firefox&version=49. > > 0a1&build_id=%3E20160512000000&signature=%3Dshutdownhang%20%7C%20WaitForSingl > > eObjectEx%20%7C%20WaitForSingleObject%20%7C%20PR_WaitCondVar%20%7C%20mozilla% > > 3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent&_facets=signature&_ > > facets=proto_signature&_columns=date&_columns=signature&_columns=product&_col > > umns=version&_columns=build_id&_columns=platform#facet-proto_signature > > So theoretically somehow the child process is still being kept alive. So the > one user I have that can reproduce the problem, did indeed still have his > child process running when we got shutdown killed. I've asked him to create > another video with the latest nightly to see if he can still reproduce the > issue of the child process being frozen. I should add to that there's one other option, namely inside IPDL ActorDestroy not being properly called for an 'unexpectedly closing channel' which is what the QuickExit child termination will likely cause. In any case, we'll see if the child process stays alive for the user who had this issue.

Comment 59

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #51) > (In reply to Bas Schouten (:bas.schouten) from comment #47) > > (In reply to Jonathan Howard from comment #45) > > > From what I can tell (if code below is run); The child(ren) should be killed > > > if this lasts more than 5 seconds, if not exited normally in time. > > > > > > https://dxr.mozilla.org/mozilla-central/rev/ > > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#1749 > > > https://dxr.mozilla.org/mozilla-central/rev/ > > > a884b96685aa13b65601feddb24e5f85ba861561/dom/ipc/ContentParent.cpp#3108 > > > https://dxr.mozilla.org/mozilla-central/rev/ > > > a884b96685aa13b65601feddb24e5f85ba861561/xpcom/build/XPCOMInit.cpp#864 > > > > So, currently on trunk, yes, it seems like the children are killed very hard > > and fast indeed. I don't see this crash occurring after the 12th anymore > > either. Presumably something landed that solidly kills the child now, > > doesn't appear like it's sending crash reports for those child processes > > though. > > In my nightly build we currently shut the child process down using QuickExit > here - > > http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2247 > > XPCOM shutdown won't get called. Interestingly another note on this, is that if XPCOM shutdown isn't run and we just roughly exit.. bug 1262898 basically becomes a no-op as far as CompositorBridgeParent::Shutdown() in the parent is concerned (I did change when GeckoChildProcessHost gets destroyed but I don't really see how that could change when CompositorBridgeParent::Shutdown() gets called).. which makes the regression somewhat odd.

Comment 60

•

9 years ago

I've been able to reproduce this with a sleep statement in ContentChild::ActorDestroy. This simulates a long, slow teardown in PContentChild::OnChannelClose(). ContentParent shuts down cleanly such that the parent side stack matches up to what we're seeing on crashstats. The browser main thread gets caught waiting on gfx ipc channels to shutdown but the core problem isn't related to gfx. Bas has a contributor who captured a video of this happening - https://dl.dropboxusercontent.com/u/95157096/85f61cf7/hp9zrgzdlw.mp4 The interesting thing is is that in order to get into a situation where ContentParent is completely torn down ContentParent needs to receive an acknowledgment from the child that it knows shutdown is taking place. Here's the sequence: mozilla::dom::ContentParent::ShutDownProcess ContentParent calls SendShutdown() mozilla::dom::ContentChild::RecvShutdown mozilla::dom::ContentParent::RecvFinishShutdown mozilla::dom::ContentParent::ShutDownProcess ContentParent calls Close() mozilla::dom::ContentParent::ActorDestroy The ContentParent call to Close() on MessageChannel sends an async message to the child that kicks off ContentChild destruction. This is where the child gets hung up, but ContentParent is free to tear after this point. I think we can fix this by creating a timeout in ContentChild::RecvShutdown that terminates the child after a set period. I'm wondering though if this should run on a non-main thread, if gecko events aren't getting delivered, a main thread timer may not fire. We do not know yet what's is holding up child teardown. bug 1270172 might help.

Comment 61

•

9 years ago

Bill, any thought on the best way to work around this?

Flags: needinfo?(wmccloskey)

Comment 62

•

9 years ago

Attached patch child side shutdown timer — Details — Splinter Review

Flags: needinfo?(wmccloskey)

Attachment #8754942 - Flags: review?(wmccloskey)

Comment 63

•

9 years ago

try build for testing: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab66c1a5acdd

Comment 64

•

9 years ago

https://crash-stats.mozilla.com/report/index/96b0184c-ee33-48fa-b2fe-5af732160522#allthreads and https://crash-stats.mozilla.com/report/index/c7392e05-8479-4fc6-896d-00f2c2160522#allthreads are two out of three shutdown crashes I had on update in the past three days. Also happens on non-update shutdown, https://crash-stats.mozilla.com/report/index/bee53a5a-6810-4879-bc06-47e922160522#allthreads. I guess something between the Nightly of the 18th and 19th made this shutdown hang show up at 100% for me.

Comment 65

•

9 years ago

(In reply to Axel Hecht [:Pike] from comment #64) > https://crash-stats.mozilla.com/report/index/96b0184c-ee33-48fa-b2fe- > 5af732160522#allthreads > and > https://crash-stats.mozilla.com/report/index/c7392e05-8479-4fc6-896d- > 00f2c2160522#allthreads > > are two out of three shutdown crashes I had on update in the past three days. > > Also happens on non-update shutdown, > https://crash-stats.mozilla.com/report/index/bee53a5a-6810-4879-bc06- > 47e922160522#allthreads. > > I guess something between the Nightly of the 18th and 19th made this > shutdown hang show up at 100% for me. Could you try breaking the child process and getting us a stack for that?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 66

•

9 years ago

As far as underlying causes are concerned, it should be noted bug 1264694 seems to have fixed a try server error similar to this (bug 1268332). Indeed it seems that since that landed this crash has gone down somewhat (although it's tricky to be sure, it's definitely still happened). But some more time should get us some more data.

Assignee

Comment 67

•

9 years ago

Comment on attachment 8754942 [details] [diff] [review] child side shutdown timer Review of attachment 8754942 [details] [diff] [review]: ----------------------------------------------------------------- Whatever we do here should be in the parent. As you said, the timer may never run if the child is really bogged down. A thread would be better, but doing it in the parent would be best. Bug 1262898 appears to have regressed this. Before that patch, I believe that ProcessWatcher::EnsureProcessTerminated should have killed the child process after 2 seconds. That happened from the GeckoChildProcessHost destructor. With that patch, we don't run that destructor until after we wait for layers to shut down, which can take a long time. Bas, in bug 1262898 comment 1, you said that "GeckoChildProcessHost deletion will block the IO thread". I can see that we'll block when shutting down the I/O thread, but it seems like that happens much later in shutdown than when we shutdown layers stuff. I must be misunderstanding something. Maybe we could meet to talk about this tomorrow?

Attachment #8754942 - Flags: review?(wmccloskey) → review-

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 68

•

9 years ago

Needinfo to Bas; see above comment.

Flags: needinfo?(bas)

Comment 69

•

9 years ago

(In reply to Bill McCloskey (:billm) from comment #67) > Comment on attachment 8754942 [details] [diff] [review] > child side shutdown timer > > Review of attachment 8754942 [details] [diff] [review]: > ----------------------------------------------------------------- > > Whatever we do here should be in the parent. As you said, the timer may > never run if the child is really bogged down. A thread would be better, but > doing it in the parent would be best. > > Bug 1262898 appears to have regressed this. Before that patch, I believe > that ProcessWatcher::EnsureProcessTerminated should have killed the child > process after 2 seconds. That happened from the GeckoChildProcessHost > destructor. With that patch, we don't run that destructor until after we > wait for layers to shut down, which can take a long time. > > Bas, in bug 1262898 comment 1, you said that "GeckoChildProcessHost deletion > will block the IO thread". I can see that we'll block when shutting down the > I/O thread, but it seems like that happens much later in shutdown than when > we shutdown layers stuff. I must be misunderstanding something. Maybe we > could meet to talk about this tomorrow? Yes, layers shutdown occurs before GeckoChildProcessHost destruction, it has to. If we need the timer in the parent we'll somehow have to initiate it before layers shutdown, it can then force kill it off the main thread in the parent if it takes too long. What the patches in bug 1262898 do is wait with killing the GeckoChildProcessHost until all IPDL actors are destroyed (It used to be scheduled for destruction as soon as ContentParent::ActorDestroy ran). Before that we were just relying on killing the parent and never cleaning up IPDL actors that weren't on the main thread properly (they would just get destroyed on the child process terminating). It might be hard to find a time tomorrow. But Wednesday I should be fine most of the day.

Flags: needinfo?(bas)

Comment 70

•

9 years ago

Attached file Sample of Nightly Chrome.txt — Details

I did a shutdown, and sampled the chrome and the content process. Note, neither process uses cpu significantly during the hang, and the child process never dies. I actually realized last night that I've got 4 process from previous hangs still running, after OSX complained about the lack of memory. This is the chrome process sample.

Comment 71

•

9 years ago

Attached file Sample of Nightly Web Content.txt — Details

... and the sample of the content process.

Comment 72

•

9 years ago

(In reply to Axel Hecht [:Pike] from comment #70) > Created attachment 8755802 [details] > Sample of Nightly Chrome.txt > > I did a shutdown, and sampled the chrome and the content process. > > Note, neither process uses cpu significantly during the hang, and the child > process never dies. > > I actually realized last night that I've got 4 process from previous hangs > still running, after OSX complained about the lack of memory. > > This is the chrome process sample. Can you break the child process so we just have call stacks for all the threads?

Gian-Carlo Pascutto [:gcp]

Comment 73

•

9 years ago

Attached file bt all in content — Details

So this depends on profile, too. With the temporary debug profile, everything is fine. This is `bt all` from my local build, with my usual profile.

Comment 74

•

9 years ago

See bug 1275524. Spiking of this signature may also be related to AMD driver updates.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 76

•

9 years ago

(In reply to Axel Hecht [:Pike] from comment #74) > Created attachment 8756368 [details] > bt all in content > > So this depends on profile, too. With the temporary debug profile, > everything is fine. > > This is `bt all` from my local build, with my usual profile. Hmm... this child doesn't even appear to be in the process of shutting down.

Assignee

Comment 77

•

9 years ago

Attached patch QuickExit sooner (obsolete) — Details — Splinter Review

This patch tries to fix the bug that Jim identified in comment 60. We still don't know if that's the real issue here, but this patch should fix it. The bad sequence Jim found is as follows: 1. Parent sends Shutdown message. 2. Child receives this message, does some cleanup, and sends FinishShutdown. 3. Parent receives FinishShutdown, cancels the KillHard timer, and Closes the IPC channel. 4. Child gets an ActorDestroy notification when the channel is closed and does QuickExit. In between steps 3 and 4, the child could decide to do more work in its event loop that could block for a long time. The parent's KillHard timer has already been canceled at that point. This patch avoids that by moving the QuickExit to the end of RecvShutdown. The QuickExit happens on the I/O thread to ensure that it happens after any messages sent by RecvShutdown are actually sent. I do think we should restore the way that EnsureProcessTerminated was called before bug 1262898. However, Bas pointed out to me today that this function is blocking on Windows. That's bad for a lot of reasons. I'll look into that issue and hopefully we can fix it separately.

Attachment #8756629 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

9 years ago

Attachment #8756629 - Flags: review?(dvander) → review+

Comment 78

•

9 years ago

(In reply to Bill McCloskey (:billm) from comment #77) > This patch avoids that by moving the QuickExit to the end of RecvShutdown. > The QuickExit happens on the I/O thread to ensure that it happens after any > messages sent by RecvShutdown are actually sent. One concern I had with an earl(ier) termination was a loss of telemetry data. My understanding is that we send this data at content process shutdown, not sure when though.

Updated

•

9 years ago

Depends on: 1274725

Updated

•

9 years ago

Blocks: 1276330

Updated

•

9 years ago

Blocks: 1274725

No longer depends on: 1274725

Bill McCloskey [inactive unless it's an emergency] (:billm)

Updated

•

9 years ago

No longer blocks: 1274725

Assignee

Comment 80

•

9 years ago

Here's an update on this bug: The patch I posted won't work on Windows. I looked into how named pipes handle the termination of one side. It looks like any data remaining in the pipe won't necessarily be received by the other side. So I don't think that patch is safe. I was able to get a better idea of what's going on, based on the STR in bug 1274725. One open question was how we could end up hanging for so long in the child. It turns out it can happen as follows: 1. Child has an "unload" handler that sends a sync XHR. This happens off of TabChild::RecvDestroy. It spins a nested event loop. 2. The child receives the PContent Shutdown message from within this nested event loop and responds to it. 3. The parent Close()s the PContent IPC channel, which normally would trigger ActorDestroy in the child and cause it to die. 4. However, we will never invoke ActorDestroy when there is IPDL code on the stack. In this case there is: the TabChild::RecvDestroy handler is still running. So we keep postponing running ActorDestroy. 5. Since the parent is no longer connected to the child, there's no chance for the sync XHR to finish, so the child spins forever. ActorDestroy never runs, and it never exits. 6. Consequently, the parent sits forever waiting for the compositor to shut down. My preferred solution here is to make EnsureProcessTerminated be non-blocking on Windows and back out the patch for bug 1262898. However, I'm having trouble getting that through try server. If I can't make it work in the next day or so, I'll prepare some sort of patch that calls EnsureProcessTerminated only in opt builds. That should work around the testing issues I think.

Comment 81

•

9 years ago

(In reply to Bill McCloskey (:billm) from comment #80) > My preferred solution here is to make EnsureProcessTerminated be > non-blocking on Windows and back out the patch for bug 1262898. However, I'm > having trouble getting that through try server. If I can't make it work in > the next day or so, I'll prepare some sort of patch that calls > EnsureProcessTerminated only in opt builds. That should work around the > testing issues I think. How does making EnsureProcessTerminated be non-blocking fix the case of the hung child process? (If we're hung spinning an event loop in the child, we could use that shutdown kill timer patch.)

Jonathan Howard

Comment 82

•

9 years ago

Would running the timer's callback (mForceKillTimer) in ContentParent destructor be a reasonably quick fix? (As opposed to alternative of having it live longer than ContentParent.) (Assumes the child hang then is safe left to just join the list of other IPCError-browser | ShutDownKill )

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 83

•

9 years ago

Hrm, I'm still confused why the correct solution isn't to have a separate watchdog for the child started on its own thread. Which just gives the child <x> amount of time and then shuts it down. It seems like that would be the easiest way to guarantee the child always dies within a certain amount of time after a shutdown is sent. And in cases of a child that isn't hanging it allows all of shutdown to occur neatly.

Assignee

Comment 84

•

9 years ago

Comment on attachment 8754942 [details] [diff] [review] child side shutdown timer OK, let's land this as a stopgap and backport it to Aurora. It looks like it will probably fix most of these problems. I'll continue to work on getting EnsureProcessTerminated to work the way I want.

Attachment #8754942 - Flags: review- → review+

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 85

•

9 years ago

The reason I like the EnsureProcessTerminated solution is that it works no matter what craziness is happening in the child: it could be in an infinite loop or exploited by hackers and EnsureProcessTerminated will still kill it. In addition, the parent needs to call EnsureProcessTerminated regardless of what we do here, so we might as well call it at the right time so that it kills the child when necessary. I've found one bug so far that was causing some tryserver issues. We seem to be killing the child in DEBUG builds from the ChildProcessHost destructor, which is bad. Hopefully that will resolve all the issues I'm seeing.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 86

•

9 years ago

(In reply to Bill McCloskey (:billm) from comment #84) > Comment on attachment 8754942 [details] [diff] [review] > child side shutdown timer > > OK, let's land this as a stopgap and backport it to Aurora. It looks like it > will probably fix most of these problems. I'll continue to work on getting > EnsureProcessTerminated to work the way I want. Sounds good, I'll land it. Is there a content data loss risk here? Sounds like it - the data these sites are send in that sync xhr request might be lost. I think we've discussed issues with sync xhr on tab close in the past but I don't remember our decision on it.

Keywords: leave-open

Pulsebot

Comment 87

•

9 years ago

Pushed by jmathies@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/3723ba4d368b Add a safety shutdown timer in ContentChild. r=billm

Assignee

Updated

•

9 years ago

Attachment #8756629 - Attachment is obsolete: true

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 88

•

9 years ago

Attached patch make EnsureProcessTerminated non-blocking on Windows — Details — Splinter Review

Here's a summary of this bug. When we invoke the GeckoChildProcessHost destructor, it calls EnsureProcessTerminated. In opt builds, that function kills the child after 2 seconds. In debug builds, it waits forever. Before bug 1262898, we deleted the GeckoChildProcessHost in ContentParent::ActorDestroy. That caused an issue because: 1. On Windows, in debug builds, EnsureProcessTerminated synchronously waits on the I/O thread for the child to die. 2. ContentParent::ActorDestroy happens before ShutdownLayersIPC(). When the main thread runs ShutdownLayersIPC(), the parent spins waiting for the child to tell it that layers have shut down. However, it will never get that message because the I/O thread is synchronously blocked waiting for the child to die. Bug 1262898 fixed that problem by deleting GeckoChildProcessHost later--after ShutdownLayersIPC. However, this means that the 2 second kill timer in opt builds won't be started until after ShutdownLayersIPC synchronously waits for the compositor to shut down. If the child is stuck for some reason, it can wait forever and we have no kill timer. My approach is to go back to deleting the GeckoChildProcessHost from ContentParent::ActorDestroy. However, I'm going to make EnsureProcessTerminated be non-blocking on Windows. (It's already non-blocking on POSIX platforms.) This patch mostly just copies the POSIX EnsureProcessTerminated implementation to Windows but uses Win32 calls instead of POSIX.

Assignee: bas → wmccloskey

Status: NEW → ASSIGNED

Attachment #8759401 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 89

•

9 years ago

Attached patch remove extra EnsureProcessTerminated — Details — Splinter Review

It turns out that we actually have two EnsureProcessTerminated calls. One happens in ~GeckoChildProcessHost and the other in ~ChildProcessHost. Since GeckoChildProcessHost inherits from ChildProcessHost, they both run. Due to some other weirdness, the second one doesn't run on POSIX platforms because handle() is null. Additionally, the second call doesn't distinguish between DEBUG and opt builds. This patch removes the second call.

Attachment #8759402 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 90

•

9 years ago

Attached patch back out bug 1262898 — Details — Splinter Review

This patch reverts things so we destroy the GeckoChildProcessHost in ContentParent::ActorDestroy.

Attachment #8759403 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 91

•

9 years ago

In working on this I found a ton of dead code. I'll remove that in bug 1277705.

https://hg.mozilla.org/mozilla-central/rev/3723ba4d368b

Comment 92

•

9 years ago

bugherder

Comment 93

•

9 years ago

Comment on attachment 8754942 [details] [diff] [review] child side shutdown timer Approval Request Comment [Feature/regressing bug #]: bug 1264694 [User impact if declined]: hanging child processes on shutdown [Describe test coverage new/current, TreeHerder]: on mc 6-3 build, crash stats results look good [Risks and why]: low [String/UUID change made/needed]: none

Attachment #8754942 - Flags: approval-mozilla-aurora?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 94

•

9 years ago

removing a duplicate signature

Assignee

Comment 95

•

9 years ago

Attached patch wait for nested event loops to finish at shutdown — Details — Splinter Review

Besides all the issues with kill timers not working here, the way we get to these stuck content processes is via a nested event loop. See the numbered list in comment 80 for a description of how that happens. I think a reasonable way to handle this is to postpone handling the Shutdown message until we're not in a nested event loop. This patch does that (although a bit hackily--perhaps there's a better way?). I also considered making the guarantee that, as long as a protocol never uses RPC messages, we'll never run the handler for a normal priority IPDL message when another handler is already on the stack. This would be a bit tricky because MessageChannel doesn't know whether the protocol uses RPC. I also wonder if it would break things. The only message that I know of that we expect to spin a nested event loop is CreateWindow (so it can create a new XUL window), and I think it would be fine not to process any other messages during that time. But it's still a bit risky. Another approach would be to take this patch and then make the larger change (from the previous paragraph) in a separate bug. I think we might want to backport this patch since it sorta affects web compat (since we're failing to do these sync XHRs). The larger change we probably shouldn't backport.

Attachment #8759907 - Flags: review?(benjamin)

Sylvestre Ledru [:Sylvestre]

Comment 96

•

9 years ago

Comment on attachment 8754942 [details] [diff] [review] child side shutdown timer Fix a shutdown hang, taking it

Attachment #8754942 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Comment 97

•

9 years ago

Moving remaining work over to tracking+ p1.

tracking-e10s: m9+ → +

Priority: -- → P1

https://hg.mozilla.org/releases/mozilla-aurora/rev/886d96d45721

Comment 98

•

9 years ago

bugherder uplift

status-firefox48: affected → fixed

Valentin Gosu [:valentin]

Updated

•

9 years ago

No longer blocks: 1276330

Updated

•

9 years ago

Updated

•

9 years ago

Blocks: 1278443

Bill McCloskey [inactive unless it's an emergency] (:billm)

Updated

•

9 years ago

Blocks: 1277582, 1277961

Francesco Lodolo [:flod]

Comment 100

•

9 years ago

I started having shutdown problems on Nightly (OS X), about at the same time this fix landed on m-c, not sure if it's related. It happens on both quitting and restarting for an update: the Nightly Web Content process goes up to 100% CPU, Nightly is marked as not responding. I have to force quite the Content process to actually quit Nightly. No request to send crash reports after I restart, no crashes in about:crashes I tried with a clean profile and it doesn't happen. Any suggestion on how to debug this to file a useful bug?

Benjamin Smedberg

Comment 101

•

9 years ago

Comment on attachment 8759907 [details] [diff] [review] wait for nested event loops to finish at shutdown This is really ugly, but it's a reasonable short-term bandaid. Although I think we should file/consider a followup world where we either manage to remove this method altogether, or force-break out of any nested event loops in this kind of situation. My understanding is that this will be pretty common if there are tab-modal alert()s active during shutdown.

Attachment #8759907 - Flags: review?(benjamin) → review+

Pulsebot

Comment 102

•

9 years ago

Pushed by wmccloskey@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/9c34a1f781d9 Wait to finish nested event loops before shutdown (r=bsmedberg)

Assignee

Comment 103

•

9 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #101) > Comment on attachment 8759907 [details] [diff] [review] > wait for nested event loops to finish at shutdown > > This is really ugly, but it's a reasonable short-term bandaid. Although I > think we should file/consider a followup world where we either manage to > remove this method altogether, or force-break out of any nested event loops > in this kind of situation. Well, it's a bit tricky. This seems to be appearing as a sync XHR from an unload event during shutdown. That's actually something we're supposed to support, as far as I know. It looks like people are supposed to move towards navigator.sendBeacon. I have no idea if that even works at all during shutdown (with or without e10s). It seems like it won't. > My understanding is that this will be pretty common if there are tab-modal > alert()s active during shutdown. Not sure about alerts, but sync XHRs seem to be really common.

https://hg.mozilla.org/mozilla-central/rev/9c34a1f781d9

Comment 104

•

9 years ago

bugherder

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 105

•

9 years ago

Comment on attachment 8759401 [details] [diff] [review] make EnsureProcessTerminated non-blocking on Windows Review of attachment 8759401 [details] [diff] [review]: ----------------------------------------------------------------- ::: ipc/chromium/src/chrome/common/process_watcher_win.cc @@ +42,5 @@ > + base::CloseProcessHandle(process_); > + process_ = 0; > + > + // XXX don't think this is necessary, since destruction can only > + // be observed once, but can't hurt nit: Remove the XXX comment, other watchers seem to just call this so might as well follow suit or delete them all.

Attachment #8759401 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

9 years ago

Attachment #8759402 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 106

•

9 years ago

Comment on attachment 8759403 [details] [diff] [review] back out bug 1262898 Review of attachment 8759403 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/ipc/CompositorBridgeParent.cpp @@ +848,5 @@ > void > CompositorBridgeParent::ActorDestroy(ActorDestroyReason why) > { > + if (why != NormalShutdown) { > + printf("ActorDestroy compositor abnormal\n"); nit: debug printf ::: ipc/chromium/src/chrome/common/process_watcher_win.cc @@ +115,4 @@ > if (force) { > RefPtr<mozilla::Runnable> task = new ChildReaper(process, force); > + loop->PostDelayedTask(task.forget(), > + kWaitInterval); nit: Looks like this can be one line now.

Attachment #8759403 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

9 years ago

Blocks: 1271180

Tobias B. Besemer [:BesTo] (QA)

Comment 107

•

9 years ago

https://crash-stats.mozilla.com/report/index/03d6c12a-d64b-4ab8-8e66-b72ef2160615

status-firefox50: --- → affected

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 108

•

9 years ago

That is a different issue. We're taking too long trying to shutdown a networking thread (thread #6). I don't see an existing bug for it, so please file one under Necko.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 109

•

9 years ago

Comment on attachment 8759907 [details] [diff] [review] wait for nested event loops to finish at shutdown Approval Request Comment [Feature/regressing bug #]: e10s [User impact if declined]: data loss if a page tries to save content using a sync XHR, content process could take longer than expected to shut down [Describe test coverage new/current, TreeHerder]: on m-c for over a week [Risks and why]: There's some risk here, but the patch is intentionally as simple as possible so that it can be backported. At worst, I think we could end up with more content process shutdown hangs. [String/UUID change made/needed]: None

Attachment #8759907 - Flags: approval-mozilla-beta?

Attachment #8759907 - Flags: approval-mozilla-aurora?

Gerry Chang [:gchang]

Comment 110

•

9 years ago

Comment on attachment 8759907 [details] [diff] [review] wait for nested event loops to finish at shutdown Review of attachment 8759907 [details] [diff] [review]: ----------------------------------------------------------------- This might fix the shutdown hang issue. Take it in beta & aurora.

Attachment #8759907 - Flags: approval-mozilla-beta?

Attachment #8759907 - Flags: approval-mozilla-beta+

Attachment #8759907 - Flags: approval-mozilla-aurora?

Attachment #8759907 - Flags: approval-mozilla-aurora+

Gerry Chang [:gchang]

Comment 111

•

9 years ago

Hi Bill & Milan, Can we follow up after uplifting to beta & aurora and see what will happen based on comment #109?

Flags: needinfo?(wmccloskey)

Flags: needinfo?(milan)

https://hg.mozilla.org/releases/mozilla-aurora/rev/0cb189f7380f

Comment 112

•

9 years ago

bugherder uplift

status-firefox49: affected → fixed

Pulsebot

Comment 113

•

9 years ago

Pushed by wmccloskey@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/9fb1b2d90690 Make EnsureProcessTerminated non-blocking on Windows (r=dvander) https://hg.mozilla.org/integration/mozilla-inbound/rev/7f153687ee15 Remove extra EnsureProcessTerminated (r=dvander) https://hg.mozilla.org/integration/mozilla-inbound/rev/a736ae904fda Go back to ContentParent controlling process shutdown (i.e., backout bug 1262898) (r=dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

9 years ago

Flags: needinfo?(wmccloskey)

Keywords: leave-open

Phil Ringnalda (:philor)

Comment 114

•

9 years ago

backout bugherder

https://hg.mozilla.org/mozilla-central/rev/a736ae904fda

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-firefox50: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla50

Phil Ringnalda (:philor)

Comment 115

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/9fb1b2d90690 https://hg.mozilla.org/mozilla-central/rev/7f153687ee15

Updated

•

9 years ago

Depends on: 1264073

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 116

•

9 years ago

Setting status-firefox48 back to affected so the new patch shows up on the uplift radar. Bill, was only the one patch supposed to land on Aurora or were all 3?

status-firefox48: fixed → affected

Flags: needinfo?(wmccloskey)

Assignee

Comment 117

•

9 years ago

The only patches that need backporting are "wait for nested event loops to finish at shutdown" and "child side shutdown timer". They should both be backported as far as possible. It looks like the "wait for nested event loops" patch hasn't landed on beta. Is there a reason for that?

Flags: needinfo?(wmccloskey)