Closed Bug 1423261 Opened 6 years ago Closed 6 years ago

Crash in mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent

Categories

(Core :: DOM: Content Processes, defect, P1)

58 Branch
All
Windows
defect

Tracking


RESOLVED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- fixed
firefox57 --- unaffected
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 - wontfix
firefox61 --- fixed

People

(Reporter: philipp, Assigned: spohl)

References

Details

(4 keywords)

Crash Data

Attachments

(3 files, 9 obsolete files)

2.04 KB, patch, jimm: review+
2.03 KB, patch, jimm: review+
4.00 KB, patch, jimm: review+
This bug was filed from the Socorro interface and is
report bp-0077eff1-a75a-4c26-ac0c-0dd730171205.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll mozilla::ipc::MessageChannel::Clear ipc/glue/MessageChannel.cpp:708
1 xul.dll mozilla::ipc::MessageChannel::~MessageChannel ipc/glue/MessageChannel.cpp:581
2 xul.dll mozilla::dom::PContentParent::~PContentParent ipc/ipdl/PContentParent.cpp:264
3 xul.dll mozilla::dom::ContentParent::cycleCollection::DeleteCycleCollectable dom/ipc/ContentParent.h:316
4 xul.dll SnowWhiteKiller::~SnowWhiteKiller xpcom/base/nsCycleCollector.cpp:2729
5 xul.dll nsCycleCollector::FreeSnowWhite xpcom/base/nsCycleCollector.cpp:2917
6 xul.dll nsCycleCollector::Shutdown xpcom/base/nsCycleCollector.cpp:3987
7 xul.dll nsCycleCollector_shutdown xpcom/base/nsCycleCollector.cpp:4373
8 xul.dll mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:973
9 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1508

=============================================================

Crashes during shutdown with this signature are (re)appearing in Firefox 58; the same signature was fixed once before in bug 1363601.
On 58.0b this is currently the #3 browser top crash. All of the crashes are from Windows and come with MOZ_CRASH(MessageChannel destroyed without being closed).

It first regressed in 58.0a1 build 20171003100226, so a likely pushlog containing the regression would be: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=15f221f491f707b1e8e46da344b6dd5a394b1242&tochange=11fe0a2895aab26c57bcfe61b3041d7837e954cd
Hey Gabor, can you take a quick look here? You fixed this previously, had something to do with preallocated processes.
Flags: needinfo?(gkrizsanits)
(In reply to [:philipp] from comment #0)
> This bug was filed from the Socorro interface and is
> report bp-0077eff1-a75a-4c26-ac0c-0dd730171205.
> =============================================================
> 
> Top 10 frames of crashing thread:
> 
> 0 xul.dll mozilla::ipc::MessageChannel::Clear ipc/glue/MessageChannel.cpp:708
> 1 xul.dll mozilla::ipc::MessageChannel::~MessageChannel
> ipc/glue/MessageChannel.cpp:581
> 2 xul.dll mozilla::dom::PContentParent::~PContentParent
> ipc/ipdl/PContentParent.cpp:264
> 3 xul.dll
> mozilla::dom::ContentParent::cycleCollection::DeleteCycleCollectable
> dom/ipc/ContentParent.h:316
> 4 xul.dll SnowWhiteKiller::~SnowWhiteKiller
> xpcom/base/nsCycleCollector.cpp:2729
> 5 xul.dll nsCycleCollector::FreeSnowWhite
> xpcom/base/nsCycleCollector.cpp:2917
> 6 xul.dll nsCycleCollector::Shutdown xpcom/base/nsCycleCollector.cpp:3987
> 7 xul.dll nsCycleCollector_shutdown xpcom/base/nsCycleCollector.cpp:4373
> 8 xul.dll mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:973
> 9 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup
> toolkit/xre/nsAppRunner.cpp:1508
> 
> =============================================================
> 
> crashes during shutdown with this signature are (re)appearing in firefox 58
> - the same signature got fixed once before in bug 1363601.
> on 58.0b this is currently the #3 browser top crash, all the crashes are
> from windows and come with MOZ_CRASH(MessageChannel destroyed without being
> closed).
> 
> it first regressed on 58.0a1 build 20171003100226, so a likely pushlog
> containing the regression would be:
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=15f221f491f707b1e8e46da344b6dd5a394b1242&tochange=11fe
> 0a2895aab26c57bcfe61b3041d7837e954cd

The patch that re-enables the preallocated process manager is in this range. However, that wasn't a big change back then, so I'm not convinced that it is related. (Note that we see crashes with this signature even on the 57 release, where the ppm is disabled completely.) Something seems to have changed around November 15th-ish: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A~MessageChannel%20%7C%20mozilla%3A%3Adom%3A%3APContentParent%3A%3A~PContentParent&date=%3E%3D2017-09-01T13%3A35%3A00.000Z&date=%3C2017-12-07T13%3A35%3A59.000Z#graphs

That's where the crash rate went up like crazy. I'm not sure why; I could not find anything suspicious yet.
Flags: needinfo?(gkrizsanits)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #2)
> Something seems to be changed around November 15th-ish:
> https://crash-stats.mozilla.com/signature/
> ?product=Firefox&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear%20
> %7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A~MessageChannel%20%7C%20mozil
> la%3A%3Adom%3A%3APContentParent%3A%3A~PContentParent&date=%3E%3D2017-09-
> 01T13%3A35%3A00.000Z&date=%3C2017-12-07T13%3A35%3A59.000Z#graphs
> 
> that's where the crash-rate went up like crazy, I'm not sure why, could not
> found anything suspicious yet.

That's when 58 was released/pushed to the beta population.

If you focus on crashes on nightly, you can see they started appearing on October 3rd on that channel (the pushlog of the changes is in comment #0): https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A~MessageChannel%20%7C%20mozilla%3A%3Adom%3A%3APContentParent%3A%3A~PContentParent&date=%3E%3D2017-09-01T15%3A35%3A00.000Z#graphs
Gabor, does anything in the range from Comment #3 stick out to you?
Flags: needinfo?(gkrizsanits)
(In reply to Mike Taylor [:miketaylr] (58 Regression Engineering Owner) from comment #4)
> Gabor, does anything in the range from Comment #3 stick out to you?

Based on Comment 3, it's probably the preallocated process manager (Bug 1385249)

I've looked into this over the last couple of days, but without being able to reproduce it, it's quite hard to tell what's going on. It's a Windows-only shutdown crash, and it seems like it can happen without the ppm, but the ppm makes it (much) more likely to happen. What _should_ make sure that this never happens is that during shutdown we explicitly spin the event loop and wait for the channels to close for all content processes, including the ppm, here:
https://searchfox.org/mozilla-central/rev/f6f1731b1b7fec332f86b55fa40e2c9ae67ac39b/dom/ipc/ContentParent.cpp#2764
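A minimal sketch of that wait, assuming the SpinEventLoopUntil helper from nsThreadUtils.h; the predicate is illustrative, not the real ContentParent code:

    // Sketch only: spin the main-thread event loop until every content
    // process channel reports closed (hypothetical predicate).
    mozilla::SpinEventLoopUntil([&]() {
      return AllContentProcessChannelsClosed();
    });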

and shortly before that we make sure that no more preallocated processes are spawned during shutdown, here:
https://searchfox.org/mozilla-central/rev/f6f1731b1b7fec332f86b55fa40e2c9ae67ac39b/dom/ipc/PreallocatedProcessManager.cpp#139

I discovered that the ppm should probably release the preallocated process right after the channel is closed (this currently happens a bit later: https://searchfox.org/mozilla-central/rev/f6f1731b1b7fec332f86b55fa40e2c9ae67ac39b/dom/ipc/PreallocatedProcessManager.cpp#84)

What might also cause some trouble is that mForceKillTimer might fire while we're waiting for the channel, but I fail to see how that would leave the channel open.
Flags: needinfo?(gkrizsanits)
I guess it's worth a try to release the pp earlier and see if that helps, but this will probably require some more work to understand the real underlying issue around content process shutdown. Another approach is to convert the failing release assertion during shutdown into something more resilient.
Jim/Blake, any ideas of how to progress here?
Flags: needinfo?(mrbkap)
Flags: needinfo?(jmathies)
Assignee: nobody → mrbkap
Flags: needinfo?(mrbkap)
This is almost certainly the same problem that billm found in bug 1138520, comment 26. In particular, KillHard either isn't killing the (hung?) child process or is, at least, killing the process without notifying the parent and closing the channel.

I don't know too much about Windows IPC stuff and especially its interaction with force killing processes. Jim, do you know what could cause this? It's worth noting that in the past, this seems to have happened for hung child processes (and those processes stayed hung for at least a minute, triggering the parent process's shutdown hang killer).

We probably should be figuring out what's happening to these processes.
[Tracking Requested - why for this release]:
So far in 58rc1 this is the #2 top browser crash (close to 3% of all browser crashes).
If we can get a fix ready, this seems like a good candidate for inclusion in a potential 58.0.1 release. GChang, FYI.
Flags: needinfo?(gchang)
Hey Bob, do you have any ideas for how to track this down? I'm wondering if the child process ever starts up at all.

One thing I've noticed is that we never seem to upgrade the process priority of the preallocated processes that we create. I suppose it might be possible, then, that such a process never gets scheduled. I'll file a new bug to track that.
Flags: needinfo?(bobowencode)
(In reply to Blake Kaplan (:mrbkap) from comment #11)
> One thing I've noticed is that we don't seem to ever upgrade the process
> priority of the preallocated processes that we create. I suppose it might be
> possible, then, that it doesn't get scheduled ever. I'll file a new bug to
> track that.

After reading more code, I see that setting process priorities isn't enabled on any of our current platforms anyway, so fixing this won't have any effect.
I'll track this as a possible dot release candidate for 58.
Flags: needinfo?(gchang)
Blocks: 1385249
This also impacts both 64-bit and 32-bit builds, although 32-bit has a much higher incidence rate.
(In reply to Jim Mathies [:jimm] from comment #14)
> * uptimes are between 25 and about 130 seconds, none higher.
> 
> I'm not sure what those uptimes imply but I find it interesting. An odd
> timing window.

I think Super Search is messing with me here; uptimes do go higher, they just don't show up in the search results.
Flags: needinfo?(jmathies)
(In reply to Blake Kaplan (:mrbkap) from comment #8)
> This is almost certainly the same problem that billm found in bug 1138520,
> comment 26. In particular, KillHard either isn't killing the (hung?) child
> process or, at least, killing the process without notifying the parent and
> closing the channel.

I don't think it's the responsibility of the child process to communicate shutdown. The KillProcess call in KillHard's OnGenerateMinidumpComplete calls directly through to TerminateProcess. I've never seen that call fail. There's a scary wait in here, btw [1]: 60 seconds waiting for the process to shut down.

> I don't know too much about Windows IPC stuff and especially its interaction
> with force killing processes. Jim, do you know what could cause this? It's

Not yet. 

One note: I don't think this is a case where the child never launches, since we open the handle [2] right before we try to terminate it. That call should fail for an invalid handle.

[1] https://searchfox.org/mozilla-central/source/ipc/chromium/src/base/process_util_win.cc#427
[2] https://searchfox.org/mozilla-central/source/dom/ipc/ContentParent.cpp#3161
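For reference, a self-contained sketch of the Win32 kill-and-wait pattern described above (simplified; not the actual base::KillProcess code behind [1]):

    #include <windows.h>

    // Terminate the child and then block, for up to 60 seconds, until the
    // process object is signaled (i.e. the process has actually exited).
    bool KillProcessAndWait(HANDLE aProcess, unsigned int aExitCode) {
      if (!::TerminateProcess(aProcess, aExitCode)) {
        return false;  // fails for an invalid handle, among other things
      }
      return ::WaitForSingleObject(aProcess, 60 * 1000) == WAIT_OBJECT_0;
    }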
Flags: needinfo?(jmathies)
Flags: needinfo?(jmathies)
Flags: needinfo?(bobowencode)
Assignee: mrbkap → nobody
Priority: -- → P2
We call MOZ_CRASH[1] because we believe the MessageChannel not to be closed. We rely on Unsound_IsClosed()[2] to determine whether or not the channel is closed. But the method name implies that it isn't sound to rely 100% on its return value, and this is further confirmed by the following comment[3]:

    // Unsound_IsClosed and Unsound_NumQueuedMessages are safe to call from any
    // thread, but they make no guarantees about whether you'll get an
    // up-to-date value; the values are written on one thread and read without
    // locking, on potentially different threads.  Thus you should only use
    // them when you don't particularly care about getting a recent value (e.g.
    // in a memory report).

It seems like we shouldn't crash the application based on a possibly out-of-date value. Even if there is a true issue here and the message channel has indeed not been shut down yet, we should check this based on a guaranteed up-to-date value to get an accurate measure of how prevalent this problem is.
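For context, a minimal sketch of the check being discussed (simplified from the code at [1]; not the exact source):

    // Sketch of MessageChannel::Clear()'s shutdown check: crash if the
    // channel still looks open when the channel object is being destroyed.
    if (!Unsound_IsClosed()) {
      MOZ_CRASH("MessageChannel destroyed without being closed");
    }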

Bill, I see that this was added in bug 1119878. Do you agree with the above and do you happen to know of a way to check whether the message channel is closed in a guaranteed up-to-date way?

[1] https://hg.mozilla.org/releases/mozilla-beta/annotate/73ef186ad51a/ipc/glue/MessageChannel.cpp#l704
[2] https://hg.mozilla.org/releases/mozilla-beta/annotate/73ef186ad51a/ipc/glue/MessageChannel.cpp#l701
[3] https://hg.mozilla.org/releases/mozilla-beta/annotate/73ef186ad51a/ipc/glue/MessageChannel.h#l310
Flags: needinfo?(bill.mccloskey)
I think in this case it's safe to use Unsound_IsClosed. The reason Unsound_IsClosed is marked "unsound" is that it reads state that's written on the I/O thread. However, assuming that someone called MessageChannel::Close() on the main thread already (which is what this assertion is trying to check), we should have already done a round-trip from the main thread to the I/O thread in order to close the channel. That happens like this:

MessageChannel::Close() ->
  MessageChannel::SynchronouslyClose() ->
    ProcessLink::SendClose() -> ...via PostMessage to I/O thread...
      ProcessLink::OnCloseChannel() ->
        Transport::Close() [this is the platform-specific version that sets closed_]
        Sets MessageChannel::mChannelState to ChannelClosed and notifies CV
    SynchronouslyClose also waits on the mChannelState becoming ChannelClosed

So the whole thing is race-free *assuming* that you've already called MessageChannel::Close() on the main thread. But that's what we're trying to assert anyway.
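The same round-trip as a self-contained sketch in plain C++ (std::thread standing in for the I/O thread; none of these names are the Gecko ones): the main thread blocks until the worker flips the shared state to Closed, which is why a later unlocked read of that state on the main thread is safe.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <thread>

    enum class ChannelState { Connected, Closed };

    // Stand-in for posting a task to the I/O thread.
    void PostToIOThread(std::function<void()> aTask) {
      std::thread(std::move(aTask)).detach();
    }

    struct Channel {
      std::mutex mMutex;
      std::condition_variable mCV;
      ChannelState mState = ChannelState::Connected;

      // Runs on the I/O thread (cf. ProcessLink::OnCloseChannel above).
      void OnCloseChannel() {
        std::lock_guard<std::mutex> lock(mMutex);
        mState = ChannelState::Closed;
        mCV.notify_all();
      }

      // Runs on the main thread (cf. MessageChannel::SynchronouslyClose).
      void Close() {
        PostToIOThread([this] { OnCloseChannel(); });
        std::unique_lock<std::mutex> lock(mMutex);
        mCV.wait(lock, [this] { return mState == ChannelState::Closed; });
      }
    };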

As to why this crash is happening, I'm afraid I don't know.
Flags: needinfo?(bill.mccloskey)
(In reply to Bill McCloskey [inactive unless it's an emergency] (:billm) from comment #19)
> (...)
> 
> So the whole thing is race-free *assuming* that you've already called
> MessageChannel::Close() on the main thread. But that's what we're trying to
> assert anyway.
> 
> As to why this crash is happening, I'm afraid I don't know.

Thank you for the clarifications! In this case we might instead be dealing with a channel error, timeout or similar. Unsound_IsClosed() checks if the channel is in the ChannelClosed state. But looking at MessageChannel::Close(), there are situations that could result in a different channel state.

I'm suggesting that we land this diagnostics patch to figure out what state the channel is in. This will show whether we crash due to one or several different states. We can then work backwards and figure out what might be causing the channel to be in this state when it is expected to be closed.
Assignee: nobody → spohl.mozilla.bugs
Status: NEW → ASSIGNED
Attachment #8954788 - Flags: review?(jmathies)
[Tracking Requested - why for this release]:
Comment on attachment 8954788 [details] [diff] [review]
Diagnostics patch for mChannelState

Review of attachment 8954788 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm. any bets on which we'll end up with?
Attachment #8954788 - Flags: review?(jmathies) → review+
(In reply to Jim Mathies [:jimm] from comment #22)
> Comment on attachment 8954788 [details] [diff] [review]
> Diagnostics patch for mChannelState
> 
> Review of attachment 8954788 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> lgtm. any bets on which we'll end up with?

I believe it's most likely that we'll see ChannelError, but no matter what the state turns out to be, we should be able to add more diagnostics once we know what we're dealing with.
https://hg.mozilla.org/integration/mozilla-inbound/rev/b05965477a5adc39e0d64f61d13c278e8cf9989a
Bug 1423261: Diagnostics patch to obtain more info about the IPC channel state when we expect it to be closed. r=jimm
Keywords: leave-open
Hopefully the diagnostics will help as 60 goes to beta.
We discovered that the crash reports in Socorro did not include the protocol name as an annotation, even though it was added by calling CrashReporter::AnnotateCrashReport just before MOZ_CRASH. It turns out that Socorro appears to filter annotations based on a schema[1], and `ProtocolName` was not in the schema. However, `IPCFatalErrorProtocol` is, and I have confirmed that crash reports for other crashes successfully submit the protocol name under `IPCFatalErrorProtocol`.

Crash reports after this patch should tell us which protocol(s) is/are causing problems here.

[1] https://github.com/mozilla-services/socorro/blob/master/socorro/schemas/crash_report.json
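A sketch of the intended change, assuming the standard CrashReporter::AnnotateCrashReport call shape; mName here is illustrative, only the annotation key is the point:

    // Sketch: annotate under the schema-approved key so Socorro keeps the
    // protocol name, then crash as before.
    CrashReporter::AnnotateCrashReport(
        NS_LITERAL_CSTRING("IPCFatalErrorProtocol"),
        nsDependentCString(mName));
    MOZ_CRASH("MessageChannel destroyed without being closed");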
Attachment #8958019 - Flags: review?(jmathies)
(In reply to Stephen A Pohl [:spohl] from comment #28)
> Created attachment 8958019 [details] [diff] [review]
> Fix the way we annotate crash reports
> 
> We discovered that the crash reports in Socorro did not include the protocol
> name as an annotation, even though it was added by calling
> CrashReporter::AnnotateCrashReport just before MOZ_CRASH. I've now
> discovered that Socorro appears to filter annotations based on a schema[1].
> `ProtocolName` was not in the schema. However, `IPCFatalErrorProtocol` is
> and I have confirmed that crash reports for other crashes successfully
> submit the protocol name under `IPCFatalErrorProtocol`.
> 
> Crash reports after this patch should tell us which protocol(s) is/are
> causing problems here.
> 
> [1]
> https://github.com/mozilla-services/socorro/blob/master/socorro/schemas/
> crash_report.json

We set these other values here - 

https://searchfox.org/mozilla-central/source/ipc/glue/ProtocolUtils.cpp#291

I think the issue is that ProtocolName isn't getting included in Super Search queries. We'd have to file a bug to get that added.
Bug 1257986 is similar.
Attachment #8958019 - Flags: review?(jmathies)
(In reply to Jim Mathies [:jimm] from comment #29)
> I think the issue is that ProtocolName isn't getting included in super
> search queries. We have to file a bug on getting that added.

There are several reasons I thought changing ProtocolName to IPCFatalErrorProtocol would be preferable over adding ProtocolName to super search:
1. Both "tags" (not sure if this is the right terminology) refer to the same thing: the name of the protocol in question during a fatal IPC error.
2. IPCFatalErrorProtocol has a specific name that more accurately conveys why it was added to the crash report: there was a fatal error with this particular protocol.
3. ProtocolName is a very generic name and even if we went the route of adding it to super search, I would suggest that we change it before adding it to super search to make it more specific.
4. If we were to change ProtocolName to a more specific name, we would end up with something that sounds a lot like IPCFatalErrorProtocol.
5. The current use of IPCFatalErrorProtocol that you mention is for the same purpose as here: adding the problematic protocol name to the crash report.
6. There are enough new crash reports coming in that we don't need to go back through existing crash reports and add ProtocolName to super search to get to the important data. Looking at IPCFatalErrorProtocol going forward would be sufficient.

Unless there is an implicit reason that I'm not aware of that says a specific crash annotation tag (IPCFatalErrorProtocol) cannot be used in two distinct places in our code base, I believe that switching to IPCFatalErrorProtocol gives us a more accurate tag name for this crash annotation going forward, and that we would see results come in faster than by adding ProtocolName to Super Search.
Flags: needinfo?(jmathies)
Flags: needinfo?(jmathies)
Attachment #8958019 - Flags: review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/4ea7d945741c1ffb4db49a4d6c0f61ec973eafce
Bug 1423261: Submit IPC protocol names in shutdown crash reports in a way that Socorro is able to display. r=jimm
Attached patch Add crash message annotation (obsolete) — Splinter Review
At a high level, shutdown works like this:
1. Parent tells child to shut down and sets mShutdownPending to true.
2. Child goes through shutdown sequence. At the end of the sequence, the child sends a FinishShutdown message to the parent.
3. The parent receives the FinishShutdown message, calls Close() on the message channel and completes shutdown.

This patch adds a crash annotation message to our crash reports for nightly builds. This will change the crash signature a bit because we will trigger this crash inside ContentParent, but it will allow us to gather more information about the state of the ContentParent. My best guess at the moment is that the ContentParent is in the `shutdown pending` state and fails to receive a FinishShutdown message from the child in a timely manner. Assuming that this is the case, we may be able to do away with this roundtrip during shutdown. The parent and the child both have a ForceKillTimer, and we may be able to rely on it to shut down the child should it not terminate in a timely manner. But we should confirm that we're in the `shutdown pending` state first.
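A sketch of the kind of annotation this patch adds; the exact field set, annotation key, and formatting code are assumptions rather than the attached patch:

    // Sketch: record the parent-side shutdown state so the crash report
    // shows whether we were still waiting on FinishShutdown when we crashed.
    nsAutoCString msg;
    msg.AppendPrintf("content-parent(mCalledClose=%d, mCalledKillHard=%d, "
                     "mShutdownPending=%d, mIPCOpen=%d)",
                     mCalledClose, mCalledKillHard, mShutdownPending, mIPCOpen);
    CrashReporter::AnnotateCrashReport(NS_LITERAL_CSTRING("IPCFatalErrorMsg"),
                                       msg);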
Attachment #8959578 - Flags: review?(jmathies)
Attached patch Add crash message annotation (obsolete) — Splinter Review
Uploaded the wrong patch. This is it instead.
Attachment #8959578 - Attachment is obsolete: true
Attachment #8959578 - Flags: review?(jmathies)
Attachment #8959581 - Flags: review?(jmathies)
Comment on attachment 8959581 [details] [diff] [review]
Add crash message annotation

Review of attachment 8959581 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm!
Attachment #8959581 - Flags: review?(jmathies) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/58ddcf890331b9ddbc81b9bf4e8e3e04f551996e
Bug 1423261: Add crash message annotations to our crash reports to help investigate shutdown crashes due to open message channels when they are expected to be closed. Nightly only. r=jimm
The first crash report with the latest diagnostics patch has just come in:
https://crash-stats.mozilla.com/report/index/63545839-471f-4910-ae37-cb34a0180320

The IPC Fatal Error Message is:
queued-ipc-messages/content-parent(Browser, pid=-1, open channel, 0x1fd186b8, refcnt=1, numQueuedMessages=140, remoteType=web, mCalledClose=false, mCalledKillHard=true, mShutdownPending=true, mIPCOpen=true)

This seems to confirm my previous suspicion that the parent is in the shutdown pending state, but fails to receive confirmation from the child that it has shut down in a timely manner. Also notice that KillHard has been called on the parent, which could be the reason why we see this crash signature.

Next, I will explore if the parent can request child shutdowns without waiting for those shutdowns to complete.
Crash Signature: [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent] → [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent] [@ mozilla::dom::ContentParent::~ContentParent]
There may be a simple explanation for why we might be getting into this state: when the child receives a shutdown message while inside a nested event loop, we delay shutdown by an arbitrary 100ms in the hope that the event loop will have finished by then. This is to prevent premature termination of "unload" or "pagehide" handlers, for example. However, there was no limit to how many times shutdown could be delayed by 100ms. If shutdown was delayed for too long, the ForceKillTimer would fire in the parent, resulting in the crash reported here.

This patch adds a limit to the number of times that a child can delay shutdown. The limit is equal to half the ForceKillTimer timeout to allow sufficient time for the child to send the FinishShutdown message back to the parent.

This patch should help us verify this theory, and if confirmed, fix the crash.
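A minimal sketch of the approach, assuming a budget field like mShutdownTimeout; the dispatch call and the nested-event-loop check are assumptions, the attached patch is authoritative:

    // Sketch: when RecvShutdown arrives inside a nested event loop, retry
    // shortly, but only while the remaining budget (roughly half the
    // KillHard timeout) lasts; after that, shut down anyway.
    static const int32_t kShutdownRetryDelayMs = 100;
    if (NestedEventLoopIsRunning() /* hypothetical check */ &&
        mShutdownTimeout > 0) {
      mShutdownTimeout -= kShutdownRetryDelayMs;
      MessageLoop::current()->PostDelayedTask(
        NewRunnableMethod("dom::ContentChild::RecvShutdown", this,
                          &ContentChild::ShutdownInternal),
        kShutdownRetryDelayMs);
      return;
    }
    // Budget exhausted or no nested loop: proceed with the shutdown
    // sequence and eventually send FinishShutdown to the parent.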
Attachment #8960940 - Flags: review?(jmathies)
Priority: P2 → P1
Comment on attachment 8960940 [details] [diff] [review]
Patch - set shutdown timeout in child processes

Review of attachment 8960940 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentChild.cpp
@@ +3025,5 @@
>        NewRunnableMethod(
>          "dom::ContentChild::RecvShutdown", this,
>          &ContentChild::ShutdownInternal),
> +      delay);
> +    mShutdownTimeout -= delay;

Before these changes, we might get caught up in this forever, right? With these changes and with a default value of 5 for dom.ipc.tabs.shutdownTimeoutSecs we'll wait up to 2.5 seconds. What's the KillHard timeout? 5 seconds?

::: dom/ipc/ContentChild.h
@@ +830,5 @@
>    // NOTE: This member is atomic because it can be accessed from off-main-thread.
>    mozilla::Atomic<uint32_t> mPendingInputEvents;
>  #endif
>  
> +  int32_t mShutdownTimeout;

nit - let's move this up near this line:

mozilla::Atomic<bool> mShuttingDown
Attachment #8960940 - Flags: review?(jmathies) → review+
Thank you! Addressed feedback, carrying over r+.

(In reply to Jim Mathies [:jimm] from comment #43)
> Comment on attachment 8960940 [details] [diff] [review]
> Patch - set shutdown timeout in child processes
> 
> Review of attachment 8960940 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: dom/ipc/ContentChild.cpp
> @@ +3025,5 @@
> >        NewRunnableMethod(
> >          "dom::ContentChild::RecvShutdown", this,
> >          &ContentChild::ShutdownInternal),
> > +      delay);
> > +    mShutdownTimeout -= delay;
> 
> Before these changes, we might get caught up in this forever right?

That's right, which may result in the parent killing the child via KillHard(). The crash in this bug occurs because we notice that the channel hasn't been closed when the ContentParent gets destroyed. KillHard() does not attempt to close the channel, and therefore it seems likely that this is the reason for these crashes.

> With these changes and with a default value of 5 for
> dom.ipc.tabs.shutdownTimeoutSecs we'll wait up to 2.5 seconds. What's the
> KillHard timeout? 5 seconds?

Right, the code doesn't make this particularly clear, but dom.ipc.tabs.shutdownTimeoutSecs actually *is* the KillHard timeout.
Attachment #8960940 - Attachment is obsolete: true
Attachment #8961206 - Flags: review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/ff30955a00d2f6cbba9288b3ba2b7b749a577628
Bug 1423261: Don't allow child processes to delay shutdown endlessly when a shutdown message is received from within a nested event loop. r=jimm
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent

mozilla::dom::ContentParent::~ContentParent 


Top Crashers for Firefox 61.0a1

26 	0.64% 	0.03% 	mozilla::dom::ContentParent::~ContentParent	33 	33 	0 	0 	32 	0 	2016-02-24

Top Crashers for Firefox 60.0b

6 	3.28% 	0.15% 	mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent	1181 	1181 	0 	0 	1171 	0 	2017-05-17

Top Crashers for Firefox 59.0.2

4 	2.11% 	1.08% 	mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent	2543 	2543 	0 	0 	2456 	0 	2017-05-17 
	
Top Crashers for Firefox 58.0.2

3 	2.33% 	-0.38% 	mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent	312 	312 	0 	0 	315 	0 	2017-05-17
Blocks: 1349699
Depends on: 1364193
See Also: → 1385252
No longer blocks: 1349699
Depends on: 1349699
Comment on attachment 8961206 [details] [diff] [review]
Patch - set shutdown timeout in child processes

This has been backed out (see comment 50). After further digging, it turns out that the shutdown delay due to nested event loops was introduced to help with content process shutdown hangs. Adding a timeout to this delay simply gets us back to the state before we had an explicit shutdown delay for nested event loops, i.e. a shutdown hang.
Attachment #8961206 - Attachment is obsolete: true
This patch should fix the crashes reported here by closing the IPC channel during KillHard shutdowns. We are already submitting both parent- and content process crash dumps during KillHard shutdowns and we will therefore continue to have the same data to investigate content process shutdown hangs.
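A minimal sketch of the idea, not the attached patch itself; GetIPCChannel() is the generic top-level protocol accessor, and its placement inside the KillHard path is an assumption:

    // Sketch: close the channel as part of the KillHard path so the later
    // destructor check no longer finds it open.
    GetIPCChannel()->Close();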
Attachment #8966345 - Flags: review?(jmathies)
Attachment #8966345 - Flags: review?(jmathies) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/f45d6ae3fdb194a07ef5eba9f9a760f28f564b64
Bug 1423261: Close the IPC channel during KillHard shutdowns to prevent shutdown crashes in the parent process. r=jimm
(In reply to Natalia Csoregi [:nataliaCs] from comment #55)
> https://hg.mozilla.org/mozilla-central/rev/f45d6ae3fdb1

This causes bug 1453252. The fix may be as simple as restricting the Close() call to the parent side, but I will need to investigate further when I get back next week. Will back out shortly.
Depends on: 1453252
Backed out changeset f45d6ae3fdb1 (bug 1423261) for causing bug 1453252.
Flags: needinfo?(spohl.mozilla.bugs)
Backout by shindli@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/88f297d206a6
Backed out changeset f45d6ae3fdb1 for causing bug 1453252. a=backout
The previous patch runs into issues when the IPC channel gets properly closed before cycle collection runs during a KillHard shutdown. So we've learned that the IPC channel may be in either a connected or a closed state at the time of cycle collection during a KillHard shutdown.

Since we now understand that the parent may still have a connected IPC channel during a KillHard shutdown, it makes sense to skip the intentional crash reported in this bug when KillHard has been called. We continue to collect crash dumps for hung content processes as before.
Attachment #8966345 - Attachment is obsolete: true
Flags: needinfo?(spohl.mozilla.bugs)
Attachment #8968272 - Flags: review?(jmathies)
Attachment #8968272 - Flags: review?(jmathies) → review+
I was a bit too quick in posting my previous patch. The patch applied to code that we added for diagnostics (attachment 8959581 [details] [diff] [review]) and which we should revert. I'm posting three patches instead, of which this is the first one:
1. Revert changeset 58ddcf890331 (attachment 8959581 [details] [diff] [review]) to get back to the previous crash signature.
2. Add a new field (mInKillHardShutdown) to MessageChannel to be able to tell when we're in a KillHard shutdown during cycle collection.
3. Restrict patch 2 above to Nightly only. This makes it easy to turn the fix on for all branches by backing this patch out.
This reverts changeset 58ddcf890331 (attachment 8959581 [details] [diff] [review]) which was added for diagnostic purposes. Marking as r+ since this is a straight backout.
Attachment #8968272 - Attachment is obsolete: true
Attachment #8968590 - Flags: review+
This adds a new field (mInKillHardShutdown) to MessageChannel to be able to tell when we're in a KillHard shutdown. This allows us to skip the intentional crash (this bug) during cycle collection.
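A sketch of how the flag would be consulted; the field name is from this patch, while the guard placement is an assumption:

    // Sketch: in MessageChannel::Clear(), skip the intentional crash when a
    // KillHard shutdown is in progress (the channel may legitimately still
    // be connected at this point).
    if (!Unsound_IsClosed() && !mInKillHardShutdown) {
      MOZ_CRASH("MessageChannel destroyed without being closed");
    }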
Attachment #8968591 - Flags: review?(jmathies)
Attached patch Restrict fix to nightly only (obsolete) — Splinter Review
This restricts the fix to nightly only for now. A simple backout of this patch will allow us to turn the fix on for all branches.
Attachment #8968592 - Flags: review?(jmathies)
Attachment #8968591 - Flags: review?(jmathies) → review+
Attachment #8968592 - Flags: review?(jmathies) → review+
This has fallen 23 spots in the nightly top crash list and there are no reported crashes after the 20180417225505 build. This looks ready to be turned on for all branches.
https://hg.mozilla.org/integration/mozilla-inbound/rev/aa51d5460d2815b8794ba244dcf279632c8fce22
Backout changeset c8842b205236 to turn on the fix for bug 1423261 on all branches. a=backout
Attachment #8938319 - Attachment is obsolete: true
Attachment #8959581 - Attachment is obsolete: true
Attachment #8968590 - Attachment is obsolete: true
Attachment #8968592 - Attachment is obsolete: true
Keywords: leave-open
I suggest we let this ride the train since this is a shutdown crash and doesn't have a real impact on users, other than seeing a crash dialog.
Sounds good to me, thanks Stephen.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Stephen: I see a few Mac signatures in crash stats, some 61/62: https://bit.ly/2rIEkwT. Is this another manifestation of the crash? Crash reason: MOZ_CRASH(MessageChannel destroyed without being closed (mChannelState == ChannelConnected).)
Flags: needinfo?(spohl.mozilla.bugs)
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #72)
> Stephen: I see a few Mac signatures in crash stats, some 61/62:
> https://bit.ly/2rIEkwT. Is this another manifestation of the crash? Crash
> reason: MOZ_CRASH(MessageChannel destroyed without being closed
> (mChannelState == ChannelConnected).)

It is possible that this is the child side of the same issue, i.e. we're in a KillHard shutdown when the child attempts to destroy the MessageChannel. This would need its own investigation and fix, however, so it would be best to track it in a separate bug.
Flags: needinfo?(spohl.mozilla.bugs)
Would this be an issue to consider fixing in 60esr?
Flags: needinfo?(ryanvm)
I'll redirect to Stephen on that.
Flags: needinfo?(ryanvm) → needinfo?(spohl.mozilla.bugs)
Comment on attachment 8968591 [details] [diff] [review]
Fix: Skip intentionally crashing the browser when KillHard has been called

Yes, it seems reasonable to consider this for ESR60.

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: Avoids showing crash dialogs when we terminate browser processes during excessively long shutdowns.
User impact if declined: If the browser takes too long to shut down we terminate the browser processes. Without this fix, the user is presented with a crash dialog.
Fix Landed on Version: 61
Risk to taking this patch (and alternatives if risky): Has been riding the trains with no fallout.
String or UUID changes made by this patch: none

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Flags: needinfo?(spohl.mozilla.bugs)
Attachment #8968591 - Flags: approval-mozilla-esr60?
Comment on attachment 8968591 [details] [diff] [review]
Fix: Skip intentionally crashing the browser when KillHard has been called

Fixes a topcrash on ESR60 and has been shipping with Fx61 without any known problems. Approved for ESR 60.2.
Attachment #8968591 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+