Collect more telemetry on why content processes crash

RESOLVED FIXED in Firefox 47

Status

()

defect
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: billm, Assigned: billm)

Tracking

(Blocks 1 bug)

unspecified
mozilla49
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10s+, firefox47 fixed, firefox48 fixed, firefox49 fixed)

Details

Attachments

(2 attachments, 2 obsolete attachments)

I'd like to collect more data on why content processes are crashing. We get a lot of information in the crash dump, but sometimes we don't get crash dumps.
I just happened to run into this problem. ContentCrashHandlers.jsm calls this.pageListener.sendAsyncMessage while receiving a RemotePage:Unload message. In some cases, this call will try to send a message (via the message manager) to the <browser> that is closing, which will throw. (Specifically this happens when the Unload message comes through the "message-manager-disconnect" observer.)

I think it's safe to switch the order here so that we don't send to a <browser> that's being torn down.

Mike, I'd ask Mossop for review, but he's not taking review requests.
Attachment #8748457 - Flags: review?(mconley)
Posted patch telemetry patch (obsolete) — Splinter Review
This patch tries to collect more telemetry about why we're crashing as well as why we are or are not submitting a crash report.

The main thing I'm interested in is the KILL_HARD stuff. I suspect there may be a lot of ShutdownKill crashes that don't show up in crashstats (because the browser is shutting down so we have no way to present a crash UI) but they're inflating the telemetry numbers. (I verified that we still collect telemetry at the ShutDownKill stage of shutdown.) This measurement will tell us how many of these there are.

I also changed the crash annotation for KillHard. I'm not sure why it was called ipc_channel_error--that has a very different meaning. However, I'm a little worried that this will mess up some of your crash-stats queries, Jim. I can remove this change, but I think it's nice to have.

The submission measurements admittedly a little iffy. I think there might be ways where the user can navigate away from about:tabcrashed without triggering the unload handler, so we might miss some NOT_SUBMITTED counts. We'll also do the wrong thing if the user has multiple content processes that crash around the same time (we'll only count it as presenting one crash report rather than two). But I think this will still be helpful.
Attachment #8748459 - Flags: review?(jmathies)
Comment on attachment 8748459 [details] [diff] [review]
telemetry patch

Review of attachment 8748459 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/modules/ContentCrashHandlers.jsm
@@ +297,5 @@
> +
> +    // Make sure to only count once even if there are multiple windows
> +    // that will all show about:tabcrashed.
> +    if (this._crashedTabCount == 0 && childID) {
> +      Services.telemetry.getHistogramById("FX_CONTENT_CRASH_NOT_SUBMITTED").add(1);

You should get mike to review these changes, I know very little about the crashed tab page behavior.

::: dom/ipc/ContentParent.cpp
@@ +3575,3 @@
>      nsDependentCString reason(aReason);
>      crashReporter->AnnotateCrashReport(
> +      NS_LITERAL_CSTRING("kill_hard_reason"),

Lets not change this. It'll break all sorts of search queries that are spread around plus various query scripts. We're just creating more work here.

@@ +3579,5 @@
>  
>      // Generate the report and insert into the queue for submittal.
>      mCreatedPairedMinidumps = crashReporter->GenerateCompleteMinidump(this);
> +
> +    Telemetry::Accumulate(Telemetry::SUBPROCESS_KILL_HARD, reason, 1);

This includes ipc error aborts right, so it's not going to give us visibility into shutdown only kills. I think this might be useful though so you might as well keep it in here.

What we need is to be able to subtract shutdown killhards from SUBPROCESS_CRASHES_WITH_DUMP [1]. I'd suggest we add a new probe in CrashReporterParent that gets set only when the content process is in shutdown.  ContentParent has an observer we can use to detect shutdown state.

[1] http://mxr.mozilla.org/mozilla-central/source/dom/ipc/CrashReporterParent.cpp#231

::: dom/ipc/ContentParent.h
@@ -352,5 @@
> -  const nsCString& GetKillHardAnnotation() const { return mKillHardAnnotation; }
> -
> -  void SetKillHardAnnotation(const nsACString& aReason)
> -  {
> -    mKillHardAnnotation = aReason;

:( never put into use. Oh well.
Attachment #8748459 - Flags: review?(jmathies) → review-
Blocks: e10s-crashes
tracking-e10s: --- → +
Attachment #8748457 - Flags: review?(mconley) → review+
We do need to keep the current ipc_channel_error annotation just because it's baked into crash-stats and changing that is a bit painful.

Are we sure that the killhard reason can never contain identifying data? That is a requirement for using it as a telemetry key. If there is any question about this, I suggest we take the simpler patch which I'm going to upload now, which simply records the shutdownkill count directly.
Assignee: wmccloskey → benjamin
Status: NEW → ASSIGNED
Attachment #8748645 - Flags: review?(wmccloskey)
Comment on attachment 8748459 [details] [diff] [review]
telemetry patch

Review of attachment 8748459 [details] [diff] [review]:
-----------------------------------------------------------------

The ContentCrashHandlers.jsm changes look sane to me.
Attachment #8748459 - Flags: review+
New patch.
Assignee: benjamin → wmccloskey
Attachment #8748698 - Flags: review?(jmathies)
(In reply to Jim Mathies [:jimm] from comment #3)
> > +    Telemetry::Accumulate(Telemetry::SUBPROCESS_KILL_HARD, reason, 1);
> 
> This includes ipc error aborts right, so it's not going to give us
> visibility into shutdown only kills. I think this might be useful though so
> you might as well keep it in here.

Note that it's a keyed histogram, so the ShutDownKill aborts will be counted separately from the ipc aborts.

However, in the interests of getting this in quickly, I've changed it to record only ShutDownKill aborts.

> What we need is to be able to subtract shutdown killhards from
> SUBPROCESS_CRASHES_WITH_DUMP [1]. I'd suggest we add a new probe in
> CrashReporterParent that gets set only when the content process is in
> shutdown.  ContentParent has an observer we can use to detect shutdown state.

That would give us no data at all about how many ShutDownKill aborts there are. ShutDownKill aborts aren't as bad as other crashes, but they're not great. We lose telemetry, for example. So I'd still like to count them.

> Are we sure that the killhard reason can never contain identifying data?

Yes, I'm confident that it doesn't include any identifying data. It's basically all constants.

However, I changed the patch to only record the ShutDownKill reason just to make it easier to review.
Comment on attachment 8748645 [details] [diff] [review]
alternative telemetry patch; record the number of times we kill content processes because they hit the shutdown timeout,

Review of attachment 8748645 [details] [diff] [review]:
-----------------------------------------------------------------

As Jim mentioned, this will conflate ShutDownKill aborts with other IPC aborts, so we won't get the data we need.
Attachment #8748645 - Flags: review?(wmccloskey) → review-
Comment on attachment 8748645 [details] [diff] [review]
alternative telemetry patch; record the number of times we kill content processes because they hit the shutdown timeout,

Review of attachment 8748645 [details] [diff] [review]:
-----------------------------------------------------------------

Oh, I see, you moved it to ShutDownKill. My patch is just a little nicer since it doesn't record if we already did a ShutDownKill (avoids double counting).
Attachment #8748645 - Flags: review-
Comment on attachment 8748698 [details] [diff] [review]
telemetry patch, v2

from a discussion on irc, r+ with the suggested change of going back to 

Telemetry::Accumulate(Telemetry::SUBPROCESS_KILL_HARD, reason, 1);
Attachment #8748698 - Flags: review?(jmathies) → review+
Benjamin, are you okay to include the reason? It's either a string literal (like "ShutDownKill") or an IPC error, which includes a message ID (compile-time constant) and message name and an error code. You can see that generated here:
http://searchfox.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#2020
(Note that the type comes from an enum generated by IPDL, so it's not an address or anything.)
Flags: needinfo?(benjamin)
Yes, I'm ok including that; please add a comment to the KillHard function explaining that the reason must be constant or get a data review.
Flags: needinfo?(benjamin)
The ipc message errors will change on nightly quite a bit but on release they should be constant. The counts may be quite useful in gauging how often these occur. FWIW I've stopped running my python scripts to pull these out of crash-stats on a daily basis (expiring access tokens are a pain), and the socorro bugs to add support for reporting these are totally dormant. So I think having this data someplace easily displayable will be good.
Attachment #8748645 - Attachment is obsolete: true
Attachment #8748459 - Attachment is obsolete: true
Comment on attachment 8748457 [details] [diff] [review]
RemotePageManager patch

Requesting approval so we can get more telemetry about e10s crash volume for the beta experiment.

Approval Request Comment
[Feature/regressing bug #]: none
[User impact if declined]: none
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: Very low risk; just telemetry.
[String/UUID change made/needed]: none
Attachment #8748457 - Flags: approval-mozilla-beta?
Attachment #8748457 - Flags: approval-mozilla-aurora?
Comment on attachment 8748698 [details] [diff] [review]
telemetry patch, v2

Requesting approval so we can get more telemetry about e10s crash volume for the beta experiment.

Approval Request Comment
[Feature/regressing bug #]: none
[User impact if declined]: none
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: Very low risk; just telemetry.
[String/UUID change made/needed]: none
Attachment #8748698 - Flags: approval-mozilla-beta?
Attachment #8748698 - Flags: approval-mozilla-aurora?
Ritu, we'd like to get these telemetry probes in beta build 3 if possible.
Flags: needinfo?(rkothari)
Comment on attachment 8748698 [details] [diff] [review]
telemetry patch, v2

More telemetry data to help improve e10s stability, Aurora48+, Beta47+
Flags: needinfo?(rkothari)
Attachment #8748698 - Flags: approval-mozilla-beta?
Attachment #8748698 - Flags: approval-mozilla-beta+
Attachment #8748698 - Flags: approval-mozilla-aurora?
Attachment #8748698 - Flags: approval-mozilla-aurora+
Attachment #8748457 - Flags: approval-mozilla-beta?
Attachment #8748457 - Flags: approval-mozilla-beta+
Attachment #8748457 - Flags: approval-mozilla-aurora?
Attachment #8748457 - Flags: approval-mozilla-aurora+
Blocks: 1270547
https://hg.mozilla.org/mozilla-central/rev/e930d26502a6
https://hg.mozilla.org/mozilla-central/rev/3a9efc13a56c
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Messed up while resolving the merge conflict on the uplift to beta. Fixed in https://hg.mozilla.org/releases/mozilla-beta/rev/609000bcc112
Blocks: 1271336
You need to log in before you can comment on or make changes to this bug.