Closed Bug 1227312 Opened 4 years ago Closed 4 years ago

Crash report telemetry data is off for content process crashes

Categories

(Core :: DOM: Content Processes, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
e10s m8+ ---
firefox45 --- fixed
firefox46 --- fixed

People

(Reporter: benjamin, Assigned: jimm)

References

Details

Attachments

(1 file, 2 obsolete files)

Found while looking at http://bsmedberg.github.io/telemetry-dashboard/new-pipeline/crash-summary.html

There are two primary histograms for child process stability:
* SUBPROCESS_ABNORMAL_ABORT is supposed to record when a subprocess dies unexpectedly (ActorDestroy with reason of AbnormalAbort)
* SUBPROCESS_CRASHES_WITH_DUMP is supposed to record how many of these aborts are caught in our crash reporter.

I assert that SUBPROCESS_CRASHES_WITH_DUMP should always be a lower number than SUBPROCESS_ABNORMAL_ABORT. But in fact for content subprocesses, this isn't true. I'm seeing the SUBPROCESS_CRASHES_WITH_DUMP totals be almost 200% that of SUBPROCESS_ABNORMAL_ABORT. I'd expect this to be 70-90% based on plugins and prior experience.

This needs investigation.
I tested a few things:
- Killing plugin-container.exe increments SUBPROCESS_ABNORMAL_ABORT
- Crashing plugin-container.exe increments SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WITH_DUMP
- Logging out of Windows while Firefox is running does not alter SUBPROCESS_ABNORMAL_ABORT/SUBPROCESS_CRASHES_WITH_DUMP

With these results, it seems quite odd that SUBPROCESS_CRASHES_WITH_DUMP could be higher than SUBPROCESS_ABNORMAL_ABORT. I'll dig deeper and see what I can find.
Assignee: nobody → mconley
Flags: needinfo?(mconley)
Looking into this now.
Status: NEW → ASSIGNED
Flags: needinfo?(mconley)
Attached file My telemetry ping
Apparently, I've hit something like this myself. Here's a Telemetry ping my profile sent where SUBPROCESS_CRASHES_WITH_DUMP had 2 entries (both "0", strangely...), but no data for SUBPROCESS_ABNORMAL_ABORT.
Interestingly, returning false from a message handler in the parent results in us adding two counts to SUBPROCESS_CRASHES_WITH_DUMP, and 1 count to SUBPROCESS_ABNORMAL_ABORT. I suspect this is where the 200% is coming from.

Looking into the "why" now.
The problem appears to be in CrashReporterParent::GenerateCompleteMinidump[1] - we're calling CrashReporterParent::FinalizeChildData twice in this case.

Once when we call GenerateChildData (since GenerateChildData calls FinalizeChildData) itself, and then again immediately after calling GenerateChildData.

FinalizeChildData calls CrashReporterParent::NotifyCrashService, which increments the Telemetry probe.

So I think that's what's causing the discrepancy here.

[1]: https://dxr.mozilla.org/mozilla-central/rev/974fe614d5299159dc16d98d97d76af653158d29/dom/ipc/CrashReporterParent.h#275
Hey jimm, the blame says that you added these FinalizeChildData's. Which one do we want to remove, do you know? Or do we want to call it twice for some reason that I don't know?
Flags: needinfo?(jmathies)
hmm, not sure we can remove one, I have to think through the different threading scenarios here. I can take this.
Assignee: mconley → jmathies
Flags: needinfo?(jmathies)
Summary: Investigate: many more content process crashes than aborts should not be possible. → Crash report telemetry data is off for content process crashes
I'm 95% certain my finding in comment 5 is what's causing this telemetry discrepancy.

If that's true, then the worst-case scenario (we're somehow missing some crash data), doesn't seem to be the case - we're just overcounting the number of crash reports we've gathered.

So this should definitely be fixed, but because it's not the worst-case scenario, I'm not 100% certain it should be an M8. Going to re-nom for discussion.
One of the e10s criteria is no worse crash rate, and the SUBPROCESS_CRASHES_WITH_DUMP histogram is our primary metric. If we're sometimes double-counting, our crash rate is going to appear much worse than it is. Also this is deflating the crash-submission rates.

I think this should be an M8 blocker.
Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?mconley
Attachment #8694783 - Flags: review?(mconley)
Comment on attachment 8694783 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/26873/diff/1-2/
Attachment #8694783 - Attachment description: MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?mconley → MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted
Attachment #8694783 - Flags: review?(mconley) → review?(ted)
With these patches applied I tested the following crashes:

1) crashing plugin
2) crashing content process
3) plugin hang termination
4) abort handler in chrome

each upped both probes by 1.
Attachment #8694783 - Attachment is obsolete: true
Attachment #8694783 - Flags: review?(ted)
Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted
Comment on attachment 8695320 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted

Unsure why the reviewer wasn't set on this. jimm, I'd be interested in knowing what your interactions with MozReview were that put you in this state.
Attachment #8695320 - Flags: review?(ted)
Comment on attachment 8695320 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted

https://reviewboard.mozilla.org/r/27067/#review26025

I'm going to hit a point where I don't understand how any of this IPC crashreporting code works, but this looks reasonable reading through it and reading mconley's description of the bug.
Attachment #8695320 - Flags: review?(ted) → review+
This bug doesn't affect the accuracy of SUBPROCESS_ABNORMAL_ABORT, right?
Keywords: checkin-needed
(In reply to (PTO Jan 4-10) Vladan Djeric (:vladan) -- please needinfo from comment #19)
> This bug doesn't affect the accuracy of SUBPROCESS_ABNORMAL_ABORT, right?

no it doesn't.
https://hg.mozilla.org/mozilla-central/rev/d6705099c749
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Comment on attachment 8695320 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted

Approval Request Comment
[Feature/regressing bug #]:
e10s related, telemetry probe issue
[User impact if declined]:
no user impact, this bug has to do with inaccurate telemetry probe data related to content process crash rates.
[Describe test coverage new/current, TreeHerder]:
on mc
[Risks and why]:
messes with crash reporting code. we should let this bake on mc for a few days. 
[String/UUID change made/needed]:
none
Attachment #8695320 - Flags: approval-mozilla-aurora?
Comment on attachment 8695320 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted

Baked for 6 days, I guess this is enough. Taking it.
Attachment #8695320 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8695320 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.