Closed
Bug 1227312
Opened 9 years ago
Closed 9 years ago
Crash report telemetry data is off for content process crashes
Categories
(Core :: DOM: Content Processes, defect)
Core
DOM: Content Processes
Tracking
()
RESOLVED
FIXED
mozilla46
People
(Reporter: benjamin, Assigned: jimm)
References
Details
Attachments
(1 file, 2 obsolete files)
539.70 KB,
text/plain
|
Details |
Found while looking at http://bsmedberg.github.io/telemetry-dashboard/new-pipeline/crash-summary.html
There are two primary histograms for child process stability:
* SUBPROCESS_ABNORMAL_ABORT is supposed to record when a subprocess dies unexpectedly (ActorDestroy with reason of AbnormalAbort)
* SUBPROCESS_CRASHES_WITH_DUMP is supposed to record how many of these aborts are caught in our crash reporter.
I assert that SUBPROCESS_CRASHES_WITH_DUMP should always be a lower number than SUBPROCESS_ABNORMAL_ABORT. But in fact for content subprocesses, this isn't true. I'm seeing the SUBPROCESS_CRASHES_WITH_DUMP totals be almost 200% that of SUBPROCESS_ABNORMAL_ABORT. I'd expect this to be 70-90% based on plugins and prior experience.
This needs investigation.
Updated•9 years ago
|
tracking-e10s:
--- → ?
Updated•9 years ago
|
Blocks: e10s-measurement
Comment 1•9 years ago
|
||
I tested a few things:
- Killing plugin-container.exe increments SUBPROCESS_ABNORMAL_ABORT
- Crashing plugin-container.exe increments SUBPROCESS_ABNORMAL_ABORT and SUBPROCESS_CRASHES_WITH_DUMP
- Logging out of Windows while Firefox is running does not alter SUBPROCESS_ABNORMAL_ABORT/SUBPROCESS_CRASHES_WITH_DUMP
With these results, it seems quite odd that SUBPROCESS_CRASHES_WITH_DUMP could be higher than SUBPROCESS_ABNORMAL_ABORT. I'll dig deeper and see what I can find.
Updated•9 years ago
|
Comment 3•9 years ago
|
||
Apparently, I've hit something like this myself. Here's a Telemetry ping my profile sent where SUBPROCESS_CRASHES_WITH_DUMP had 2 entries (both "0", strangely...), but no data for SUBPROCESS_ABNORMAL_ABORT.
Comment 4•9 years ago
|
||
Interestingly, returning false from a message handler in the parent results in us adding two counts to SUBPROCESS_CRASHES_WITH_DUMP, and 1 count to SUBPROCESS_ABNORMAL_ABORT. I suspect this is where the 200% is coming from.
Looking into the "why" now.
Comment 5•9 years ago
|
||
The problem appears to be in CrashReporterParent::GenerateCompleteMinidump[1] - we're calling CrashReporterParent::FinalizeChildData twice in this case.
Once when we call GenerateChildData (since GenerateChildData calls FinalizeChildData) itself, and then again immediately after calling GenerateChildData.
FinalizeChildData calls CrashReporterParent::NotifyCrashService, which increments the Telemetry probe.
So I think that's what's causing the discrepancy here.
[1]: https://dxr.mozilla.org/mozilla-central/rev/974fe614d5299159dc16d98d97d76af653158d29/dom/ipc/CrashReporterParent.h#275
Comment 6•9 years ago
|
||
Hey jimm, the blame says that you added these FinalizeChildData's. Which one do we want to remove, do you know? Or do we want to call it twice for some reason that I don't know?
Flags: needinfo?(jmathies)
Comment 7•9 years ago
|
||
The FinalizeChildData's were added in https://hg.mozilla.org/mozilla-central/rev/ce6dbefadccd bug 1153205.
Assignee | ||
Comment 8•9 years ago
|
||
hmm, not sure we can remove one, I have to think through the different threading scenarios here. I can take this.
Assignee: mconley → jmathies
Flags: needinfo?(jmathies)
Assignee | ||
Updated•9 years ago
|
Summary: Investigate: many more content process crashes than aborts should not be possible. → Crash report telemetry data is off for content process crashes
Comment 9•9 years ago
|
||
I'm 95% certain my finding in comment 5 is what's causing this telemetry discrepancy.
If that's true, then the worst-case scenario (we're somehow missing some crash data), doesn't seem to be the case - we're just overcounting the number of crash reports we've gathered.
So this should definitely be fixed, but because it's not the worst-case scenario, I'm not 100% certain it should be an M8. Going to re-nom for discussion.
Reporter | ||
Comment 10•9 years ago
|
||
One of the e10s criteria is no worse crash rate, and the SUBPROCESS_CRASHES_WITH_DUMP histogram is our primary metric. If we're sometimes double-counting, our crash rate is going to appear much worse than it is. Also this is deflating the crash-submission rates.
I think this should be an M8 blocker.
Assignee | ||
Comment 12•9 years ago
|
||
Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?mconley
Attachment #8694783 -
Flags: review?(mconley)
Assignee | ||
Comment 13•9 years ago
|
||
Comment on attachment 8694783 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/26873/diff/1-2/
Attachment #8694783 -
Attachment description: MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?mconley → MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted
Assignee | ||
Updated•9 years ago
|
Attachment #8694783 -
Flags: review?(mconley) → review?(ted)
Assignee | ||
Comment 14•9 years ago
|
||
With these patches applied I tested the following crashes:
1) crashing plugin
2) crashing content process
3) plugin hang termination
4) abort handler in chrome
each upped both probes by 1.
Assignee | ||
Updated•9 years ago
|
Attachment #8694783 -
Attachment is obsolete: true
Attachment #8694783 -
Flags: review?(ted)
Assignee | ||
Comment 15•9 years ago
|
||
Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted
Comment 16•9 years ago
|
||
Comment on attachment 8695320 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted
Unsure why the reviewer wasn't set on this. jimm, I'd be interested in knowing what your interactions with MozReview were that put you in this state.
Attachment #8695320 -
Flags: review?(ted)
Comment 17•9 years ago
|
||
Gentle ping
Comment 18•9 years ago
|
||
Comment on attachment 8695320 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted
https://reviewboard.mozilla.org/r/27067/#review26025
I'm going to hit a point where I don't understand how any of this IPC crashreporting code works, but this looks reasonable reading through it and reading mconley's description of the bug.
Attachment #8695320 -
Flags: review?(ted) → review+
Comment 19•9 years ago
|
||
This bug doesn't affect the accuracy of SUBPROCESS_ABNORMAL_ABORT, right?
Assignee | ||
Updated•9 years ago
|
Keywords: checkin-needed
Assignee | ||
Comment 20•9 years ago
|
||
(In reply to (PTO Jan 4-10) Vladan Djeric (:vladan) -- please needinfo from comment #19)
> This bug doesn't affect the accuracy of SUBPROCESS_ABNORMAL_ABORT, right?
no it doesn't.
Comment 21•9 years ago
|
||
Keywords: checkin-needed
Comment 22•9 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
status-firefox46:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Assignee | ||
Comment 23•9 years ago
|
||
Comment on attachment 8695320 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted
Approval Request Comment
[Feature/regressing bug #]:
e10s related, telemetry probe issue
[User impact if declined]:
no user impact, this bug has to do with inaccurate telemetry probe data related to content process crash rates.
[Describe test coverage new/current, TreeHerder]:
on mc
[Risks and why]:
messes with crash reporting code. we should let this bake on mc for a few days.
[String/UUID change made/needed]:
none
Attachment #8695320 -
Flags: approval-mozilla-aurora?
Updated•9 years ago
|
status-firefox45:
--- → affected
Comment 24•9 years ago
|
||
Comment on attachment 8695320 [details]
MozReview Request: Bug 1227312 - Avoid calling FinalizeChildData twice in GenerateCompleteMinidump. r?ted
Baked for 6 days, I guess this is enough. Taking it.
Attachment #8695320 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Assignee | ||
Comment 25•9 years ago
|
||
(In reply to Carsten Book [:Tomcat] from comment #22)
> https://hg.mozilla.org/mozilla-central/rev/d6705099c749
The results here look good, SUBPROCESS_CRASHES_WITH_DUMP has dropped off by about half while SUBPROCESS_ABNORMAL_ABORT remained unchanged.
https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median!mean&cumulative=0&e10s=true&end_date=null&keys=content&max_channel_version=nightly%252F46&measure=SUBPROCESS_CRASHES_WITH_DUMP&min_channel_version=nightly%252F45&product=Firefox&sanitize=1&sort_keys=submissions&start_date=null&trim=1&use_submission_date=0
https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median!mean&cumulative=0&e10s=true&end_date=null&keys=content&max_channel_version=nightly%252F46&measure=SUBPROCESS_ABNORMAL_ABORT&min_channel_version=nightly%252F45&product=Firefox&sanitize=1&sort_keys=submissions&start_date=null&trim=1&use_submission_date=0
Comment 26•9 years ago
|
||
bugherder uplift |
Assignee | ||
Updated•9 years ago
|
Attachment #8695320 -
Attachment is obsolete: true
You need to log in
before you can comment on or make changes to this bug.
Description
•