split crash reports into crashes, shutdown hangs, and content-process hangs
Categories
(Socorro :: General, task, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
Attachments
(1 file)
Socorro collects "crash reports" for a variety of projects. However, the crash report mechanism (generating a report, what it contains, and getting it to Mozilla so we can look into it) is a convenient way to get data about problems with processes, so not all crash reports are actually crashes.
Currently, we're getting:
- crash reports -- a process has crashed, the client generated a minidump, and it gets sent along
- shutdownhangs -- a process hung and another process generated a minidump and sent it along
- content-process hangs -- I'm not sure what this is, but Gabriele said it's a thing
- warnings -- at one point, Fenix was sending warnings which weren't crashes, but were things they were keeping tabs on; I think we're not getting these anymore
This becomes problematic for analyzing and investigating the crash report data because everything is jumbled up together.
For example, the TopCrashers report for Firefox is full of ShutDownKill signatures. Shutdown hangs are a problem, but it's unhelpful to have them overwhelming the TopCrashers report and hiding crash issues. (bug #1624946)
This tracker bug covers looking into the problem, figuring out a plan, shopping it around to stakeholders, and then executing on it.
Assignee
Comment 1•4 years ago
Making this a P3 for now. I don't know when I'll get to it.
cc:ing Gabriele because we were talking about this recently.
Some thoughts:
- maybe we set up a processor rule that groups the reports into report types: crash, shutdownhang, etc
- maybe we adjust the TopCrashers report to filter on report type--it currently filters on platform and process type
- maybe we default everything to look at crash report type and you have to explicitly choose other types
- maybe we put the type in the crash signature
Comment 2•4 years ago
I'll write some super-search queries to show what I hope will be the end result and post them here as examples.
Assignee
Comment 3•4 years ago
I have some notes on this and I want to capture them in the comments rather than have notes in multiple places.
We should add a processor rule that adds a "report_type" field (or some similar name) that specifies the report type. It's flexible so if we get it wrong, we can tweak things and reprocess.
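A minimal sketch of what such a rule could look like, in Python. The function name, the `signature` key, and the marker strings are assumptions for illustration; this isn't Socorro's actual rule interface or detection logic.

```python
# Hypothetical sketch: not Socorro's real processor rule API.
# Assumption: hang reports can be recognized by markers in the signature.
HANG_MARKERS = ("shutdownhang", "shutdownkill")


def add_report_type(processed_crash: dict) -> dict:
    """Set processed_crash["report_type"] to "hang" or "crash"."""
    signature = (processed_crash.get("signature") or "").lower()
    if any(marker in signature for marker in HANG_MARKERS):
        processed_crash["report_type"] = "hang"
    else:
        # Anything that isn't recognized as a hang is labeled "crash".
        processed_crash["report_type"] = "crash"
    return processed_crash
```

Because the field is derived from data already in the report, tweaking the markers and reprocessing would recompute it.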
Assignee
Comment 4•4 years ago
Bumping this up to P2 because this could help a bunch.
Comment 5•2 years ago
A quick refresh of how we'd like to categorize crashes (or rather reports, given that not all reports are crashes):
- Regular crashes
- Hangs, which would cover:
  a. Browser shutdown hangs (example query)
  b. Content process shutdown hangs (example query)
  c. Non-shutdown hangs we'll explicitly flag (see bug 1826703)
  d. Content-process hangs, which I don't remember if/how we deal with right now
Assignee
Comment 6•1 year ago
I want to add a report_type field that will take keywords. We'll start it off with:
- hang -- different kinds of hangs
- crash -- regular crashes and anything that didn't get categorized as something else; i.e. it'll default to "crash"
We'll make this a public field. We'll index it as a keyword so it's available in search and aggregations.
Gabriele: Does that work for you? I can't tell from comment #5 if you wanted to further break it down by different hang types or not.
Comment 7•1 year ago
Yes, it's fine. There will be different types of hangs, but we don't need to tell them apart at that level; we can always facet on the hang type later (like we do with the crash reason for crashes). What we care about is that users can clearly tell hangs apart from crashes.
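As a rough illustration of the two levels, here is a small Python sketch that counts reports by report_type first and then facets only the hangs. The sample records and the `hang_type` field name are made up for the example; they aren't real Socorro fields or data.

```python
from collections import Counter

# Hypothetical records; "hang_type" is an assumed field name for illustration.
reports = [
    {"report_type": "crash"},
    {"report_type": "hang", "hang_type": "shutdown"},
    {"report_type": "hang", "hang_type": "content"},
    {"report_type": "crash"},
]

# Top level: tell hangs apart from crashes.
by_report_type = Counter(r["report_type"] for r in reports)

# Later, facet only the hangs by their kind.
by_hang_type = Counter(
    r["hang_type"] for r in reports if r["report_type"] == "hang"
)
```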
Assignee
Comment 8•1 year ago
I've finished up most of the changes we need to do here.
The only thing left is to figure out how to transition from the old system (no report_type field indexed) to the new system (report_type field indexed). I think we have to treat all crash reports with no report_type as a "crash". That means the TopCrashers report for report type "crash" will still include hangs depending on whether old data is in the range.
If we think this is going to be confusing to everyone, the alternative is to land the processor and indexing changes now, let data with a report_type accumulate, and then land the TopCrashers changes after we have 4 weeks of data--TopCrashers has Days values of 1, 7, 14, 28.
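To make the first option concrete, here is a minimal Python sketch of treating reports with no report_type as "crash" when filtering. The helper names and sample signatures are hypothetical; this is not the TopCrashers code.

```python
def effective_report_type(report: dict) -> str:
    # Old reports have no report_type; treat them as "crash".
    return report.get("report_type") or "crash"


def filter_by_report_type(reports: list, wanted: str) -> list:
    return [r for r in reports if effective_report_type(r) == wanted]


reports = [
    {"signature": "OOM | small", "report_type": "crash"},
    {"signature": "shutdownhang | example", "report_type": "hang"},
    # Old data: a hang that was processed before the field existed.
    {"signature": "shutdownhang | old example"},
]

crashes = filter_by_report_type(reports, "crash")
hangs = filter_by_report_type(reports, "hang")
```

In this sketch the old hang ends up in `crashes`, which is the confusing overlap described above.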
I'll finish this up the week of July 17th.
Assignee
Comment 9•1 year ago
Assignee
Comment 10•1 year ago
This needs testing on stage. I'm pretty sure I got the handling for old data in the Top Crashers report correct.
Further, before it gets deployed to production, I need to send an email to the stability, crash-reporting-wg, and firefox-dev mailing lists about the changes to the Top Crashers report and how the old data is handled.
Assignee
Comment 11•1 year ago
Assignee
Comment 12•1 year ago
That will auto-deploy to the staging environment.
I need to verify the following on stage:
- does the topcrashers report work when crash reports don't have a report_type field (it's not in the index mapping)?
- (next week) does the topcrashers report work when some crash reports have a report_type field (index has the field) and some don't (last week's index doesn't have the field)?
Assignee
Comment 13•1 year ago
I looked at stage and because there's no report_type data and the Top Crashers report defaults to report_type="crash", the Top Crashers report is empty. That will be terrifying to anyone looking at it for the first few weeks after it goes to production.
I'm going to change the default to report_type="any". After this goes to production, we can write up a bug for changing the default at some point in the future. Maybe in a month.
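A small Python sketch of why the default matters, assuming a strict match on the field (which is consistent with what stage showed). `select_reports` and the sample data are hypothetical, not the actual Top Crashers code.

```python
def select_reports(reports: list, report_type: str = "any") -> list:
    # "any" applies no report_type filter at all.
    if report_type == "any":
        return list(reports)
    # Strict match: reports without the field never match "crash" or "hang".
    return [r for r in reports if r.get("report_type") == report_type]


# Data like what's on stage right now: no report_type on anything.
stage_reports = [{"signature": "a"}, {"signature": "b"}]

with_crash_default = select_reports(stage_reports, "crash")
with_any_default = select_reports(stage_reports)
```

With no report_type data, the "crash" filter selects nothing, while "any" returns everything.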
Assignee
Comment 14•1 year ago
Assignee
Comment 15•1 year ago
I checked stage this morning and the filter appears to be working as expected.
I sent an email to the stability and crash-reporting-wg mailing lists about the upcoming change.
Assignee
Comment 16•1 year ago
All the changes so far were deployed to prod just now in bug #1843869.
There won't be any report_type data in the search index until a new index is created this weekend, so the "crash" and "hang" filters won't have any results until next week.
I'll keep this open until next week after I verify everything is working as expected and write up a bug to change the report_type default from "any" to "crash".
Assignee
Comment 17•1 year ago
I wrote up a bug for changing the default. I think we're good here.