Closed Bug 1667997 Opened 4 years ago Closed 1 year ago

split crash reports into crashes, shutdown hangs, and content-process hangs

Categories

(Socorro :: General, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Attachments

(1 file)

Socorro collects "crash reports" for a variety of projects. However, because the mechanism of generating a crash report, what it contains, and getting it to Mozilla so we can look into it is a convenient way to get data about problems with processes, not all crash reports are actually crashes.

Currently, we're getting:

  1. crash reports -- a process has crashed, the client generated a minidump, and it gets sent along
  2. shutdownhangs -- a process hung and another process generated a minidump and sent it along
  3. content-process hangs -- I'm not sure what this is, but Gabriele said it's a thing
  4. warnings -- at one point, Fenix was sending warnings which weren't crashes, but were things they were keeping tabs on; I think we're not getting these anymore

This becomes problematic for analyzing and investigating the crash report data because everything is jumbled up together.

For example, the TopCrashers report for Firefox is full of ShutDownKill signatures. Shutdown hangs are a problem, but it's unhelpful to have them overwhelm the TopCrashers report and hide crash issues. (bug #1624946)

This tracker bug covers looking into the problem, figuring out a plan, shopping it around to stakeholders, and then executing on it.

Making this a P3 for now. I don't know when I'll get to it.

cc:ing Gabriele because we were talking about this recently.

Some thoughts:

  1. maybe we set up a processor rule that groups the reports into report types: crash, shutdownhang, etc
  2. maybe we adjust the TopCrashers report to filter on report type--it currently filters on platform and process type
  3. maybe we default everything to look at crash report type and you have to explicitly choose other types
  4. maybe we put the type in the crash signature
Priority: -- → P3

I'll write some super-search queries to show what I hope will be the end result and post them here as examples.

I have some notes on this and I want to capture them in the comments rather than have notes in multiple places.

We should add a processor rule that adds a "report_type" field (or some similar name) that specifies the report type. It's flexible so if we get it wrong, we can tweak things and reprocess.

Bumping this up to P2 because this could help a bunch.

Priority: P3 → P2

A quick refresh of how we'd like to categorize crashes (or rather reports given not all reports are crashes):

  1. Regular crashes
  2. Hangs, this would cover
    a. Browser shutdown hangs (example query)
    b. Content process shutdown hangs (example query)
    c. Non-shutdown hangs we'll explicitly flag (see bug 1826703)
    d. Content-process hangs which I don't remember if/how we deal with right now
Assignee: nobody → willkg
Status: NEW → ASSIGNED

I want to add a report_type field that will take keywords. We'll start it off with:

  • hang -- different kinds of hangs
  • crash -- regular crashes and anything that didn't get categorized as something else; e.g. it'll default to "crash"

We'll make this a public field. We'll index it as a keyword so it's available in search and aggregations.
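
A rough sketch of what such a rule could look like, for illustration only. The function names are hypothetical, and detecting hangs by looking for "ShutDownKill" in the signature is an assumption made for this example, not necessarily how the real processor rule decides. Anything not categorized falls back to "crash", as described above.

```python
def classify_report(processed_crash: dict) -> str:
    """Return a report_type keyword for a processed report (illustrative)."""
    signature = processed_crash.get("signature", "")
    # Assumption: shutdown hangs can be spotted by a ShutDownKill marker in
    # the signature. The real rule may use different fields entirely.
    if "ShutDownKill" in signature:
        return "hang"
    # Everything that isn't categorized as something else defaults to "crash".
    return "crash"


def add_report_type(processed_crash: dict) -> dict:
    """Set the report_type field on the processed report and return it."""
    processed_crash["report_type"] = classify_report(processed_crash)
    return processed_crash
```

Because the field is computed from data already in the report, changing the classification later only requires adjusting the rule and reprocessing.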

Gabriele: Does that work for you? I can't tell from comment #5 if you wanted to further break it down by different hang types or not.

Flags: needinfo?(gsvelto)

Yes, it's fine. There will be different types of hangs, but we don't need to tell them apart at that level; we can always facet on the hang type later (like we do with the crash reason for crashes). What we care about is that users can clearly tell hangs apart from crashes.

Flags: needinfo?(gsvelto)
Summary: [tracker] split crash reports into crashes, shutdown hangs, and content-process hangs → split crash reports into crashes, shutdown hangs, and content-process hangs

I've finished up most of the changes we need to do here.

The only thing left to figure out is how to transition from the old system (no report_type field indexed) to the new system (report_type field indexed). I think we have to treat all crash reports with no report_type as a "crash". That means the TopCrashers report for report type "crash" will still include hangs whenever old data falls within the selected range.
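
To make the transition behavior concrete, here is a minimal sketch of the "missing report_type means crash" fallback. The helper names and the in-memory list of dicts are assumptions for illustration; the real report queries the search index rather than filtering in Python.

```python
def effective_report_type(report: dict) -> str:
    """Treat reports indexed before report_type existed as "crash"."""
    return report.get("report_type") or "crash"


def select_reports(reports: list, wanted: str) -> list:
    """Keep only the reports whose effective type matches `wanted`."""
    return [r for r in reports if effective_report_type(r) == wanted]


reports = [
    # Old data: no report_type, even though this one is really a hang.
    {"signature": "IPCError-browser | ShutDownKill"},
    # New data: report_type set by the processor.
    {"signature": "IPCError-browser | ShutDownKill", "report_type": "hang"},
    {"signature": "OOM | small", "report_type": "crash"},
]

crashes = select_reports(reports, "crash")
```

Here `crashes` holds two reports, one of which is the old, untyped shutdown hang. That is the leak described above: old hangs show up under "crash" until they age out of the date range.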

If we think this is going to be confusing to everyone, the alternative is to land the processor and indexing changes now, let data with a report_type accumulate and then land the TopCrashers changes after we have 4 weeks of data--TopCrashers has Days values of 1, 7, 14, 28.

I'll finish this up the week of July 17th.

This needs testing on stage. I'm pretty sure I got the handling for old data in the Top Crashers report correct.

Further, before it gets deployed to production, I need to send an email to stability, crash-reporting-wg, and firefox-dev mailing lists about the changes to the Top Crashers report and how the old data is handled.

That will auto-deploy to the staging environment.

I need to verify the following on stage:

  1. does the topcrashers report work when crash reports don't have a report_type field (it's not in the index mapping)?
  2. (next week) does the topcrashers report work when some crash reports have a report_type field (index has the field) and some don't (last week's index doesn't have the field)?

I looked at stage and because there's no report_type data and the Top Crashers report defaults to report_type="crash", the Top Crashers report is empty. That will be terrifying to anyone looking at it for the first few weeks after it goes to production.

I'm going to change the default to report_type="any". After this goes to production, we can write up a bug for changing the default at some point in the future. Maybe in a month.
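
A small sketch of why the default matters while the index has no report_type data. The function name and the strict-match behavior are assumptions chosen to mirror what stage showed, not the actual webapp code.

```python
def top_crashers_filter(reports: list, report_type: str = "any") -> list:
    """Hypothetical filter: "any" skips filtering, other values match strictly."""
    if report_type == "any":
        return list(reports)
    return [r for r in reports if r.get("report_type") == report_type]


# Reports indexed before the field existed carry no report_type at all.
old_reports = [
    {"signature": "OOM | small"},
    {"signature": "IPCError-browser | ShutDownKill"},
]

strict_view = top_crashers_filter(old_reports, "crash")  # empty report
default_view = top_crashers_filter(old_reports)          # everything
```

With only old data in range, the "crash" view comes back empty while the "any" default still lists every report, which is the less alarming behavior for the first few weeks.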

I checked stage this morning and the filter appears to be working as expected.

I sent an email to the stability and crash-reporting-wg mailing lists about the upcoming change.

All the changes so far were deployed to prod just now in bug #1843869.

There won't be any report_type data in the search index until a new index is created this weekend so the "crash" and "hang" filters won't have any results until next week.

I'll keep this open until next week after I verify everything is working as expected and write up a bug to change the report_type default from "any" to "crash".

I wrote up a bug for changing the default. I think we're good here.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED