Closed Bug 808126 Opened 7 years ago Closed 7 years ago

Crash report provider for Firefox Health Report

Categories

(Toolkit :: Crash Reporting, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla20

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(2 files, 1 obsolete file)

Firefox Health Report is supposed to gather the following information pertaining to application crashes:

* Submitted crash count per day
* Pending crash count per day

I just talked with Ted and there are a few issues that need to be reconciled.

Crashes are stored one directory level up from the profile directory. So, crashes are shared by all profiles for a particular user account. If multiple profiles for the same user are sending crash report counts, we'll have double counting. Given the (presumed) low number of people that run multiple profiles, this may not be an issue.

There is also an issue of counting crashes. In the current implementation of the crash reporter, obtaining counts from Firefox itself may miss data. Specifically, users may delete or not save crash reports. Only the crash reporter application knows of every crash report. Knowing this, our current proposed solution of inspecting the crash reports directory doesn't feel robust enough to me. It *may* be good enough for initial roll-out of the feature. But, we should pursue more robust and complete crash counting.

I'm proposing a slight change to the data gathered. Instead of distinct measurements for daily and pending crash reports, I believe that we should instead record:

* Daily crash count
* Daily submitted crash count

The daily pending crash count can be inferred from the difference between the two.

Furthermore, I recommend that the crash reporter write a "log" of activity in the profile directory. It would likely record the time and ID of *every* crash as well as when crashes are submitted. This would provide a bullet-proof count of per-profile (not per-user) crashes and would deliver accurate data to FHR. There is a special case when the crash occurs creating a profile. We can probably have a supplemental log one directory up for these. We should consider exposing these through FHR as a separate metrics.

OK, I think I captured everything. I'm sure Ted will weigh in if I missed anything.

Daniel: can you please clarify what Metrics wants and whether we should pursue a phased roll-out (with partial data up front, etc)?
Flags: needinfo?(deinspanjer)
I'm fine with the slight modification to daily crash count and daily submitted crash count.

For the initial implementation, I think we should go ahead with the simple method of counting crashes in the reports directory.  We considered the possibility of double counting crashes for installations that have multiple installs, but we don't feel this is common enough to be likely to skew the numbers.  We will be able to verify that assumption by comparing the aggregate volume of submitted crashes per day for a specific build with the number of observed crashes from the Socorro system, so we'll know if it is a significant problem.

In the longer term, I love the idea of having crash reporter write out a log of crash activity into the profile directory.  This would eventually allow a much richer analysis of stability problems.  For instance, it could log the different types of reports: core crash, plugin crash, content crash, content hang, plugin hang.
If/when something like that form of logging comes into existence, we should definitely look at introducing it as new FHR data points.
Flags: needinfo?(deinspanjer)
I dig the proposed changes. As far as the UX goes, using that data will be a lot more simple than working backwards from the sum of crashCountPending and crashCountSubmitted to see how many crashes occurred in a single day.

Also with the log data it would be trivial to link to the signature on crash-stats so users can see before hand if there are any open bugzilla tickets related to the problem they are having. (by ID you mean the signature, right?)
Upon considering the consequences, I think the proposed changes are absolutely necessary. Each profile has their own set of add-ons, and if I recall correctly the ability to correlate crashes to add-ons is a major advantage to the FHR data. Though an analyst would have to weigh in if this case has any statistical significance.
Since we are versioning metrics measurements, we can always record the data we can easily get now then increment the version number when we change the collection semantics to be more robust. Analysts can make use of data today if it's useful or wait until the next version becomes available.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #1)
> For the initial implementation, I think we should go ahead with the simple
> method of counting crashes in the reports directory.  We considered the
> possibility of double counting crashes for installations that have multiple
> installs, but we don't feel this is common enough to be likely to skew the
> numbers.  We will be able to verify that assumption by comparing the
> aggregate volume of submitted crashes per day for a specific build with the
> number of observed crashes from the Socorro system, so we'll know if it is a
> significant problem.

Greg and I discussed this, and there are a few other problems with this as a source of data:
1) Crashes that are not submitted by the user are not counted. (If you uncheck the "submit a crash report" box on the crash reporter, the crash report is removed from your system and we have no record of it.)
2) There's no easy way to distinguish between browser and plugin crashes (or plugin hangs, for that matter). They are all stored in the same place.
Depends on: 810543
Plugin hangs at least aren't double reports any more, at least if you're counting usefull. They will show up as a single .extra file and multiple .dmp files, e.g.

12345678-1234-1234-123456789012.extra
12345678-1234-1234-123456789012.dmp
12345678-1234-1234-123456789012-browser.dmp
12345678-1234-1234-123456789012-flash1.dmp
12345678-1234-1234-123456789012-flash2.dmp

Also if you have more than 10 pending reports, we throw them away since they are just eating disk space.

Trevor, we already correlate crashes to addons quite reliably in crash-stats: I don't think it's a FHR goal to do that, and indeed given the current set of data there isn't any way to do that because all FHR has is counts, not any signature data. Crash IDs and signatures are entirely separate, and we should talk along with the SUMO folk about whether/how crash reports can be used to help individual users; there's a set of longstanding bugs about this.
I'll look at this.
Assignee: nobody → rnewman
Status: NEW → ASSIGNED
Can I get confirmation from the metrics folks that a count of crash reports is still useful?

From my reading of this bug, it sounds like we'll be getting inaccurate, incomplete counts -- only crashes that the user submitted, no more than ten (though I have 163 crash reports in my pending dir), no distinction between browser and plugin crashes.

The alternative is that we add logging to crash reporter for our first stab at this.

Are crash reports ever cleaned up, thus causing the count to fall from day to day?
Flags: needinfo?(deinspanjer)
There's no automated cleanup of the submitted directory, but there is a "Remove all Reports" button in about:crashes that users can use to remove all local records of submitted crashes. The crash reporter client will attempt to clean up the pending directory to limit it to 10 crash reports, but that only happens when you experience a browser crash, not just a plugin crash.
rnewman: Confirming that count of crash reports is still useful. We would like to have a count for pending as well as submitted in a daily data point. An example of an install that submitted 1 report and has 2 pending that all happened on 9/25/2012 would be (format is from metrics prototype):

"2012-09-25": { "crashCountSubmitted": 1, "crashCountPending": 2 }
Hey guys. Just wanted to pop in and let you know that we've got a resident crash expert on SUMO willing to work with you on this. I'm adding Tyler Downer to this bug. Please include him in the crash discussions. It would probably be really helpful to have him come to the meetings as well. His crash-fu is epic.
… back on the stack; other things have my attention.
Assignee: rnewman → nobody
Status: ASSIGNED → NEW
Attached patch Provider for crash reports, v1 (obsolete) — Splinter Review
Still needs lots of work.

Ted: could you validate that I'm obtaining crash reports in the most robust way possible in the current world?

Also, this needs to land in Firefox 20. So, ignore all my build patches until after this one ;)
Attachment #695559 - Flags: feedback?(ted)
This needs to land in 20. I'm assuming Ted won't get around to it within the next few hours. We'll backport any changes he may find.

Long term I'd like some of this code to live in crash reporter. We don't have time to land it there now.
Assignee: nobody → gps
Attachment #695559 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #695559 - Flags: feedback?(ted)
Attachment #698464 - Flags: review?(rnewman)
Attachment #698464 - Flags: feedback?(ted)
Comment on attachment 698464 [details] [diff] [review]
Provider for crash reports, v2

Review of attachment 698464 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine to me (albeit with a cursory Sunday review).
Attachment #698464 - Flags: review?(rnewman) → review+
Sadness.

If the directory does not exist (which should never happen in Firefox because it creates the Crash Reports directory on startup - correct me if I am wrong Ted), this will cause collectConstantData() to reject. This will result in an error being logged in Collector. i.e. it will only impact the provider and nothing else. Hopefully.
Attachment #698538 - Flags: review?(rnewman)
Comment on attachment 698538 [details] [diff] [review]
Part 2: Work around OS.File limitations, v1

Review of attachment 698538 [details] [diff] [review]:
-----------------------------------------------------------------

Sadness.
Attachment #698538 - Flags: review?(rnewman) → review+
https://hg.mozilla.org/mozilla-central/rev/03759402e433
https://hg.mozilla.org/mozilla-central/rev/634951119fef
https://hg.mozilla.org/mozilla-central/rev/bc1f7b5b8378
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [fixed in services]
Target Milestone: --- → mozilla20
Comment on attachment 698464 [details] [diff] [review]
Provider for crash reports, v2

Review of attachment 698464 [details] [diff] [review]:
-----------------------------------------------------------------

Longer-term you should file a followup to add code to actually collect the metrics you care about here. It'd be fairly simple to add a count of browser crashes, plugin crashes, content process crashes, and crashes submitted to the crash reporter code.

::: services/healthreport/modules-testing/utils.jsm
@@ +194,5 @@
> +           OS.Constants.libc.S_IRGRP | OS.Constants.libc.S_IROTH;
> +  } else {
> +    paths.push("pending");
> +    filename = id + ".dmp";
> +    mode = OS.Constants.libc.S_IRUSR | OS.Constants.libc.S_IWUSR;

If you're creating fake pending crashes you should create a matching .extra file for consistency.
Attachment #698464 - Flags: feedback?(ted) → feedback+
Flags: needinfo?(deinspanjer)
You need to log in before you can comment on or make changes to this bug.