Closed Bug 470620 Opened 14 years ago Closed 13 years ago

Create a new stat: crashes per user

Categories

(Socorro :: General, task)

task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: samuel.sidler+old, Assigned: lars)

References

Details

(Whiteboard: cloud)

I'd like to have a new report generated that tells us crashes per user. This is different than what can be gathered from the MTBF reports now.

Currently, MTBF reports also tell us: Firefox 3.0.4- MTBF 33425 seconds based on 2200917 crash reports of 1926046 users (blackboxen) from period between 2008-11-05 and 2009-01-03

From that, we can do some quick division and see how many crashes happen per user who reports them.
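As a quick illustration of that division, here is a minimal sketch using only the two figures quoted from the MTBF report above. The variable names are made up for this example.

```python
# Figures quoted from the MTBF report above
# (Firefox 3.0.4, period 2008-11-05 to 2009-01-03).
crash_reports = 2_200_917
reporting_users = 1_926_046  # "blackboxen" in the report

# Crashes per user, counting only users who submitted at least one report.
crashes_per_reporting_user = crash_reports / reporting_users
print(f"{crashes_per_reporting_user:.2f} crashes per reporting user")
```

This works out to roughly 1.14 crashes per reporting user over the whole period, which is a different quantity from the per-day, per-ping stat requested below.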

The new stat I'd like to see would take total crashes on a given day for a given version and divide it by the total number of blocklist pings we had for that day. We should also be able to drill down by OS.

For example (all data is fake), on December 1:
  * 3000 crashes for Firefox 3.0.4 Mac
  * 1m blocklist pings for Firefox 3.0.4 Mac.
  * 3000/1,000,000 = 0.003 crashes per user
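The calculation in the example above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout keyed by (day, version, OS), and the data are all hypothetical, and nothing here reflects how Socorro actually stores crash or blocklist ping counts.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (day, version, os) -- hypothetical layout


def crashes_per_user(crashes: int, blocklist_pings: int) -> Optional[float]:
    """Total crashes divided by total blocklist pings for the same day."""
    if blocklist_pings <= 0:
        return None  # no ping data, so the ratio is undefined
    return crashes / blocklist_pings


# Fake data, matching the example above.
crash_counts: Dict[Key, int] = {("December 1", "3.0.4", "Mac"): 3000}
ping_counts: Dict[Key, int] = {("December 1", "3.0.4", "Mac"): 1_000_000}

# One ratio per (day, version, os), which gives the drill-down by OS.
ratios = {
    key: crashes_per_user(crash_counts.get(key, 0), pings)
    for key, pings in ping_counts.items()
}
print(ratios)
```

Summing the crash and ping counts across all OS values for a given day and version before dividing would give the per-version number.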

Obviously this stat doesn't take into account the number of users who don't report crashes at all, but I don't care about that. I want a number that we can track over time.

I'd also like to graph this stat and provide a tab for the daily data in use.

chofmann might say we should average this data over a ten day window. He might be right. But I'll let him sell it... Doing so would take out the weird bumps we're seeing with MTBF right now. (We should probably move to a ten day floating window for MTBF as well...)
In sam's example, the sessions (1m) that ended in crash reports (3000) are really interesting data, and calculating this per day would be interesting.

Smoothing this by accumulating the data for the last 10 days (or maybe 7 days to avoid the decline in use over weekends) would also be interesting data to look at.

-chofmann
(In reply to comment #1)
> smoothing this by accumulating the data for the last 10 days (or maybe 7 days
> to avoid the decline in use over weekends) would also be interesting data to
> look at.

Not sure you need to avoid the decline in use over the weekends. The amount of submitted crashes should go down with it, really.
The purpose of averaging over a longer period is to smooth out the data so the trends are easier to see. Having some ten day periods with two weekends, and other ten day periods with only one weekend, will introduce noise in the smoothing.

A 7 or 14 day sliding window means we will have more uniform smoothing. It might also be interesting to show the individual daily calculations as points on the graph, and then show the 7 and 14 day smoothed series as lines, using the same color for each release.
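To make the weekend argument concrete, here is a minimal sketch of a trailing window. It assumes the smoothed value is the summed crashes divided by the summed pings over the window, which is one reading of "accumulating the data"; averaging the daily ratios would be another. The function name and all numbers are invented for the illustration.

```python
def windowed_crashes_per_user(daily_crashes, daily_pings, window=7):
    """Trailing-window crashes per user: summed crashes / summed pings."""
    smoothed = []
    for end in range(len(daily_crashes)):
        start = end - window + 1
        if start < 0:
            smoothed.append(None)  # not enough history for a full window
            continue
        crashes = sum(daily_crashes[start:end + 1])
        pings = sum(daily_pings[start:end + 1])
        smoothed.append(crashes / pings if pings else None)
    return smoothed


# Fake data: three weeks starting on a Monday, with lighter weekend usage.
week_crashes = [3000] * 5 + [1200] * 2
week_pings = [1_000_000] * 5 + [600_000] * 2
daily_crashes = week_crashes * 3
daily_pings = week_pings * 3

seven_day = windowed_crashes_per_user(daily_crashes, daily_pings, window=7)
ten_day = windowed_crashes_per_user(daily_crashes, daily_pings, window=10)
```

With this fake data every 7 day window contains exactly one weekend, so `seven_day` is flat once it has enough history, while `ten_day` moves up and down depending on whether the window holds two, three, or four weekend days.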
We no longer collect user IDs or data that is (reasonably) easy to connect to a particular user. Thus we cannot add this functionality.
not precisely accurate.

the clients themselves could be taught to keep track of their crash count and we could do some evil math to try to discard their older crash counts.
timeless makes a good point in comment 5. We could do other things on the client side, too, such as looking at sequences of crashes per user. We are apparently already able to store crash data on the client (see bug 495700).
Extending from comment 5 and comment 6. 

We would have to get the client to send us another field in the crash report if we want to summarize such things for our own purposes. Right now, we have header lines for OS, CPU, Crash, Module. We would need one more. Maybe RecentCrashTimes.

When it is sent, RecentCrashTimes would hold some reasonable number of date-timestamps as known to the client. Some hard count (50?), which could be held in a circular buffer maybe. On the server side, we could count only the ones that are 'recent' by our own definition. If it matters, we can use the offset between the server's reports:date_processed and the most recent stamp to approximate the 'actual' crash dates, since the client won't have the actual date_processed to give us (and we wouldn't trust it anyway).
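The RecentCrashTimes idea could be sketched roughly like this. Only the field name and the "50?" cap come from the comment above. The class, the function, the 30 day definition of 'recent', and the use of naive datetimes are assumptions made for the illustration, not an existing client or Socorro API.

```python
from collections import deque
from datetime import datetime, timedelta

MAX_STAMPS = 50  # the "hard count (50?)" floated above


class ClientCrashLog:
    """Hypothetical client-side store: a circular buffer of crash times."""

    def __init__(self, max_stamps=MAX_STAMPS):
        # A deque with maxlen silently drops the oldest entry when full.
        self._stamps = deque(maxlen=max_stamps)

    def record_crash(self, client_time):
        self._stamps.append(client_time)

    def recent_crash_times(self):
        """Value that would be sent as the RecentCrashTimes field."""
        return list(self._stamps)


def count_recent_crashes(recent_crash_times, date_processed,
                         recent=timedelta(days=30)):
    """Server side: count stamps that are 'recent' by the server's definition.

    The client clock is not trusted. The newest stamp is assumed to be the
    crash being processed now, so every stamp is shifted by the same offset.
    """
    if not recent_crash_times:
        return 0
    offset = date_processed - max(recent_crash_times)
    cutoff = date_processed - recent
    return sum(1 for stamp in recent_crash_times if stamp + offset >= cutoff)


# A client whose clock is years off still yields usable relative times.
log = ClientCrashLog()
for stamp in (datetime(2000, 1, 1), datetime(2000, 1, 20),
              datetime(2000, 2, 25), datetime(2000, 3, 1)):
    log.record_crash(stamp)

recent_count = count_recent_crashes(log.recent_crash_times(),
                                    date_processed=datetime(2009, 6, 1))
```

Because only the differences between stamps are used, a wrong client clock does not matter as long as it runs at a normal rate.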
I don't care about exact data... That's not what I filed here and when I filed this we no longer had data on individual users.

What I want is exactly as I said in comment 0. Take the total number of crashes submitted on a day for a version and divide it by the total number of blocklist pings for that same 24 hour period.

Let's not over-engineer this.
chofmann made this graph which is basically what I want (but with more data, of course): http://people.mozilla.com/~chofmann/crash-data/crashes-per-user.png
Whiteboard: cloud
Target Milestone: --- → Future
This is probably one of the most important reports we don't have.  Bumping priority.
Assignee: nobody → lars
Severity: normal → critical
Target Milestone: Future → 1.2
Target Milestone: 1.2 → 1.3
Changing title. Let's not overload "cpu"...

chart should look something like this.
http://people.mozilla.com/~chofmann/crash-data/crashes-per-1-users.png
Summary: Create a new stat: cpu (crashes per user) → Create a new stat: crashes per user
So pretty!
Looking at the data that shows up on metrics and the times assigned to incoming crash reports during the day, it looks like they might not be aligned.

For instance, I see 2009-11-10 data already on metrics several hours before midnight, maybe as a result of using a different time zone for ADU data.

We should try to align the time zones as best we can, or at least understand how that might throw things off.
From CrashKill meeting:

The following wiki page is created manually and will start to be added to the header of the CrashKill notes.

https://wiki.mozilla.org/CrashKill/Crashr
Notes on that wiki page:
  * Metric: # crashes / # ADU
  * Metrics granularity: crashes per 100 users
  * For throttled builds the metric is adjusted; otherwise the adjusted column is empty
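The notes above can be read as the following calculation. This is a guess at the arithmetic, not a description of how the wiki page is actually produced: in particular, the notes do not say how the adjustment is done, and scaling by the inverse of the throttle fraction is an assumption. The function and parameter names are hypothetical.

```python
def crashes_per_100_adu(crashes, adu, throttle_fraction=None):
    """Return (raw, adjusted) crashes per 100 active daily users.

    throttle_fraction is the share of submitted crashes that is kept,
    e.g. 0.1 for 10%. Leave it as None for unthrottled builds.
    """
    raw = 100.0 * crashes / adu
    # Assumption: "adjusted" means scaling back up by the throttle rate.
    # The adjusted column stays empty (None) for unthrottled builds.
    adjusted = raw / throttle_fraction if throttle_fraction else None
    return raw, adjusted


# Fake numbers, for illustration only.
unthrottled = crashes_per_100_adu(3000, 1_000_000)
throttled = crashes_per_100_adu(3000, 1_000_000, throttle_fraction=0.1)
```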
Can anyone comment on the possible time alignment problem mentioned in comment 13?

Are ADUs on GMT and Socorro timestamps on PST, or something like that?
ADUs are on GMT.
It's turning out to be hard to use the ADU and crash volume numbers right now, since ADUs are reported in GMT and the time snapshots of data I have are in PST.

See the prototype crash per user report for various releases at 
https://wiki.mozilla.org/CrashKill/Crashr#Release_Snapshots

this has the biggest effect when releases ramp up by a significant number of users in a single day

have we thought about converting all our collection systems to the same timezone?
for the purposes of the crashes per user reports we could 

  just calculate crash numbers using the cut off time of 4pm pacific each day.

or we could

   change socorro over to using gmt 

The latter has the advantage of syncing all the data up with ADU collection and maybe other data gathering systems.
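Both options amount to bucketing crashes by GMT day, since 4pm Pacific standard time is midnight GMT. A minimal sketch, assuming crash timestamps are recorded in PST with a fixed UTC-8 offset (this ignores daylight saving time, when the cutoff would be 5pm Pacific); the names and sample timestamps are invented:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fixed offset for Pacific standard time; an assumption that ignores DST.
PST = timezone(timedelta(hours=-8), "PST")


def gmt_day(local_stamp):
    """Return the GMT calendar day that a timezone-aware stamp falls on."""
    return local_stamp.astimezone(timezone.utc).date()


# Fake crash timestamps as a PST-based system would record them.
crash_times_pst = [
    datetime(2009, 11, 9, 15, 59, tzinfo=PST),  # 23:59 GMT on 2009-11-09
    datetime(2009, 11, 9, 16, 0, tzinfo=PST),   # 00:00 GMT on 2009-11-10
    datetime(2009, 11, 10, 9, 30, tzinfo=PST),  # 17:30 GMT on 2009-11-10
]

# Crash counts per GMT day, comparable with ADU counts reported in GMT.
crashes_by_gmt_day = Counter(gmt_day(stamp) for stamp in crash_times_pst)
```

Two of the three fake crashes carry the same PST date but land on different GMT days, which is the kind of misalignment described in the comments above.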
Blocks: 534697
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro