
abnormally high rate of bad data in 'time since last crash'

RESOLVED WONTFIX

Status

Socorro
General
RESOLVED WONTFIX
8 years ago
a month ago

People

(Reporter: chris hofmann, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

8 years ago
looking at a sample of a week's worth of 3.5 beta 4 data I see the following:

only 112665 of 271776 reports contained a time since last crash between 0 and two years.

23833 of the reports have \N in the time since last crash field
104 have negative numbers
most of the rest of the missing or invalid reports have blank data in the field

this has implications for mtbf reporting and we probably need to investigate several questions to understand what is going on. we can use this as a tracking bug for these questions

what causes \N and blanks to appear in the reports?

what causes negative numbers to appear in the reports?

if clock resets or invalid times on the PC account for strange times in the crash reports, are these conditions also associated with firefox bugs and crashes? e.g. any place firefox/gecko expects some sane time value and an implausible time value gets grabbed from the system, there may be bugs lurking
(Reporter)

Comment 1

8 years ago
talked to griswold last night about two things for the mtbf report

if time since last crash is greater than zero and less than 2 years it's ok to use

and 

we also might want to add some data reliability note on the report like
"112665 of 271776 reports for 3.5 contained valid times"

also tracking the volume of crashes correlated to the size of the installed base for each day of a major release would be valuable on the MTBF report as an additional measure of 'crashiness' until we get a better understanding of what these 'time since last crash' values really mean.
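A minimal sketch of the filter and the data reliability note suggested above, in Python. The function names and the exact two-year constant (2 * 365 days, in seconds) are assumptions for illustration, not existing Socorro code.

```python
# Assumed definition of "2 years": 2 * 365 days, i.e. 63,072,000 seconds.
TWO_YEARS_SECONDS = 2 * 365 * 24 * 60 * 60


def is_valid_time_since_last_crash(value):
    """Return True when value is strictly between 0 and two years.

    None stands for a missing (\\N / NULL) value and is never valid.
    """
    return value is not None and 0 < value < TWO_YEARS_SECONDS


def reliability_note(values, version):
    """Build a note like '112665 of 271776 reports for 3.5 contained valid times'."""
    valid = sum(1 for v in values if is_valid_time_since_last_crash(v))
    return "%d of %d reports for %s contained valid times" % (
        valid, len(values), version)
```

An MTBF script could call `is_valid_time_since_last_crash` on each report before using the value, and print `reliability_note(...)` next to the result so readers can see how much data was excluded.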
Here's the path of "time since last crash" through Socorro:

1 - it comes from the client packaged as a string value of digits coupled with the name "SecondsSinceLastCrash" in an http POST to Socorro Collector.

2 - Socorro Collector accepts the form data, places the data as key/value pairs into a dictionary and writes them out as a json file.  No transformation or validation of the data takes place in Collector.  If no "SecondsSinceLastCrash" field exists in the form, the resultant json file will have no "SecondsSinceLastCrash" key.

3 - some time later, Socorro Processor reads the json file and constructs the reports record from it.  The key "SecondsSinceLastCrash" is accessed and cast to integer. Python handles arbitrarily large integers, so it does not suffer overflow errors.  If the "SecondsSinceLastCrash" key does not exist, it creates an entry for it using the Python None type.

4 - Socorro Processor saves the report record to the database in a column of type 'integer'.  Type integer in PostgreSQL is signed and uses 4 bytes.  (Ad hoc test shows that it is 4 bytes long even on a 64-bit install of PostgreSQL).  If the value is of type None, NULL will be used as the value for the column.
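Steps 3 and 4 above could be sketched roughly as follows. This is an illustration only, not the actual Processor code: the helper names are hypothetical, and the range check is only there to show where a 4-byte signed column could reject a value that Python itself holds without trouble.

```python
import json

# Range of a signed 4-byte integer, as used by PostgreSQL's 'integer' type.
PG_INT_MIN = -2**31
PG_INT_MAX = 2**31 - 1


def extract_seconds_since_last_crash(raw_json_text):
    """Hypothetical helper: read the collector's json and cast the field to int.

    Returns None when the key is absent, which ends up as NULL (\\N) in the
    database. A non-numeric string would raise ValueError here.
    """
    raw_crash = json.loads(raw_json_text)
    raw_value = raw_crash.get("SecondsSinceLastCrash")
    if raw_value is None:
        return None
    return int(raw_value)


def fits_pg_integer(value):
    """Return True when value can be stored in a 4-byte signed column."""
    return value is None or PG_INT_MIN <= value <= PG_INT_MAX
```

Note that a negative string such as "-12" casts cleanly to an integer, so nothing in this path would stop negative values from reaching the database.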
For reference, SecondsSinceLastCrash is calculated here:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#196
XP_TTOA is either ltoa or _i64toa on Windows, and sprintf on other platforms. The code is kind of ugly since it tries very hard not to allocate memory in the exception handler.
(Reporter)

Comment 4

8 years ago
I was injecting some extra noise into my analysis script, so it actually looks like there are no 'blank' TimeSinceLastCrash fields in the database, so no need to figure those out.

weeding out the bad data I was looking at before, it now looks like:

135888 total reports in the one week of 3.5b4 data in the sample I'm looking at
23033 \N reports
104   negative number time since last crash
86    reports with time since last crash older than 2 years
(In reply to comment #4)
> 23033 \N reports

Note that this number will necessarily be blank for the first crash reported by a user.
(Reporter)

Comment 6

8 years ago
ted's comment 5 sounds interesting and probably accounts for most of the \N's
16% of our crashes coming from entirely new users might be a reasonable number.
It would also be interesting to track that over time to see the dynamics.

I also looked at how many reports we would toss if we closed the window tighter.
86   outside a 63000000 second or two year window   
148  outside a 32000000 second or 1 year window
576  outside a 16000000 second or 6 month window
2248  outside a 8000000 second or 3 month window
5762  outside a 4000000 second or 6 week window
11248 outside a 2000000 second or a 3 week window 
18256 outside a 1000000 second or week and a half window
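The counts above can be reproduced with something like the following sketch. It assumes "outside a window" means a value larger than the window, so negative and \N values are not counted (this matches the 86 reports older than 2 years listed in comment 4). The function name is hypothetical.

```python
# Window sizes in seconds, taken from the list above.
WINDOWS_SECONDS = [63000000, 32000000, 16000000, 8000000,
                   4000000, 2000000, 1000000]


def count_outside_windows(values, windows=WINDOWS_SECONDS):
    """Map each window to the number of values larger than that window.

    None (missing) and negative values never count as outside a window.
    """
    counts = {}
    for window in windows:
        counts[window] = sum(
            1 for v in values if v is not None and v > window)
    return counts
```

Running this on each day's reports would give the distribution over time that the comment asks about.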

It looks like a very high pct. of our reports are coming from users that say the last time they crashed was less than a month ago.  That would make sense if the time-since-last-crash data gets put with program files (not user profile) and if 3.5b4 users installed the beta into a new program directory.

It would also be interesting to track the distribution of time-since-last crash like above and also uptime.
The time-since-last-crash data is stored near the profile data (but not in the profile), so it's global across all Firefox versions on the same OS user profile.
Choffman, Ted - Is this bug still an issue?
(Reporter)

Comment 9

8 years ago
yeah, the data is still bad, but I'm not sure what we can/should do about it.  any scripts that use it in calculating uptime just need to be aware.
I don't think there's anything we can do about it aside from sanitizing the data more in Socorro. We could just throw out data that looks really wrong.
(Reporter)

Comment 11

8 years ago
I think it's better to just flag the data when doing any analysis on it.  It could be used for such things as determining how many people have bad clock settings or some other problem on their system.  In some cases bad system time on the crashing PC might be a contributing factor in the crash or in general problems with that PC.
(Reporter)

Comment 12

7 years ago
interesting study on bad clock settings on PCs

http://www.codinghorror.com/blog/2007/01/keeping-time-on-the-pc.html
(Assignee)

Updated

6 years ago
Component: Socorro → General
Product: Webtools → Socorro
Selena: can you look at recent crashes and see if this is still true? If not we can WONTFIX.
Flags: needinfo?(sdeckelmann)
From a sample of 296958 raw crashes on 4/7/2014 (GMT):
* 87641 contained NULL (aka \N), or about 30% 
* 134 records contained a negative value

That's significantly more records than :choffman reported earlier.

I sampled a couple other days and still got around 30% containing NULL for SecondsSinceLastCrash.
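For comparing samples of different sizes, the NULL share can be computed directly from the counts quoted in this bug. This is a small sketch with a hypothetical helper name; the numbers are the ones from this comment and from comment 4.

```python
def null_share(null_count, total_count):
    """Return the percentage of reports whose value is NULL (\\N)."""
    return 100.0 * null_count / total_count


# 4/7/2014 sample from this comment
print(round(null_share(87641, 296958), 1))
# one week of 3.5b4 data from comment 4
print(round(null_share(23033, 135888), 1))
```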

From earlier comments, this seems like a UI labeling concern rather than a data transformation issue at this point... Let me know if there's something more we'd like to do with analyzing the data.
Flags: needinfo?(sdeckelmann)
crash-pings help with this, since we get a received timestamp for every crash on the server
Status: NEW → RESOLVED
Last Resolved: a month ago
Resolution: --- → WONTFIX