Looking at a sample of one week's worth of 3.5 beta 4 data, I see the following: only 112665 of 271776 reports contained a time since last crash between 0 and two years. 23833 of the reports have \N in the time since last crash field, 104 have negative numbers, and most of the rest of the missing or invalid reports have blank data in the field. This has implications for MTBF reporting, and we probably need to investigate several questions to understand what is going on. We can use this as a tracking bug for those questions: What causes \N and blanks to appear in the reports? What causes negative numbers to appear in the reports? If clock resets or invalid times on the PC account for strange times in the crash reports, are these conditions also associated with Firefox bugs and crashes? E.g. anywhere Firefox/Gecko expects a rational time value and an irrational time value gets grabbed from the system, there may be bugs lurking.
Talked to griswold last night about two things for the MTBF report. First, if time since last crash is greater than zero and less than 2 years, it's OK to use, and we also might want to add a data-reliability note on the report like "112665 of 271776 reports for 3.5 contained valid times". Second, tracking the volume of crashes correlated to the size of the installed base for each day of a major release would be valuable on the MTBF report as an additional measure of 'crashiness', until we get a better understanding of what these 'time since last crash' values really mean.
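The validity rule discussed above (greater than zero and less than two years) could be sketched like this; the function name and the exact two-year constant are illustrative, not from Socorro itself:

```python
# Hypothetical sketch of the proposed MTBF validity filter.
TWO_YEARS_SECONDS = 2 * 365 * 24 * 60 * 60  # 63,072,000

def is_valid_time_since_last_crash(seconds):
    """Return True if a reported 'time since last crash' is usable for MTBF."""
    if seconds is None:  # \N / missing in the database
        return False
    return 0 < seconds < TWO_YEARS_SECONDS

# The data-reliability note could then be computed as:
# valid = sum(1 for s in samples if is_valid_time_since_last_crash(s))
```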
Here's the path of "time since last crash" through Socorro:

1 - It comes from the client, packaged as a string value of digits coupled with the name "SecondsSinceLastCrash", in an HTTP POST to Socorro Collector.

2 - Socorro Collector accepts the form data, places the data as key/value pairs into a dictionary, and writes them out as a JSON file. No transformation or validation of the data takes place in Collector. If no "SecondsSinceLastCrash" field exists in the form, the resulting JSON file will have no "SecondsSinceLastCrash" key.

3 - Some time later, Socorro Processor reads the JSON file and constructs the reports record from it. The key "SecondsSinceLastCrash" is accessed and cast to integer. Python handles arbitrarily large integers, so it does not suffer overflow errors. If the "SecondsSinceLastCrash" key does not exist, it creates an entry for it using the Python None type.

4 - Socorro Processor saves the report record to the database in a column of type 'integer'. Type integer in PostgreSQL is signed and uses 4 bytes. (An ad hoc test shows that it is 4 bytes long even on a 64-bit install of PostgreSQL.) If the value is of type None, NULL will be used as the value for the column.
For reference, SecondsSinceLastCrash is calculated here: http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#196 XP_TTOA is either ltoa or _i64toa on Windows, and sprintf on other platforms. The code is kind of ugly since it tries very hard not to allocate memory in the exception handler.
I was injecting some extra noise into my analysis script; it actually looks like there are no 'blank' TimeSinceLastCrash fields in the database, so no need to figure those out. Weeding out the bad data I was looking at before, it now looks like: 135888 total reports in the one week of 3.5b4 data in the sample I'm looking at; 23033 \N reports; 104 reports with a negative time since last crash; 86 reports with a time since last crash older than 2 years.
(In reply to comment #4) > 23033 \N reports Note that this number will necessarily be blank for the first crash reported by a user.
Ted's comment 5 sounds interesting and probably accounts for most of the \N's; 16% of our crashes coming from entirely new users might be a reasonable number. It would also be interesting to track that over time to see the dynamics. I also looked at how many reports we would toss if we closed the window tighter: 86 outside a 63000000-second (two year) window; 148 outside a 32000000-second (one year) window; 576 outside a 16000000-second (6 month) window; 2248 outside an 8000000-second (3 month) window; 5762 outside a 4000000-second (6 week) window; 11248 outside a 2000000-second (3 week) window; 18256 outside a 1000000-second (week and a half) window. It looks like a very high percentage of our reports come from users who say the last time they crashed was less than a month ago. That would make sense if the time-since-last-crash data gets put with program files (not the user profile) and if 3.5b4 users installed the beta into a new program directory. It would also be interesting to track the distribution of time since last crash like the above, and also uptime.
The time-since-last-crash data is stored near the profile data (but not in the profile), so it's global across all Firefox versions on the same OS user profile.
Choffman, Ted - Is this bug still an issue?
yeah, the data is still bad, but I'm not sure what we can/should do about it. any scripts that use it in calculating uptime just need to be aware.
I don't think there's anything we can do about it aside from sanitizing the data more in Socorro. We could just throw out data that looks really wrong.
I think it's better to just flag the data when doing any analysis on it. It could be used for things such as determining how many people have bad clock settings or some other problem on their system. In some cases, bad system time on the crashing PC might be a contributing factor in the crash, or in general problems with the crashing PC.
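The "flag, don't discard" approach could look something like this minimal sketch (the category names and the two-year cutoff are hypothetical, chosen to match the cases discussed in this bug):

```python
# Hypothetical sketch: classify rather than drop suspect values, so analyses
# can count bad-clock machines instead of silently losing them.
TWO_YEARS_SECONDS = 63_000_000  # ~2 years, per the window used earlier

def classify_time_since_last_crash(seconds):
    if seconds is None:
        return "missing"          # \N: e.g. the user's first crash report
    if seconds < 0:
        return "negative"         # likely a clock reset on the client
    if seconds > TWO_YEARS_SECONDS:
        return "implausibly_old"  # bad system time or stale state
    return "ok"
```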
Interesting study on bad clock settings on PCs: http://www.codinghorror.com/blog/2007/01/keeping-time-on-the-pc.html
Selena: can you look at recent crashes and see if this is still true? If not we can WONTFIX.
From a sample of 296958 raw crashes on 4/7/2014 (GMT): * 87641 contained NULL (aka \N), or about 30% * 134 records contained a negative value. That's significantly more records than :choffman reported earlier. I sampled a couple of other days and still got around 30% containing NULL for SecondsSinceLastCrash. From earlier comments, this seems like a UI labeling concern rather than a data transformation issue at this point. Let me know if there's something more we'd like to do with analyzing the data.
Crash pings help with this, since we get a received timestamp for every crash on the server.