Closed Bug 651543 Opened 14 years ago Closed 14 years ago

Analyze duplicate data

Categories

(Socorro :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: laura, Unassigned)

As a result of bug 629088, we now have data on which crashes look like dupes (according to our first-run algorithm, anyway). I've noticed from looking at the UI on staging that some crashes have a lot of dupes and some have none. Finding the pattern may help to solve the underlying problem. Let's run some queries against the PostgreSQL database to slice dupe and non-dupe crashes in different ways: by version, by platform, perhaps by time since startup? (Ted noticed at least one of the dupe crashes was a startup crash.) We could also do a more complex clustering analysis using HBase. Crashkill team: let us know what you think.
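As a rough illustration of the "slice dupe and non-dupe crashes" idea, here is a minimal Python sketch that computes the fraction of dupes per group. It works on in-memory rows rather than the real database; the field names (`platform`, `version`, `is_dupe`) and the sample rows are assumptions made up for this example, not the actual Socorro schema.

```python
from collections import defaultdict


def dupe_rate_by(crashes, key):
    """Return {group: fraction of crashes flagged as dupes} for the given key."""
    totals = defaultdict(int)
    dupes = defaultdict(int)
    for crash in crashes:
        group = crash[key]
        totals[group] += 1
        if crash["is_dupe"]:
            dupes[group] += 1
    return {group: dupes[group] / totals[group] for group in totals}


# Made-up rows standing in for an export from the database (hypothetical fields).
sample = [
    {"platform": "win", "version": "4.0", "is_dupe": True},
    {"platform": "win", "version": "4.0", "is_dupe": False},
    {"platform": "mac", "version": "4.0", "is_dupe": False},
    {"platform": "win", "version": "3.6", "is_dupe": True},
]

print(dupe_rate_by(sample, "platform"))
print(dupe_rate_by(sample, "version"))
```

The same function can be reused for any grouping column, which makes it easy to compare several slices side by side before deciding which ones deserve a proper report.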
Yeah, some additional analysis would definitely help to figure out the problem. We notice this a lot on mozilla-central/nightly builds, where the volume is low and the dups are easier to spot. The proposed thesis, that dups are far more prevalent on some signatures than on others, fits with anecdotal examination of the data. It would be good to see whether the pattern/volume continues from nightlies into beta and final release, and also through the first few days of the unthrottled 4.0 release data collection. Josh also mentioned the possibility of looking for OS version dependencies. win, mac, linux would be interesting, but it would probably be more useful to go down to OS version info, rounding off to the minor level, like win 5.1, win 6.0, win 6.1, mac 10.4, mac 10.5, mac 10.6, lin 2.6.
Also, correlation to startup crashes vs. other crashes: a high percentage of crashes happen within 3 minutes of startup.
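The "rounding off to minor level" grouping could look something like the sketch below. The input strings (`"Windows NT"`, `"Mac OS X"`, `"Linux"` and the version formats) are assumptions about what the crash data contains; real reports may need extra cleanup before this works.

```python
# Hypothetical mapping from OS names in crash reports to the short labels
# used in the comment above (win / mac / lin).
PLATFORM_PREFIX = {"Windows NT": "win", "Mac OS X": "mac", "Linux": "lin"}


def os_bucket(os_name, os_version):
    """Reduce an OS name and full version string to e.g. 'win 5.1'."""
    prefix = PLATFORM_PREFIX.get(os_name, os_name.lower())
    # Keep only the first two dot-separated components (major.minor).
    major_minor = ".".join(os_version.split(".")[:2])
    return f"{prefix} {major_minor}"


print(os_bucket("Windows NT", "5.1.2600 Service Pack 3"))
print(os_bucket("Mac OS X", "10.6.7 10J869"))
print(os_bucket("Linux", "2.6.35-28-generic"))
```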
We'd expect the percentage of startup crashes among the "dupes" to be significantly higher than in the rest of crashes (due to people going into a cycle of trying to start Firefox and crashing, trying to restart, etc.) - but it would be interesting to see actual data on that.
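To check that expectation against actual data, one could compare the share of startup crashes among dupes with the share among all other crashes. The sketch below assumes a 3-minute cutoff (taken from the "within 3 minutes" remark in this thread) and hypothetical fields `uptime_seconds` and `is_dupe`; the sample rows are invented.

```python
STARTUP_WINDOW_SECONDS = 180  # assumed cutoff: "within 3 minutes" of startup


def startup_share(crashes):
    """Return (startup share among dupes, startup share among non-dupes)."""
    counts = {True: [0, 0], False: [0, 0]}  # is_dupe -> [startup crashes, total]
    for crash in crashes:
        bucket = counts[crash["is_dupe"]]
        bucket[1] += 1
        if crash["uptime_seconds"] < STARTUP_WINDOW_SECONDS:
            bucket[0] += 1

    def share(pair):
        startup, total = pair
        return startup / total if total else None

    return share(counts[True]), share(counts[False])


# Invented rows; hypothetical field names.
sample = [
    {"is_dupe": True, "uptime_seconds": 5},
    {"is_dupe": True, "uptime_seconds": 20},
    {"is_dupe": True, "uptime_seconds": 600},
    {"is_dupe": False, "uptime_seconds": 10},
    {"is_dupe": False, "uptime_seconds": 900},
    {"is_dupe": False, "uptime_seconds": 1200},
    {"is_dupe": False, "uptime_seconds": 3000},
]

dupe_share, other_share = startup_share(sample)
print(f"startup share among dupes: {dupe_share:.2f}, among others: {other_share:.2f}")
```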
> We'd expect the percentage of startup crashes among the "dupes" to be
> significantly higher than in the rest of crashes (due to people going into a
> cycle of trying to start Firefox and crashing,
We still need to figure out how many of the dups are "user initiated" vs. "non-user initiated", and under what conditions the "non-user initiated" crashes are happening, so there is more to learn here, I think, but I agree with your theory.
(In reply to comment #3)
> We'd expect the percentage of startup crashes among the "dupes" to be
> significantly higher than in the rest of crashes (due to people going into a
> cycle of trying to start Firefox and crashing, trying to restart, etc.) - but
> it would be interesting to see actual data on that.
This is what I am seeing in Thunderbird.
Blocks: 579136
Depends on: 629088
Chris, Robert, can you give me a list of reports you'd like me to run, including any filtering and grouping levels? I'm happy to do these, but I need some specifics for Laura to approve before I run them. Also, is CSV format OK with you, or do you want the data some other way? CSV, XML and Postgres tables are easy; other formats will require more time. Thanks!
Maybe the best way to do the reporting on this is to add it to the existing pub-crashdata and url .csv files, e.g. https://crash-analysis.mozilla.com/crash_analysis/20110502/20110502-pub-crashdata.csv.gz. That would allow us to try to correlate reports using lots of other crash metadata.
What's the action here?
Target Milestone: --- → 2.0
We added install age to the nightly .csv reports, but it would also be useful to flag in those reports the crashes that we mark as dups in the database. I suggest that we add a "dup" field to the .csv reports, as mentioned in comment 7. Then we can use that to start correlating dups against other crash metadata.
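If such a field were added, a first pass at correlating dups with signatures might look like the sketch below. The column names (`signature`, `install_age`, `dup`), the `"1"`/`"0"` encoding, and the comma delimiter are all assumptions for illustration; the real report layout may differ.

```python
import csv
import io
from collections import Counter


def dups_per_signature(csv_text, delimiter=","):
    """Count rows flagged as dups for each signature, most frequent first."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    counts = Counter()
    for row in reader:
        if row["dup"] == "1":  # assumed encoding of the hypothetical dup flag
            counts[row["signature"]] += 1
    return counts.most_common()


# Invented sample in a hypothetical layout.
sample_csv = """signature,install_age,dup
nsFoo::Bar,12,1
nsFoo::Bar,15,1
js::Baz,4000,0
js::Baz,30,1
"""

for signature, count in dups_per_signature(sample_csv):
    print(signature, count)
```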
(In reply to comment #9)
> I suggest that we add a "dup" field to the .csv reports
That's part of bug 655750.
Then I think we are done here, unless Josh has something he wants to do.
Actually, it's bug 658049 as the other one needed to be split into two parts.
Depends on: 658049
This isn't my bug, I have nothing I want to do. If you want me to run queries or do exports of any specific duplicate information, file another bug.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro