Closed Bug 521917 Opened 15 years ago Closed 15 years ago

need nightly crash correlation reports across all firefox releases.

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: chofmann, Assigned: aravind)

References

Details

(Whiteboard: [crashkill][crashkill-metrics])

Attachments

(1 file)

Currently, aravind provides dumps of zipped JSON files covering one day of crash reports for Firefox 3.5.3.

Then dbaron runs his scripts, which give us valuable information about how the crashes correlate with the .dll files that get loaded. The output is published here:

http://dbaron.org/mozilla/topcrash-modules
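
For readers unfamiliar with this kind of report, here is a minimal sketch of the underlying idea: for one crash signature, compare how often a module is loaded in crashes with that signature against how often it is loaded across all crashes. This is not dbaron's code. The function name, the input shape (a signature paired with a set of module names) and the ranking by difference in share are all assumptions made for illustration.

```python
from collections import Counter


def module_correlations(crashes, signature):
    """Rank modules by how over-represented they are for one signature.

    `crashes` is an iterable of (signature, set_of_module_names) pairs.
    This input shape is hypothetical, not the jsonz format.
    """
    crashes = list(crashes)
    in_sig = [mods for sig, mods in crashes if sig == signature]
    # Count each module once per crash (modules are given as sets).
    all_counts = Counter(m for _, mods in crashes for m in mods)
    sig_counts = Counter(m for mods in in_sig for m in mods)

    rows = []
    for module, n_sig in sig_counts.items():
        share_sig = n_sig / len(in_sig)                # share within this signature
        share_all = all_counts[module] / len(crashes)  # share across all crashes
        rows.append((share_sig - share_all, module, share_sig, share_all))

    rows.sort(reverse=True)  # most over-represented module first
    return [(module, s, a) for _, module, s, a in rows]
```

A module that appears in nearly every crash for a signature but rarely elsewhere is a candidate for closer inspection; a module that is equally common everywhere tells us little.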

We need to:
 get this process running every night on the previous day's worth of data,
and
 expand it across all Firefox releases, especially the latest and most populated 3.0.x release, which is now 3.0.14.

the reports should start out in the exact text form that dbaron is producing now.

dbaron's scripts are here.

http://hg.mozilla.org/users/dbaron_mozilla.com/crash-data-tools/
Summary: need → need nightly crash correlation reports across all firefox releases.
Ideally, we want this report run for all versions of all products. e.g.,

  * Firefox 3.0.x
  * Firefox 3.5.x
  * Thunderbird 3.0.x
  * SeaMonkey 2.0.x

If it's too much of a memory/CPU hit to do all versions, let's just get up the latest versions of each product. e.g., Firefox 3.0.14, Firefox 3.5.3, Thunderbird 3.0b4, and SeaMonkey 2.0 (shipping soon).
OS: Mac OS X → All
morgamic: This is something that we'd love to get up sooner rather than wait for the "new system" that metrics is working on. It's a pain to do this manually now because it requires getting the data from Aravind (something like 2gb worth), putting it on an internal machine, and running the scripts each time. Doing that daily by hand is hard, for obvious reasons.
Yeah, sooner, as in within a week, is the way we should try to think of this.  Maybe it's more an IT/aravind thing, since it's more about hooking up existing scripts on a production system.  Is that a good way to think about it?
Yeah, this belongs more with IT than webdev.  Transferring it to our group and taking it.
Assignee: nobody → server-ops
Component: Socorro → Server Operations
Product: Webtools → mozilla.org
QA Contact: socorro → mrz
Version: Trunk → other
Assignee: server-ops → aravind
We should be able to create views using this data that could integrate with the existing system.  If this gets prioritized above other stuff, we could start working on this first, but we're waiting on Daniel to give us the metrics team goals...
(In reply to comment #0)
> dbaron's scripts are here.
> 
> http://hg.mozilla.org/users/dbaron_mozilla.com/crash-data-tools/

The reports that I've been running are:

per-crash-core-count.py > YYYYMMDD-core-counts
per-crash-interesting-modules.py > YYYYMMDD-interesting-modules
per-crash-interesting-modules.py -v > YYYYMMDD-interesting-modules-with-versions
per-crash-interesting-modules.py -a > YYYYMMDD-interesting-addons
per-crash-interesting-modules.py -a -v > YYYYMMDD-interesting-addons-with-versions
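
A nightly job could pair each of these invocations with a dated output file along the following lines. This is only a sketch: the script names, flags and output suffixes are copied from the list above, but `REPORTS` and `report_plan` are hypothetical names, and how the scripts are handed their input data is not shown in this bug, so the sketch stops at building the command lines and file names.

```python
from datetime import date

# (script, flags, output suffix), exactly as listed in the comment above.
REPORTS = [
    ("per-crash-core-count.py", [], "core-counts"),
    ("per-crash-interesting-modules.py", [], "interesting-modules"),
    ("per-crash-interesting-modules.py", ["-v"], "interesting-modules-with-versions"),
    ("per-crash-interesting-modules.py", ["-a"], "interesting-addons"),
    ("per-crash-interesting-modules.py", ["-a", "-v"], "interesting-addons-with-versions"),
]


def report_plan(day):
    """Return (argv, output_filename) pairs for one day's reports."""
    stamp = day.strftime("%Y%m%d")  # the YYYYMMDD prefix used above
    return [([script] + flags, f"{stamp}-{suffix}")
            for script, flags, suffix in REPORTS]


if __name__ == "__main__":
    for argv, outfile in report_plan(date(2009, 10, 22)):
        print(" ".join(argv), ">", outfile)
```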
re: comment 5

the last incarnation of the "top 5 questions" to be answered by the metrics team about crash data lists this general question:

   Figuring out what's unique about a specific crash signature. Can we tell if the problem is plugin related or not?

and this more specific one:

   3. What non mozilla libraries (dll files) have been most referenced in crash stacks (in past 3/7/14/30 days)? 

dbaron's tools get us a very long way along the path of answering those questions, and just automating what he has to produce nightly reports  and expanding to branches will be tremendous progress for this quarter and will hit a goal target IMHO.  

There are 4 other big goals on Ken's list. I'd say these other 4 areas are big opportunities for web dev contributions to get other reporting we need in other aspects of understanding crashes.   

After knocking out some reports for those 4 other areas it would be good to circle back around to improve the views of this data beyond what dbaron has done so far.
Getting ready to work on this stuff.  How would you guys like to get these reports?  Do they contain sensitive material?  Or can I just dump them on people?
I don't think these reports contain sensitive information.  They are mostly (entirely?) correlations pulled out of public data.

Ideally, these reports would also be integrated into Socorro, so I can load the page about a signature and learn whether it is correlated with {multicore, extensions, modules, ...}.  And I'd be able to drill down, getting a list of matching crash reports at each level, and being able to tell whether a crash continues to be correlated with an extension even on single-core machines.  But now I'm dreaming ;)
The input data in the reports are actually all public (although pretty well hidden); the jsonz files that it uses as input are visible from every crash report page as a <link rel="alternate" />.  And in aggregate, it's less sensitive (to the individual users).
Jesse: We'd certainly like to do that eventually, but in the interest of getting these live sooner rather than later, we're doing the bare minimum.

Aravind: Could you have these reports published as text files daily, with the date in the name (something like "reportname.20091021.txt")? There's no private information in them. Dumping them on people is perfectly fine, though preferably web-visible.
Whiteboard: [crashkill]
Please take a look at http://people.mozilla.com/crash_analysis/20091022/.  The specific versions that I will be generating those reports for will depend on what's most popular.  I am just picking the two most popular versions for each of the products and generating those reports.

If that looks good, I will go ahead and cron this stuff.  The entire process takes around 4 hours, so it should be ready by 6:00 AM or so (PST).
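
The "two most popular versions per product" selection described above could look roughly like this. It is a sketch under assumptions: the comment does not say how popularity is measured, so this version assumes it means the number of crash reports per (product, version) pair, and `top_versions` is a hypothetical helper rather than the query actually used.

```python
from collections import Counter, defaultdict


def top_versions(crashes, per_product=2):
    """Pick the most frequently crashing versions of each product.

    `crashes` is an iterable of (product, version) pairs, one per report.
    """
    counts = Counter(crashes)
    by_product = defaultdict(list)
    # most_common() yields pairs from highest to lowest count.
    for (product, version), _count in counts.most_common():
        if len(by_product[product]) < per_product:
            by_product[product].append(version)
    return dict(by_product)
```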
That looks mostly good except the output of a few of them seems wrong. Anything under 1k is definitely not a full report, especially if it's a Firefox version. For SeaMonkey/Thunderbird is there a minimum threshold? That'd explain why some of the reports don't fully generate. I'll admit I haven't read the scripts...
My scripts only generate output for crash signatures with at least 10 incidents.
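
As an illustration of that threshold (not the actual code from the scripts), a filter of this shape would explain why low-volume products produce very short reports: signatures with fewer than 10 incidents simply drop out. `MIN_INCIDENTS` and `reportable_signatures` are hypothetical names.

```python
from collections import Counter

MIN_INCIDENTS = 10  # threshold stated in the comment above


def reportable_signatures(signatures, minimum=MIN_INCIDENTS):
    """Keep only signatures seen at least `minimum` times.

    `signatures` is an iterable with one crash signature per report.
    """
    counts = Counter(signatures)
    return {sig: n for sig, n in counts.items() if n >= minimum}
```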
That said, there's definitely something wrong.  The things that have a report "-with-versions" and one without should have pretty similar reports, except the "-with-versions" ones should have the versions listed; right now the ones without versions are blank while the ones with versions look reasonable.
Did the scripts throw an exception for some reason?  I'm guessing it's some python Unicode issue...
(In reply to comment #16)
> Did the scripts throw an exception for some reason?  I'm guessing it's some
> python Unicode issue...

Nope, no exceptions at all.  Maybe I messed something up with my changes?
I don't know how that could happen if there weren't exceptions thrown.

The changes you checked in are a long distance away from the output code; they're all in the crash-reading code.
Here is the output from a full run of the scripts.  No exceptions, a few messages about bad jsonz files, but no exceptions.
Oh wait.. it looks like this automated run from the script seems to have the right results.  I must have screwed something up when I ran things manually the first time.  Please look at the reports now and let me know if they look okay.
Though I'm puzzled by the fact that both of these files:
20091022_Firefox_3.5.3-core-counts.txt
20091022_Firefox_3.5.3_core-counts.txt
are present (and that they're different sizes).
Sorry, those were left over from the previous run.  I cleaned those out.  They are different sizes because the uuids are the set of crashes from the last 24 hours (from when I run the queries), and not exactly the list of crashes from the last day.  I figured we only really care about general trends and not about the exact list from the last day.  That said, I am planning on kicking this query off around midnight, so the last day and the last 24 hours should be the same.  I am putting this job into cron now.
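
The difference between "the last 24 hours" and "the last day" can be sketched as two time windows. This is an illustration only: the function names are hypothetical, the timestamps are naive (time zones and daylight saving are ignored), and it does not reproduce the actual query.

```python
from datetime import datetime, timedelta


def rolling_window(run_at):
    """The 24 hours ending at the moment the query runs."""
    return run_at - timedelta(hours=24), run_at


def previous_day(run_at):
    """The full calendar day before the day the query runs."""
    end = run_at.replace(hour=0, minute=0, second=0, microsecond=0)
    return end - timedelta(days=1), end


if __name__ == "__main__":
    afternoon = datetime(2009, 10, 23, 15, 0)
    midnight = datetime(2009, 10, 23, 0, 0)
    print(rolling_window(afternoon) == previous_day(afternoon))  # windows differ
    print(rolling_window(midnight) == previous_day(midnight))    # windows coincide
```

Run mid-afternoon, the two windows cover different sets of crashes, which is why the leftover files differed in size. Run at midnight, they cover the same 24 hours.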
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Awesome! Thanks Aravind!
Status: RESOLVED → VERIFIED
Whiteboard: [crashkill] → [crashkill][crashkill-metrics]
Blocks: 464775
Depends on: 470827
No longer depends on: 470827
No longer blocks: 464775
Product: mozilla.org → mozilla.org Graveyard