implement a map/reduce job to produce a list of modules from a day's worth of crash reports





(Reporter: ted, Assigned: xstevens)




Currently, to fill in missing symbols for Windows crash reports, I have a script that hits the Socorro Atom feed for the 500 most recent Windows crashes and looks at their module lists to find missing symbols.

This isn't great, since 500 crashes a day is a pretty small sample, and as bug 575817 indicates, we're still missing a lot of symbols. I'd like to instead use a map/reduce job to provide the input, since we should be able to run against a much larger sample set (like the entire day's worth of crashes).

I'm happy to help write the map/reduce job here, although I don't know the first thing about Hadoop. I also don't know exactly how we'd make the output available to my script for further processing. Just make it available via HTTP somehow?
The logic would be something like:

for every crash in the set:
    if this is not a Windows crash, skip it
    otherwise, take all lines starting with Module| from the raw dump, split them on '|', and insert fields 1, 3, and 4 (zero-indexed) as a row in the result set

for reducing, remove duplicate rows to get only unique rows in the output.
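(For illustration, the per-crash logic above could be sketched with standard Unix tools; this is just a stand-in for the eventual Hadoop job, and the sample file and Module lines are invented. It assumes the usual Breakpad pipe-delimited dump layout, `Module|filename|version|debug_file|debug_id|...`, so zero-indexed fields 1, 3, 4 are awk's $2, $4, $5, with `sort -u` playing the role of the dedup reduce step and blank-field rows dropped per the discussion below.)

```shell
# Hypothetical sample raw dump (made-up module names and debug IDs).
printf 'OS|Windows NT|6.1\n'                    >  raw_dump.txt
printf 'Module|xul.dll|1.9.2|xul.pdb|ABC123|\n' >> raw_dump.txt
printf 'Module|xul.dll|1.9.2|xul.pdb|ABC123|\n' >> raw_dump.txt
printf 'Module|evil.dll|0.0||\n'                >> raw_dump.txt

# Map: keep Module| lines, emit filename,debug_file,debug_id as CSV,
# skipping rows whose debug fields are blank.
# Reduce: sort -u deduplicates the rows.
grep '^Module|' raw_dump.txt \
  | awk -F'|' '$4 != "" && $5 != "" { print $2 "," $4 "," $5 }' \
  | sort -u
```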


9 years ago
Assignee: nobody → xstevens

Comment 2

9 years ago

I have something like this written already.  I'll modify it and clean it up a bit for you.

Comment 3

9 years ago
Do you want comma-delimited output or a different delimiter?  What about entries that have blank fields 3 and 4?  Other than these minor issues I've got a job that we can run.  We'll probably want to wrap a shell script around it to get the results off of Hadoop and put them somewhere.  It'll just be plain text files so you can do what you want with them from there.

CSV is fine. Entries with blank fields 3 and 4 can be dropped from the output. Thanks!

Comment 5

9 years ago
Hey Ted,

So here is some sample output:


If this looks good to you I can check this in and we can deploy this out somewhere so it can be executed on a daily basis.

This looks good. Looking at the output, though, I realize that you could drop column 1, since I don't actually need it. It looks like there are some DLLs there that change their name but not the other info (probably spyware/viruses), so dropping it should reduce the size of the output.

Comment 7

9 years ago
I could use column 1 for some stuff I'm doing, so if possible let's keep it.

It's not a big deal either way for me, just something I realized after seeing the duplicate data in the output in comment 5.

Comment 9

9 years ago
I'll leave in the DLL then.  9/12/2010 on production yielded a list of 47,366 entries.  Just going to include my series of commands here for recording purposes.  I can document this on the Socorro wiki or something if we want to later.

hadoop jar socorro-analysis-job.jar com.mozilla.socorro.hadoop.CrashReportModuleList -Dproduct.filter="Firefox" -Dos.filter="Windows NT" module-list-out
hadoop fs -getmerge module-list-out modulelist.txt
sort modulelist.txt -o modulelist.sorted
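
(Comment 3 mentioned wrapping a shell script around the job to pull results off of Hadoop. Since the hadoop commands above need a cluster, here is a hypothetical local stand-in for just the merge-and-sort half, which such a wrapper could build on; the part file names and their contents are invented.)

```shell
# Simulate the job's output directory with two made-up part files.
mkdir -p module-list-out
printf 'xul.dll,xul.pdb,ABC123\n'           > module-list-out/part-r-00000
printf 'kernel32.dll,kernel32.pdb,DEF456\n' > module-list-out/part-r-00001

# Stand-in for `hadoop fs -getmerge module-list-out modulelist.txt`.
cat module-list-out/part-r-* > modulelist.txt

# Same sort step as above.
sort modulelist.txt -o modulelist.sorted
```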

Comment 10

9 years ago
Going to set this to fixed.  Will work with Laura and team for deployment.
Last Resolved: 9 years ago
Resolution: --- → FIXED

Okay. If you file followup bug(s) on deployment, please make them block bug 575817.
Blocks: 598908

Comment 12

8 years ago
I forgot about this bug; I guess this is similar to the request in bug 634498.  Good to see it coming online soon.
Component: Socorro → General
Product: Webtools → Socorro