Closed Bug 594777 Opened 14 years ago Closed 14 years ago

implement a map/reduce job to produce a list of modules from a day's worth of crash reports

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: xstevens)

References

Details

Currently, to fill in missing symbols for Windows crash reports, I have a script that hits the Socorro ATOM feed for the 500 most recent Windows crashes, and looks at their module lists to find missing symbols.

This isn't great, since 500 crashes a day is a pretty small sample, and as bug 575817 indicates, we're still missing a lot of symbols. I'd like to instead use a map/reduce job to provide the input, since we should be able to run against a much larger sample set (like the entire day's worth of crashes).

I'm happy to help write the map/reduce job here, although I don't know the first thing about Hadoop. I also don't know exactly how we'd make the output available to my script for further processing. Just make it available via HTTP somehow?
The logic would be something like:

for every crash in the set:
  if this is not a Windows crash, skip it
  otherwise, take all lines starting with Module| from the raw dump, split them on '|', and insert fields 1,3,4 (zero-indexed) as a row in the result set

for reducing, remove duplicate rows to get only unique rows in the output.
Assignee: nobody → xstevens
Ted-,

I have something like this written already.  I'll modify it and clean it up a bit for you.
Do you want comma-delimited output or a different delimiter?  What about entries that have blank fields 3 and 4?  Other than these minor issues, I've got a job that we can run.  We'll probably want to wrap a shell script around it to get the results off of Hadoop and put them somewhere.  The results will just be plain text files, so you can do what you want with them from there.
CSV is fine. Entries with blank fields 3 and 4 can be dropped from the output. Thanks!
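
The logic discussed so far (Windows crashes only, fields 1, 3 and 4 of each Module| line, unique rows, CSV output) can be sketched in plain Python as below. This is only an illustration, not the Hadoop job itself. The function names are hypothetical, the check for an "OS|Windows" line in the raw dump is an assumption about the dump format, and a row is dropped only when fields 3 and 4 are both blank, which is one reading of the comment above.

```python
import csv
import io


def map_crash(raw_dump):
    """Yield (filename, debug_file, debug_id) for each module of a Windows crash."""
    lines = raw_dump.splitlines()
    # Assumption: the dump reports the OS on a line such as "OS|Windows NT|...".
    if not any(line.startswith("OS|Windows") for line in lines):
        return  # not a Windows crash, skip it
    for line in lines:
        if not line.startswith("Module|"):
            continue
        fields = line.split("|")
        if len(fields) < 5:
            continue  # malformed module line
        filename, debug_file, debug_id = fields[1], fields[3], fields[4]
        # Drop entries whose fields 3 and 4 are both blank.
        if not (debug_file or debug_id):
            continue
        yield (filename, debug_file, debug_id)


def to_csv(rows):
    """Reduce step: keep unique rows only and write them as sorted CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(sorted(set(rows)))
    return buf.getvalue()
```

Calling to_csv() on the rows collected from map_crash() over every raw dump in the set gives the same shape of output as the job described here, although the real job distributes both steps across the cluster.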
Hey Ted,

So here is some sample output:

11.2.9117.0.nmcorePS.dll,nmcorePS.pdb,73387E65FD5D4F3A9B7CC306CBAF3CA41
1445070.dll,FirefoxExt35.pdb,1F888C3FB0FC426D9384F92307B382501
228078g07.dll,DGJR.pdb,A40D46D92CA54937A8520538E0A17D773
3gppttrenderer.dll,3gppttrenderer.pdb,780BC6C76FD04AC49E916DD05D78A8705
813250m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
821109m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
836312m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
ACE.dll,ACE.pdb,B477FA3428A740B7A0C625CB561F7A1C1
ACE.dll,ACE.pdb,FAD341203BFF471B84128EFF1FBF0F561
ACTIVEDS.DLL,activeds.pdb,3B7DE0562

If this looks good to you I can check this in and we can deploy this out somewhere so it can be executed on a daily basis.
This looks good. Looking at the output, though, I realize that you could drop column 1, since I don't actually need it. It looks like there are some DLLs there that change their name but not the other info (probably spyware/viruses), so it should reduce the size of the output.
I could use column 1 for some stuff I'm doing, so if possible let's keep it.
It's not a big deal either way for me, just something I realized after seeing the duplicate data in the output in comment 5.
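
The duplication mentioned above can be checked against a few rows from the sample output. The sketch below drops column 1 on the consumer side and keeps only unique (debug file, debug ID) pairs. The function name is hypothetical and the approach is just one way a downstream script might handle it.

```python
import csv
import io

# Rows copied from the sample output earlier in this bug.
SAMPLE = """\
813250m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
821109m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
836312m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
ACE.dll,ACE.pdb,B477FA3428A740B7A0C625CB561F7A1C1
ACE.dll,ACE.pdb,FAD341203BFF471B84128EFF1FBF0F561
"""


def unique_debug_pairs(csv_text):
    """Return sorted unique (debug_file, debug_id) pairs, ignoring column 1."""
    reader = csv.reader(io.StringIO(csv_text))
    return sorted({(row[1], row[2]) for row in reader if len(row) >= 3})


pairs = unique_debug_pairs(SAMPLE)
print(len(pairs))  # the three MWDL.pdb rows collapse into one pair
```

Since the DLL name stays in the job output, a script that only needs the symbol information can apply this kind of filter locally.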
I'll leave the DLL name in, then.  A run against 9/12/2010 on production yielded a list of 47,366 entries.  I'm including my series of commands here for the record.  I can document this on the Socorro wiki or somewhere similar later if we want.

hadoop jar socorro-analysis-job.jar com.mozilla.socorro.hadoop.CrashReportModuleList -Dproduct.filter="Firefox" -Dos.filter="Windows NT" -Dstart.date=20100912 -Dend.date=20100912 module-list-out
hadoop fs -getmerge module-list-out modulelist.txt
sort modulelist.txt -o modulelist.sorted
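
For a daily run, the three commands above could be wrapped in a small script. The following is a rough sketch only: the function names and default file names are hypothetical, and the command arguments are copied from the commands above rather than from any documentation of the job.

```python
import subprocess

JAR = "socorro-analysis-job.jar"
JOB_CLASS = "com.mozilla.socorro.hadoop.CrashReportModuleList"


def build_commands(day, out_dir="module-list-out",
                   merged="modulelist.txt", sorted_file="modulelist.sorted"):
    """Return the three command lines for one day, given as YYYYMMDD."""
    return [
        # Run the map/reduce job for a single day of Firefox Windows crashes.
        ["hadoop", "jar", JAR, JOB_CLASS,
         "-Dproduct.filter=Firefox", "-Dos.filter=Windows NT",
         f"-Dstart.date={day}", f"-Dend.date={day}", out_dir],
        # Merge the job's output files into one local file.
        ["hadoop", "fs", "-getmerge", out_dir, merged],
        # Sort the merged list.
        ["sort", merged, "-o", sorted_file],
    ]


def run_daily(day):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_commands(day):
        subprocess.run(cmd, check=True)
```

Passing each command as an argument list avoids shell quoting problems with the "Windows NT" filter value. run_daily("20100912") would reproduce the run recorded above, assuming the hadoop client and the jar are available on the machine.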
Going to set this to fixed.  Will work with Laura and team for deployment.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Okay. If you file followup bug(s) on deployment, please make them block bug 575817.
Blocks: 598908
I forgot about this bug; I guess this is similar to the request in bug 634498. Good to see it coming online soon.
Component: Socorro → General
Product: Webtools → Socorro