594777 - implement a map/reduce job to produce a list of modules from a day's worth of crash reports

Reporter

Description


14 years ago

Currently, to fill in missing symbols for Windows crash reports, I have a script that hits the Socorro ATOM feed for the 500 most recent Windows crashes, and looks at their module lists to find missing symbols. This isn't great, since 500 crashes a day is a pretty small sample, and as bug 575817 indicates, we're still missing a lot of symbols. I'd like to instead use a map/reduce job to provide the input, since we should be able to run against a much larger sample set (like the entire day's worth of crashes). I'm happy to help write the map/reduce job here, although I don't know the first thing about Hadoop. I also don't know exactly how we'd make the output available to my script for further processing. Just make it available via HTTP somehow?

(not currently active) Ted Mielczarek

Reporter

Comment 1


14 years ago

The logic would be something like:

for every crash in the set:
  if this is not a Windows crash, skip it
  otherwise, take all lines starting with Module| from the raw dump, split them on '|', and insert fields 1,3,4 (zero-indexed) as a row in the result set

for reducing, remove duplicate rows to get only unique rows in the output.
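The steps above can be sketched in Python as follows. This is only an illustration of the described logic, not the Hadoop job itself: the function names `map_crash` and `reduce_modules` are hypothetical, and it assumes each crash is available as an OS name string plus the raw dump text. The labels given to fields 1, 3 and 4 (file name, debug file, debug ID) are inferred from the sample output later in this bug.

```python
def map_crash(os_name, raw_dump):
    """Yield one (filename, debug_file, debug_id) row per Module| line.

    os_name and raw_dump are assumed inputs; the real job reads them
    from wherever the crash reports are stored.
    """
    # Skip anything that is not a Windows crash.
    if not os_name.startswith("Windows"):
        return
    for line in raw_dump.splitlines():
        if not line.startswith("Module|"):
            continue
        fields = line.split("|")
        if len(fields) < 5:
            continue  # malformed line with too few fields
        # Fields 1, 3 and 4, zero-indexed.
        yield (fields[1], fields[3], fields[4])


def reduce_modules(rows):
    """Remove duplicate rows so each unique row appears once."""
    return sorted(set(rows))
```

Feeding every crash through `map_crash` and passing the combined rows to `reduce_modules` gives the unique module list for the day.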

Xavier Stevens [:xstevens]

Assignee

Updated


14 years ago

Assignee: nobody → xstevens

Xavier Stevens [:xstevens]

Assignee

Comment 2


14 years ago

Ted, I have something like this written already. I'll modify it and clean it up a bit for you.

Xavier Stevens [:xstevens]

Assignee

Comment 3


14 years ago

Do you want comma-delimited output or a different delimiter? What about entries that have blank fields 3 and 4? Other than these minor issues I've got a job that we can run. We'll probably want to wrap a shell script around it to get the results off of Hadoop and put them somewhere. They'll just be plain text files, so you can do what you want with them from there.

(not currently active) Ted Mielczarek

Reporter

Comment 4


14 years ago

CSV is fine. Entries with blank fields 3 and 4 can be dropped from the output. Thanks!
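A minimal sketch of that output rule, for illustration only. The helper name `write_module_csv` is hypothetical, and it assumes a row is dropped when either field 3 or field 4 is blank; the thread does not say whether both must be blank.

```python
import csv
import io


def write_module_csv(rows, out):
    """Write (filename, debug_file, debug_id) rows to out as CSV.

    Rows with a blank debug_file or debug_id (fields 3 and 4 of the
    Module| line) are dropped. Returns the number of rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    written = 0
    for filename, debug_file, debug_id in rows:
        if not debug_file or not debug_id:
            continue  # assumption: either blank field disqualifies the row
        writer.writerow([filename, debug_file, debug_id])
        written += 1
    return written


# Usage: write to an in-memory buffer instead of a real file.
buf = io.StringIO()
count = write_module_csv(
    [("a.dll", "a.pdb", "ID1"), ("b.dll", "", ""), ("c.dll", "c.pdb", "")],
    buf,
)
```

Using the `csv` module rather than joining on commas means a module name that itself contains a comma would be quoted instead of corrupting the row.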

Xavier Stevens [:xstevens]

Assignee

Comment 5


14 years ago

Hey Ted,

So here is some sample output:

11.2.9117.0.nmcorePS.dll,nmcorePS.pdb,73387E65FD5D4F3A9B7CC306CBAF3CA41
1445070.dll,FirefoxExt35.pdb,1F888C3FB0FC426D9384F92307B382501
228078g07.dll,DGJR.pdb,A40D46D92CA54937A8520538E0A17D773
3gppttrenderer.dll,3gppttrenderer.pdb,780BC6C76FD04AC49E916DD05D78A8705
813250m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
821109m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
836312m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
ACE.dll,ACE.pdb,B477FA3428A740B7A0C625CB561F7A1C1
ACE.dll,ACE.pdb,FAD341203BFF471B84128EFF1FBF0F561
ACTIVEDS.DLL,activeds.pdb,3B7DE0562

If this looks good to you I can check this in and we can deploy this out somewhere so it can be executed on a daily basis.

(not currently active) Ted Mielczarek

Reporter

Comment 6


14 years ago

This looks good. Looking at the output, though, I realize that you could drop column 1, since I don't actually need it. It looks like there are some DLLs there that change their name but not the other info (probably spyware/viruses), so it should reduce the size of the output.
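To illustrate the size reduction, here is a small sketch that drops column 1 from a few of the sample rows in comment 5 and keeps only the unique (pdb, id) pairs. The function name `unique_symbol_keys` is hypothetical, and splitting from the right assumes any stray commas would appear in the DLL name rather than in the other two columns.

```python
SAMPLE = """813250m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
821109m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
836312m16t.dll,MWDL.pdb,449BF30EF5BB4500A4696E80237C51DD2
ACE.dll,ACE.pdb,B477FA3428A740B7A0C625CB561F7A1C1
ACE.dll,ACE.pdb,FAD341203BFF471B84128EFF1FBF0F561"""


def unique_symbol_keys(lines):
    """Return the sorted unique (debug_file, debug_id) pairs, ignoring column 1."""
    keys = set()
    for line in lines:
        parts = line.rsplit(",", 2)  # split from the right: name, pdb, id
        if len(parts) != 3:
            continue  # skip lines that do not have three columns
        keys.add((parts[1], parts[2]))
    return sorted(keys)


keys = unique_symbol_keys(SAMPLE.splitlines())
print(len(SAMPLE.splitlines()), "rows ->", len(keys), "unique keys")
# prints "5 rows -> 3 unique keys"
```

The three renamed MWDL DLLs collapse into one entry, while the two ACE.dll rows stay separate because their identifiers differ.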

chris hofmann

Comment 7


14 years ago

I could use column 1 for some stuff I'm doing, so if possible let's keep it.

(not currently active) Ted Mielczarek

Reporter

Comment 8


14 years ago

It's not a big deal either way for me, just something I realized after seeing the duplicate data in the output in comment 5.

Xavier Stevens [:xstevens]

Assignee

Comment 9


14 years ago

I'll leave the DLL in, then. Running against 9/12/2010 on production yielded a list of 47,366 entries.

Just going to include my series of commands here for recording purposes. I can document this on the Socorro wiki or something if we want to later.

hadoop jar socorro-analysis-job.jar com.mozilla.socorro.hadoop.CrashReportModuleList -Dproduct.filter="Firefox" -Dos.filter="Windows NT" -Dstart.date=20100912 -Dend.date=20100912 module-list-out
hadoop fs -getmerge module-list-out modulelist.txt
sort modulelist.txt -o modulelist.sorted
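For a daily run, the three commands above could be wrapped in a small script. The following Python sketch is one hypothetical way to do that; `build_commands` and `run_daily` are illustrative names, the arguments are copied from the commands in this comment, and it assumes `hadoop` and `sort` are on the PATH and that the jar is in the working directory.

```python
import subprocess


def build_commands(day, out_dir="module-list-out"):
    """Return the three command lines for one day, given as YYYYMMDD."""
    job = [
        "hadoop", "jar", "socorro-analysis-job.jar",
        "com.mozilla.socorro.hadoop.CrashReportModuleList",
        "-Dproduct.filter=Firefox",
        "-Dos.filter=Windows NT",
        "-Dstart.date=" + day,
        "-Dend.date=" + day,
        out_dir,
    ]
    # Merge the job's output files into one local text file, then sort it.
    merge = ["hadoop", "fs", "-getmerge", out_dir, "modulelist.txt"]
    sort = ["sort", "modulelist.txt", "-o", "modulelist.sorted"]
    return [job, merge, sort]


def run_daily(day):
    """Run the commands in order, stopping at the first failure."""
    for cmd in build_commands(day):
        subprocess.run(cmd, check=True)
```

Passing each command as an argument list avoids shell quoting, which is why the filter values carry no quotation marks here. Calling `run_daily("20100912")` would repeat the run recorded above.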

Xavier Stevens [:xstevens]

Assignee

Comment 10


14 years ago

Going to set this to fixed. Will work with Laura and team for deployment.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

(not currently active) Ted Mielczarek

Reporter

Comment 11


14 years ago

Okay. If you file followup bug(s) on deployment, please make them block bug 575817.

(not currently active) Ted Mielczarek

Reporter

Updated


14 years ago

Blocks: 598908

chris hofmann

Comment 12


14 years ago

I forgot about this bug; I guess this is similar to the request in bug 634498. Good to see it coming online soon.

Nobody; OK to take it and work on it

Updated


13 years ago

Component: Socorro → General

Product: Webtools → Socorro


Categories

(Socorro :: General, task)

Tracking

(Not tracked)

People

(Reporter: ted, Assigned: xstevens)


Security

(public)
