Closed Bug 627111 Opened 10 years ago Closed 10 years ago

Run a Map/Reduce to better categorize MethodJIT crashes

Categories

(Socorro :: General, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: aphadke)

Attachments

(1 file, 1 obsolete file)

In bug 595351 comment 61, dvander had some ideas about how to categorize the set of MethodJIT crashes into more approachable buckets. I pointed him at some of the one-off tools I've written using the Breakpad processor APIs:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/get-minidump-instructions/
http://hg.mozilla.org/users/tmielczarek_mozilla.com/check-minidump-instruction-pointer/
http://hg.mozilla.org/users/tmielczarek_mozilla.com/exploitable/

as well as a sample of the x86 disassembler that Breakpad includes:
http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/exploitability_win.cc#205

I think we can fairly easily write a C++ tool to categorize any given MethodJIT crash given its minidump. dvander is going to look into writing said tool, and once finished we'd like to run it on a sample of crashes (TBD).
Actually, we'll reassign to aphadke when we have the tool ready.
Assignee: aphadke → dvander
First cut at such a tool:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/jit-crash-categorize/

dvander is making it more useful.
I just pushed a few changes to the tool. The first set of changes can correctly categorize the initial set of 13 crash dumps I looked at. 

Dave Mandelin gave me a tool to download a large batch of mdmps. I ran it on about 100 and made some more tweaks. Likely it'll need further refinement, but this is a good start to filter out things that I can't really look at.

The heuristics are pretty simple. First, it looks at each byte of the code stream until it finds a sane instruction. Then it disassembles normally. Anything weird (like huge offsets or crazy instructions) is flagged as corrupt. If it hits EIP without finding a sane instruction, that is the first crash category. If it never finds EIP, that is the second crash category.

If it finds EIP, and encountered a corrupt instruction, that is the third crash category. Finally, if it finds EIP, and found no corrupt instructions, that is the last crash category.
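For anyone following along, here is a rough Python sketch of those four outcomes. It is not the tool's code (the real tool is C++ on top of Breakpad's disassembler). The category labels, the `HUGE_OFFSET` threshold, the toy three-opcode decoder, and the choice to resync one byte after an undecodable byte are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical labels for the four outcomes; NOT the tool's real bucket names.
EIP_BEFORE_SANE = "EIP_BEFORE_SANE_INSTRUCTION"
EIP_NOT_REACHED = "EIP_NOT_ON_INSTRUCTION_BOUNDARY"
CORRUPT_BEFORE_EIP = "CORRUPT_INSTRUCTION_BEFORE_EIP"
CLEAN_UP_TO_EIP = "NO_CORRUPTION_BEFORE_EIP"

# A decoder returns (length, looks_corrupt), or None if the bytes do not decode.
Decoder = Callable[[bytes, int], Optional[Tuple[int, bool]]]

HUGE_OFFSET = 1 << 24  # assumed threshold for a "huge" branch offset


def toy_decode(code: bytes, pos: int) -> Optional[Tuple[int, bool]]:
    """Toy decoder for three real x86 opcodes; anything else is undecodable."""
    op = code[pos]
    if op == 0x90:  # NOP, 1 byte
        return 1, False
    if op in (0xB8, 0xE9):  # MOV EAX, imm32 / JMP rel32, 5 bytes each
        if pos + 5 > len(code):
            return None  # truncated instruction
        if op == 0xE9:
            rel = int.from_bytes(code[pos + 1:pos + 5], "little", signed=True)
            return 5, abs(rel) >= HUGE_OFFSET
        return 5, False
    return None


def categorize(code: bytes, eip: int, decode: Decoder = toy_decode) -> str:
    """Classify a code buffer; eip is the crash address as an offset into code."""
    pos = 0
    # Step 1: advance one byte at a time until a sane instruction decodes.
    while True:
        if pos >= eip or pos >= len(code):
            return EIP_BEFORE_SANE
        insn = decode(code, pos)
        if insn is not None and not insn[1]:
            break
        pos += 1
    # Step 2: disassemble normally, remembering anything that looks corrupt.
    saw_corrupt = False
    while pos < eip and pos < len(code):
        insn = decode(code, pos)
        if insn is None:
            saw_corrupt = True
            pos += 1  # assumed resync strategy: skip a single byte
        else:
            length, corrupt = insn
            saw_corrupt = saw_corrupt or corrupt
            pos += length
    if pos != eip:
        return EIP_NOT_REACHED  # stepped over EIP or ran out of bytes
    return CORRUPT_BEFORE_EIP if saw_corrupt else CLEAN_UP_TO_EIP
```

For example, `categorize(b"\xB8\x01\x00\x00\x00\x90", 2)` lands in the "never finds EIP" case, because the 5-byte MOV starting at offset 0 steps over offset 2.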
Okay, I think we're ready to give this a shot. My proposal is to get a full day's worth of minidumps having any of the signatures [@ EnterMethodJIT ] [@ js::mjit::EnterMethodJIT(JSContext*, JSStackFrame*, void*, js::Value*) ] [@ js::mjit::EnterMethodJIT ] from Firefox 4.0b10, and then we'll run the tool on it. aphadke: can you get us that sample of dumps? Do you want to just put them somewhere and have me run the tool on them afterwards, or what? Let me know how you'd like to go about the specifics here.
Assignee: dvander → aphadke
Ted - 
minidumps == raw byte encoded format?
Do you need anything else with it?
I can have each dump as an individual file and store it as <ooid>.dump?

Is it okay if we do this on Monday, or do we need it over the weekend?
Yes, the raw binary file (the .dump file) is what we need here. Each individual file in <ooid>.dump sounds perfect.

Monday should be okay, unless dvander or someone wants to do analysis over the weekend (I wasn't signing myself up for that!)
Ted - Do you have a sandbox where I can pass in the raw dump file location and expect some results? Given the binary nature of the data, it would be nice to test on a sample to confirm the dumps are valid before the full run.
Here are precompiled binaries, I built them on Ubuntu:
http://people.mozilla.com/~tmielczarek/categorize32.bz2 32-bit Linux
http://people.mozilla.com/~tmielczarek/categorize64.bz2 64-bit Linux

You just pass it the path to a dump file on the command line, and it should spit out a category on stdout. (stderr will be pretty noisy.)
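If it helps with batch runs, here is a minimal Python sketch of a driver that runs the binary over a directory of `<ooid>.dump` files and tallies the categories. It assumes only what is described above (dump path as the single argument, category on stdout, noisy stderr). The function name, the `NO_OUTPUT` fallback label, and the `./categorize64` path in the usage line are made up for illustration.

```python
import subprocess
from collections import Counter
from pathlib import Path


def tally_categories(dump_dir, categorizer_cmd):
    """Run the categorizer on every *.dump file and count the categories.

    categorizer_cmd is a list such as ["./categorize64"] (hypothetical path);
    the dump path is appended as the last argument.
    """
    counts = Counter()
    for dump in sorted(Path(dump_dir).glob("*.dump")):
        result = subprocess.run(
            [*categorizer_cmd, str(dump)],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,  # stderr is noisy, so discard it
            text=True,
        )
        # "NO_OUTPUT" is a made-up label for runs that print nothing.
        category = result.stdout.strip() or "NO_OUTPUT"
        counts[category] += 1
    return counts
```

Usage would be something like `tally_categories("dumps/", ["./categorize64"]).most_common()`, which gives the buckets sorted by count.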
Ted - I sent you the dumps in a separate email. These aren't specific to JIT, but they should confirm whether my MR job can correctly read binary data from HBase and write it to local disk as individual files.
Code committed to SVN under svn.mozilla.org/moco/metrics/aphadke_sandbox/src/com/mozilla/main/CategorizeCrashes.java rev. 396

Will be running the code on prod cluster later today (1/31)
Ted - I have the dumps for 1/30, Fx 4b10 for "EnterJIT", where should I upload them?
sample command line:

hadoop jar categorizeCrash.jar -Dstart.date=20110130 -Dend.date=20110131 -Dbrowser.version=4.0b10 -Dsignature.match=EnterMethodJIT output.2/

This retrieves the raw dumps (<ooid>.dump) for the 1/30-1/31 date range, product Firefox, version 4.0b10, that contain the signature "EnterMethodJIT" in "processed_data:json".

The date range, product, version, and signature can be changed on the command line.
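To make the selection criteria concrete, here is a small Python sketch of the kind of per-record filter those options describe. It is not the code in CategorizeCrashes.java. The record layout (`date`, `product`, `version`, `signature` keys), the inclusive end date, and the substring match against a signature field are all assumptions; the thread does not show the job's actual matching logic.

```python
from datetime import date, datetime


def parse_day(yyyymmdd: str) -> date:
    """Parse a date in the same YYYYMMDD form used by -Dstart.date / -Dend.date."""
    return datetime.strptime(yyyymmdd, "%Y%m%d").date()


def selected(record, start, end, version, signature_match, product="firefox"):
    """Return True if a (hypothetical) crash record matches the job options.

    Assumes the end date is inclusive and that signature_match is a plain
    substring test against the record's signature field.
    """
    return (
        parse_day(start) <= record["date"] <= parse_day(end)
        and record["product"].lower() == product.lower()
        and record["version"] == version
        and signature_match in record["signature"]
    )
```

With these assumptions, a record dated 2011-01-30 for Firefox 4.0b10 with signature `js::mjit::EnterMethodJIT` passes `selected(record, "20110130", "20110131", "4.0b10", "EnterMethodJIT")`.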

Ted - I'll create a wrapper script that will run the above MR job and also dump the output to a given location once aravind/jabba give us access to a common data store.
Preliminary results:

Total dumps in sample: 7927
Breakdown:
   4684 NO_JIT_MEMORY
   1392 CORRUPT_CODE
    979 UNKNOWN
    766 ERROR
     83 EIP_IN_BETWEEN
     22 BAD_EIP_INSTRUCTION
      1 NON_X86_WITH_JIT_MEMORY

Of the NO_JIT_MEMORY crashes, 4268 (91%) are from Windows XP / Windows Server 2003, which means that we'll probably have good data from most of them in b11 with bug 599301 fixed. I didn't check further, but I would hazard a guess that a good portion of the other 9% have an instruction pointer that's not in mapped memory. I can check that later.

Of the ERROR crashes, 757 of them (99%) are showing the error "unknown context type 0x0", so that bears more investigation.
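As a quick sanity check on the arithmetic, here is a short Python sketch that recomputes the sample total and the two percentages quoted above from the posted counts. The variable and function names are made up for illustration; the numbers are the ones in this comment.

```python
# Bucket counts from the preliminary run, as posted above.
first_run = {
    "NO_JIT_MEMORY": 4684,
    "CORRUPT_CODE": 1392,
    "UNKNOWN": 979,
    "ERROR": 766,
    "EIP_IN_BETWEEN": 83,
    "BAD_EIP_INSTRUCTION": 22,
    "NON_X86_WITH_JIT_MEMORY": 1,
}


def share(part: int, whole: int) -> int:
    """Percentage of whole represented by part, rounded to the nearest integer."""
    return round(100 * part / whole)


total = sum(first_run.values())                         # 7927
xp_2003_share = share(4268, first_run["NO_JIT_MEMORY"])  # 91
context_error_share = share(757, first_run["ERROR"])     # 99
```

The buckets do add up to the 7927 dumps in the sample, and both percentages match the rounded figures given.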

I'll attach the full ID -> bucket mapping in a minute.
I'm going to call this fixed, we'll file a new bug if we need to run a new analysis. Thanks aphadke!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
After analyzing a few of the ERROR crashes, I don't think these results are valid. It looks like the sample of crashes isn't all EnterMethodJIT crashes, just crashes where that's on the stack, which can be anywhere we're executing JS. We'll have to get this re-run.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Ted - please find dumps.tar.gz in the home folder created by jabba.
We got a total of 1808 crashes.
Okay, running on the new data, we have:
   1446 NO_JIT_MEMORY
    184 CORRUPT_CODE
     99 UNKNOWN
     61 EIP_IN_BETWEEN
     16 BAD_EIP_INSTRUCTION
      2 ERROR

Of the NO_JIT_MEMORY crashes, 1381 (96%) are from Windows XP and Win2k3. Of the 56 NO_JIT_MEMORY crashes that we had from Windows 7, 55 of them had an instruction pointer that was in unmapped memory. (Not sure what the one outlier is, I should see what happened there.)

I'll attach the full list of dump IDs->bucket in a minute.
Heh, the one NO_JIT_MEMORY outlier is actually a crash in the EnterMethodJIT function, not JIT code:
https://crash-stats.mozilla.com/report/index/846c83e9-6ea6-4299-bf43-d13cf2110130
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 633270
Component: Socorro → General
Product: Webtools → Socorro