Closed Bug 1124241 Opened 5 years ago Closed 5 years ago

Integrate jit-crash-categorize into Socorro

Categories

(Socorro :: Backend, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: lars)

References

()

Details

A few years ago dvander and I wrote a tool called jit-crash-categorize which could be run on minidumps of crashes that were inside our JavaScript JIT and bucket them more usefully, as they all wind up in a single bucket.

lars' work in bug 1121462 should make it easy to write a processor rule to invoke this tool on all JIT crashes and provide extra info on them as well as change the signature to get more useful bucketing.
see Bug 1165000 for attachments that show what crashes should be subject to this extra processing tool.

This will likely have to wait for Q3
Assignee: nobody → lars
lars asked me to expound upon the "change the signature" bit of comment 0.

Right now all JIT crashes wind up under a single signature: the C++ method where we call into JIT code. It's something like [@ EnterBaselineJIT ] right now (it can change across releases as the JIT engine gets refactored or we add new JIT layers). The actual crashes are in generated JIT code, so there can be a wide variety of root causes all lumped into the same signature.

The jit-crash-categorize, given a dump that represents a JIT crash, attempts to give a more useful bucket for the crash, such as "INSTRUCTION_POINTER_NOT_EXECUTABLE" if the instruction pointer in the crashing frame wound up in non-executable memory. What we'd like to do is for crashes with specific signatures (such as [@ EnterBaselineJIT ]), run this tool and use its output to split the crashes into more-specific buckets, perhaps simply by making the signature [@ jit | <reason> ], like perhaps [@ jit | INSTRUCTION_POINTER_NOT_EXECUTABLE ].
I'd like to get altered signatures eventually, but personally I'm not in a hurry to have that right now. Perhaps we could do that as a followup (especially if it increases the odds of being able to get _something_ running in Q2).
If changing the signature isn't easily doable then as a first pass just getting the data out of the tool into the processed crash would be a good start.
Depends on: 1171937
in a surprise move after realizing that the groundwork that I laid a few months ago made it dead simple, I've got a PR now for inclusion of this in processor rule set.  See https://github.com/mozilla/socorro/pull/2834

The only question is about determining which crashes to apply the jit-category external program.  I've coded it using the same criterion used in the attachment for Bug 1165000. Ted is in favor of just having a list of signatures in configuration that will trigger this external program.  I'm interested in a more automated solution so that we don't have to maintain a list in configuration. 

what are the needs of the consumers of this data? Please contribute to the discussion either here or on the PR itself.
As a consumer I have no preference. As a developer I suspect that the no-module criterion (i.e. bug 1165000) is going to need less maintenance over time.
(In reply to David Major [:dmajor] from comment #6)
> As a consumer I have no preference. As a developer I suspect that the
> no-module criterion (i.e. bug 1165000) is going to need less maintenance
> over time.

I'd be interested to know whether your sample of crashes from that bug wound up being entirely JIT crashes, or if not how many other things were mixed in.

I'm only wary about using the looser criteria here if we start changing crash signatures to bucket things based on it. If we do that it's entirely possible that we could start bucketing non-JIT crashes in with JIT crashes which would be bad. If we're simply going to be running jit-crash-categorize on the dumps and recording the output then it's not harmful to run it on some non-JIT crashes.
Ah, I see. That's a fair concern. There were a fair amount of non-JIT crashes in the sample.

I guess that's a point in favor of using EnterBaseline and friends. Like I said, as a consumer I don't mind :)
Commenting after talking with dmajor on irc.  Dmajor mentioned modifying the crash selection logic to code that included EnterBaseline on the stack.

I'd agree that's a good idea to try to narrow down cases where the crashes have a higher chance of being jit-related.  I'd suggest the following criteria:

1. EnterBaseline or EnterIon on stack (sometimes we enter Ion directly instead of going through baseline).
2. We'd like to avoid cases where the presence of EnterBaseline|EnterIon on stack is misleading.  For example, if we enter the jit, and then re-enter the interpreter, and enter the JS VM code, and wander down a long call-path there, and then crash.. chances are that the crash is unlikely to have been caused by the jit.

The selected stacks should fit one of the following profiles:

   <EnterBaseline|EnterIon> <raw-addr> <raw-addr> ... <raw-addr>

  This suggests a case where we entered jitcode, and stayed in jitcode while going down a deep callpath (hence the raw unmapped addresses on stack).

Alternatively,

  <EnterBaseline|EnterIon> <SomeFrame1> <SomeFrame2> ... <SomeFrameN>   (where N < some threshold).

If EnterBaseline or EnterIon shows up on stack, and there's a crash with a frame depth of less than, say, 8 frames, then it's suggestive of a crash induced by the jit.
Right. I've been a bit loose with the terminology. When I say EnterBaseline I mean "things that would currently produce an [@ EnterBaseline] or [@ EnterIon] signature", which is pretty much:

>    <EnterBaseline|EnterIon> <raw-addr> <raw-addr> ... <raw-addr>

I'm not sure how difficult it would be to add:

>   <EnterBaseline|EnterIon> <SomeFrame1> <SomeFrame2> ... <SomeFrameN>  
> (where N < some threshold).
Given that the stackwalker is completely unable to walk Ion stacks (due to not having frame pointers), I think we actually never have "EnterIon" on the stack that the processor sees.
I suggested 3 months ago that we should make a flag as part of the crash reporter to highlight if we are running under the jit.  A simple way, would be to look for the JitActivation sections, and synchronized modifications made to it with the data reserved by the crash reporter.

(In reply to Kannan Vijayan [:djvj] from comment #9)
>   <EnterBaseline|EnterIon> <SomeFrame1> <SomeFrame2> ... <SomeFrameN>  

This is highly debatable, as many functions are called directly from the Jit, and such frame might be real while we are no longer in the Jit.  On the other hand, I guess the question is whatever these frames got inferred by scanning the stack linearly the stack or not.
We have an annotation for "in GC" already that's pretty low-overhead (it just sets or unsets a bool):
https://hg.mozilla.org/mozilla-central/annotate/e10e2e8d8bf2/toolkit/crashreporter/nsExceptionHandler.h#l75

We could add the same thing for "in JIT code", and then use that annotation to trigger the analysis.
That sounds awesome, but I don't want to wait on it. I'd like to get some initial data rolling in so we can discuss the results at Whistler. Can we start with the EnterBaseline variant and follow up with the precise annotations later?
I have no opposition to using either the signature-based or simple "top frame is a bare address" method. It sounds like lars has already implemented the latter, so let's go with that and we can implement a more precise method (perhaps nbp's suggestion from comment 13) in a follow-up.
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/bc2e3a1cc4df3fb6dcf015f2422bd06f1cd9f754
Merge pull request #2834 from twobraids/jit-category

Fixes Bug 1124241 - added jit category external program
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/d09356e9ee0e2990df99b431ad40e17c7fea7b4e
Merge pull request #2847 from twobraids/more-JIT

Fixes Bug 1124241 - added  "no module" predicate
Blocks: 1268029
You need to log in before you can comment on or make changes to this bug.