Closed Bug 1743704 Opened 1 year ago Closed 1 year ago

remove jit-crash-categorize

Categories

(Socorro :: Processor, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

Attachments

(1 file)

We're replacing stackwalker from minidump_stackwalk (with an underscore) with minidump-stackwalker (with a hyphen) from rust-minidump.

I had forgotten that jit-crash-categorize (with hyphens, not that there's any ambiguity here, but for completeness) also comes from minidump_stackwalk (with an underscore).

It's in the processor pipeline, but I don't know much about it.

What does it do? Does Socorro do anything useful with the output? Is the output surfaced anywhere? Is it used?

If it's still interesting, can we re-implement it in rust-minidump?

I looked at the code trying to understand what it was doing.

It seems that this code attempts to determine whether a crash is caused by some corrupted code or not.
This is blind-folded approach which walk over the x86 assembly, attempting to determine whether this is something which is never produced by our JIT, returning CRASH_CORRUPT_CODE, or a bad operand or bad branch target CODE_BAD_BRANCH_TARGET.

Technically this is an approximation which has many flaws. x86 is not a fixed size assembly, thus our page start is not guaranteed to start at an instruction boundary. Then walking these random bytes might yield to decoded gibberish, which could also be identified as corrupted code. So we have a risk of having wrongly reported corrupted code. Our JIT code is patched on invalidation, which can yield to bad results as well.

Then this code bit-rot since it was created. We do generate adc/sbb instructions, which were reported as corrupted code, now exists because we can manipulate 64 bits values in WASM.

The intent behind the current code is good, however, the likelyhood of having correct code being reported as incorrect is high.
I would be happy to help revive something similar in the future, and potentially improve upon it. There is currently some investigation made by the JS team to see how we can refine our crash-stats reports to make more of them actionable.

This context is incredibly helpful!

It also tells me that the current jit-crash-classifier isn't good enough and that I should remove it from the Socorro pipeline. Then we can focus our efforts on something new that we'll use in the future.

Does that sound right?

For historical context, see bug 1124241.

Oh, bug 1268029 shows where it was actually wired up in prod at one point! Also some more reasoning in bug 764223.

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #2)

Does that sound right?

That sounds right to me.

Rewrote the summary per comment #5.

I'm going to grab this now because reducing CI/test times would help a lot right now.

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Summary: what is jit-crash-categorize and do we use it? → remove jit-crash-categorize

I deployed this just now in bug #1756845. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.