938145 - EHABIStackWalk get stuck in ~15% of leafs

Reporter

Description

•

11 years ago

I notice that ~15% (This will vary greatly I imagine on what's running) samples are not fully unwinding and are instead getting stuck in the leaf. From IRC: jld: Those could be prologue/epilogue samples? I didn't think they were as common as that. If I had the non-symbolicated profile and the libxul.so I could try to figure out if that's it. Let me attach those here.

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 1

•

11 years ago

These seem to be no attachments on this bug at the moment.

Flags: needinfo?(bgirard)

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

11 years ago

Blocks: 926610

Benoit Girard (:BenWa)

Reporter

Comment 2

•

11 years ago

This profile has about 8% failure in the content thread: http://people.mozilla.org/~bgirard/cleopatra/#report=1affded3339530bc11e20a1c67f243e74100de11 But there's a lot of idle time so I could get that number higher with more activity. I think a large chunk is the time spent in pixman, fixing that would be nice but not high priority. There's a lot of junk frames in the failed stack like '0' and random invalid address like '062650xbeb2ec40' which we should really fix.

Flags: needinfo?(bgirard)

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Updated

•

11 years ago

Assignee: nobody → jld

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 3

•

11 years ago

Here are two of me typing rapidly on the default keyboard, which has a lot of deep call stacks in C++ code; the second has the patch from bug 914561 for unwinding past jitcode: https://people.mozilla.org/~bgirard/cleopatra/#report=cdbb553021ccc0b4bd4c505c3fd9eec591aa6cd0 https://people.mozilla.org/~bgirard/cleopatra/#report=98cdc579a9e60b9c471016c09e38f9aa4ec60c88 If I subtract the pixmain/cairo/etc. failures (which could be fixed by annotating the assembly source), there's still about 5-10% that we're missing. Are we willing to live with that, at least for now? (I'm assuming the longer-term plan would involve optimizing the memory use of the unwinder from bug 938157 to make it usable as a replacement, so that we don't need to try to add prologue/epilogue heuristics to EHABIStackWalk.)

Flags: needinfo?(bgirard)

Benoit Girard (:BenWa)

Reporter

Comment 4

•

11 years ago

(In reply to Jed Davis [:jld] from comment #3) > If I subtract the pixmain/cairo/etc. failures (which could be fixed by > annotating the assembly source) jrmuizel has a fix for that which he will be commiting. A fix for bug 914561 and jeff changes should be good enough to resolve this bug. However your second profile I still see unwinds not resuming after the RunScript. I though we could resume unwinds after each RunScript?

Flags: needinfo?(bgirard)

Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧

Assignee

Comment 5

•

11 years ago

(In reply to Benoit Girard (:BenWa) from comment #4) > However your second profile I still see unwinds not resuming after the > RunScript. I though we could resume unwinds after each RunScript? It seems to be working: look for the EnterJIT pseudostack frames. Some of those |RunScript|s are for the interpreter. It looks like most of them are cases where we fail to unwind from the leaf, but there's one exception that I see: the stack that allegedly ends in |JSFunction::environment() const|, or 0x41c6ddf0 (file-relative 0x12c1df0). There are two bugs there: 1. It's actually the first instruction of js::InterpreterStack::pushInvokeFrame — I fixed the symbolication script to clear the Thumb bit and subtract 1, so we look up the call instruction instead of the instruction afterwards, but this is wrong for the leaf. There isn't a good fix unless we have symbolication change the numbers as well as adding symbolication table entries. 2. The incorrect unwinding of that function (we haven't executed its first instruction yet, but we unwind its 12-word frame) happens to hit a false stack that's still unwindable for four frames. This is similar to what happens with Breakpad's stack scanning, except it's less common because we're effectively just checking one word for being a usable return address, not scanning a range of them. (This case would also defeat the proposed ideas of heuristically detecting prologue/epilogue samples via failing to get a valid PC from unwinding.)

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 6

•

3 years ago

The profiler is now using JIT information to restart stack traces when they get lost.
I'll mark this old bug fixed.

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → FIXED

Bugzilla

EHABIStackWalk get stuck in ~15% of leafs

Categories

(Core :: Gecko Profiler, defect)

Tracking

()

People

(Reporter: BenWa, Assigned: jld)

References

Details

Crash Data

Security

(public)

User Story

Description

Comment 1

Updated

Comment 2

Updated

Comment 3

Comment 4

Comment 5

Comment 6