Closed Bug 938145 Opened 11 years ago Closed 2 years ago

EHABIStackWalk get stuck in ~15% of leafs

Categories

(Core :: Gecko Profiler, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: BenWa, Assigned: jld)

References

Details

I notice that ~15% (This will vary greatly I imagine on what's running) samples are not fully unwinding and are instead getting stuck in the leaf.

From IRC:
jld: Those could be prologue/epilogue samples?  I didn't think they were as common as that.  If I had the non-symbolicated profile and the libxul.so I could try to figure out if that's it. 

Let me attach those here.
These seem to be no attachments on this bug at the moment.
Flags: needinfo?(bgirard)
This profile has about 8% failure in the content thread:
http://people.mozilla.org/~bgirard/cleopatra/#report=1affded3339530bc11e20a1c67f243e74100de11

But there's a lot of idle time so I could get that number higher with more activity.

I think a large chunk is the time spent in pixman, fixing that would be nice but not high priority. There's a lot of junk frames in the failed stack like '0' and random invalid address like '062650xbeb2ec40' which we should really fix.
Flags: needinfo?(bgirard)
Assignee: nobody → jld
Here are two of me typing rapidly on the default keyboard, which has a lot of deep call stacks in C++ code; the second has the patch from bug 914561 for unwinding past jitcode:

https://people.mozilla.org/~bgirard/cleopatra/#report=cdbb553021ccc0b4bd4c505c3fd9eec591aa6cd0
https://people.mozilla.org/~bgirard/cleopatra/#report=98cdc579a9e60b9c471016c09e38f9aa4ec60c88

If I subtract the pixmain/cairo/etc. failures (which could be fixed by annotating the assembly source), there's still about 5-10% that we're missing.  Are we willing to live with that, at least for now?

(I'm assuming the longer-term plan would involve optimizing the memory use of the unwinder from bug 938157 to make it usable as a replacement, so that we don't need to try to add prologue/epilogue heuristics to EHABIStackWalk.)
Flags: needinfo?(bgirard)
(In reply to Jed Davis [:jld] from comment #3)
> If I subtract the pixmain/cairo/etc. failures (which could be fixed by
> annotating the assembly source)

jrmuizel has a fix for that which he will be commiting.

A fix for bug 914561 and jeff changes should be good enough to resolve this bug. However your second profile I still see unwinds not resuming after the RunScript. I though we could resume unwinds after each RunScript?
Flags: needinfo?(bgirard)
(In reply to Benoit Girard (:BenWa) from comment #4)
> However your second profile I still see unwinds not resuming after the
> RunScript. I though we could resume unwinds after each RunScript?

It seems to be working: look for the EnterJIT pseudostack frames.  Some of those |RunScript|s are for the interpreter.  It looks like most of them are cases where we fail to unwind from the leaf, but there's one exception that I see: the stack that allegedly ends in |JSFunction::environment() const|, or 0x41c6ddf0 (file-relative 0x12c1df0).  There are two bugs there:

1. It's actually the first instruction of js::InterpreterStack::pushInvokeFrame — I fixed the symbolication script to clear the Thumb bit and subtract 1, so we look up the call instruction instead of the instruction afterwards, but this is wrong for the leaf.  There isn't a good fix unless we have symbolication change the numbers as well as adding symbolication table entries.

2. The incorrect unwinding of that function (we haven't executed its first instruction yet, but we unwind its 12-word frame) happens to hit a false stack that's still unwindable for four frames.  This is similar to what happens with Breakpad's stack scanning, except it's less common because we're effectively just checking one word for being a usable return address, not scanning a range of them.  (This case would also defeat the proposed ideas of heuristically detecting prologue/epilogue samples via failing to get a valid PC from unwinding.)

The profiler is now using JIT information to restart stack traces when they get lost.
I'll mark this old bug fixed.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.