Incomplete call stacks in perf
Categories
(Core :: Performance: General, defect)
Tracking
()
People
(Reporter: denispal, Assigned: denispal)
References
(Blocks 1 open bug)
Details
(Whiteboard: [sp3:p1])
I still see quite a few call stacks in perf that are incomplete, and often they seem to end at Trampoline code. Are we maybe missing some frame pointers in this code? I'm not sure how often we call into it.
Here's an example:
- 3.02% Isolated Web Co libxul.so [.] js::jit::SetElementMegamorphic
- js::jit::SetElementMegamorphic
- 2.97% Trampolines
0.55% Trampolines
- 1.93% Isolated Web Co libxul.so [.] js::NativeObject::addProperty
- js::NativeObject::addProperty
- 1.69% js::jit::SetElementMegamorphic
- Trampolines
0.67% Trampolines
+ 1.91% Isolated Web Co libxul.so [.] js::GCMarker::doMarking<0u>
- 1.20% Isolated Web Co jitted-1325237-1.so [.] Trampolines
Trampolines
Assignee | ||
Comment 1•3 years ago
|
||
Jan, Andre, do either of you have any thoughts on how to improve these stacks? Thanks!
Comment 2•3 years ago
|
||
I instrumented js::jit::SetElementMegamorphic
(and gc::Cell::header which is used nearly everywhere) to walk the frame pointer chain back to main
and ran a bunch of jsshell tests and it looks like trampolines and JIT code are doing the right thing. The trace you have seems suspicious with the nested trampoline calls.
I did try the same experiment on full browser but ran into some cases where external GTK frames are interleaved into event loop and that code does not have frame-pointers enabled. Kind of interesting but linux-only and pretty obvious if it happens.
There does seem to be a small edge-case here where we end intentionally clobber the frame pointer for a short moment when handling some WASM exceptions.
Here is the experiment I was doing to manually walk the stack using frame pointers (x64 linux only): https://hg.mozilla.org/try/rev/1c7de242e586280165a193ccf70e0a96a5d5e400
Overall I think this is just artifacts of perf
stack walking limit (which is I think 20 frames?). Since "Trampolines" is shared code for all main-thread JIT it just happens to be quite a popular frame to appear. Linux kernel 6.0+ seems to finally have support for extending the frame pointer limit https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/tools/perf/util/callchain.c?h=v6.0.12&id=7cb2a53f7f413734734bac214c4855e9863b9853
Comment 3•3 years ago
|
||
Canceling NI per comment 2.
It might be interesting to see if you can reproduce this with a small micro-benchmark in the JS shell. You can force megamorphic ICs with setJitCompilerOption("ic.force-megamorphic", 1);
Assignee | ||
Comment 4•3 years ago
|
||
I've increased the stalk walking limit up to 128 and it didn't change anything so I'm not sure that's the problem. I do see the Trampolines issue when profiling JS2 in the shell, but it's not as prevalent. It also seems to occur more in the shell when Ion is disabled. It could very well be a perf limitation somewhere, but it would be nice to find out what it is.
Comment 5•3 years ago
|
||
(Adding this to Speedometer3 bug tree since that is primary reason to be investigating this now)
Updated•3 years ago
|
Comment 6•3 years ago
|
||
When you disable Ion, you 1) lose inlining of JS, 2) have to use call ICs everywhere which are an extra frame, 3) hit trampolines more often since you aren't using ion optimized builtins. As a result, it isn't surprising that you easily have too deep stacks.
With the following test I see expected behaviour with shallow recursion, and the "trampolines" thing with deep recursion. So this looks to be an issue with how we are trying to use perf
.
setJitCompilerOption("ion.enable", 0);
setJitCompilerOption("baseline.enable", 0);
function recurse_test(n) {
if (n == 0) {
return String.fromCharCode(0x1000);
}
return recurse_test(n - 1);
}
for (var i = 0; i < 10_000_000; i++) {
recurse_test(2);
}
Comment 7•3 years ago
|
||
Moving to Core::Performance and letting Denis decide next steps.
Assignee | ||
Comment 8•3 years ago
|
||
Thanks for the help on this, Ted! It is indeed a perf usage issue, and it turns out that it's just perf not outputting call stacks if the threshold is below 0.5%. Using perf report --no-children --call-graph=graph,0
shows me complete call stacks for all functions.
In addition, installing libc6-prof together with setting /proc/sys/kernel/kptr_restrict to 0 removes most of the [unknown] entries from perf script and we get complete call stacks for libc and kernel code as well. This makes the perf report output match much closer to the profile from profiler.firefox.com when importing the perf script output.
Assignee | ||
Updated•3 years ago
|
Comment 9•3 years ago
|
||
Can you add a note about this to https://firefox-source-docs.mozilla.org/performance/jit_profiling_with_perf.html?
Assignee | ||
Comment 10•3 years ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #9)
Can you add a note about this to https://firefox-source-docs.mozilla.org/performance/jit_profiling_with_perf.html?
opened bug 1805671 to update this
Updated•2 years ago
|
Description
•