Closed Bug 1726125 Opened 3 years ago Closed 3 years ago

Gecko profiler crashes when running under rr, when the "Native stacks" feature is enabled

Categories

(Core :: Gecko Profiler, defect, P3)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
93 Branch
Tracking Status
firefox93 --- fixed

People

(Reporter: julienw, Assigned: mozbugz)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Prerequisite: have rr installed locally.

STR:

  1. run firefox under rr, for example from mozilla-central:

./mach run --debugger=rr --debugger-args=record

  1. Start the profiler. Just make sure that "Native stacks" is enabled (which is the case in all presets).

=> Crash with this message:

Exiting due to channel error.
Sandbox: Unexpected EOF, op 0 flags 00 path 
Exiting due to channel error.
Exiting due to channel error.

I noticed issues with a few tests:
This crashes:

./mach gtest --debugger=rr --debugger-args="record" GeckoProfiler.Pause

This doesn't crash but fails:

./mach test --debugger=rr --debugger-args="record" tools/profiler/tests/xpcshell/test_feature_stackwalking.js

(of course it passes without rr)

I couldn't find any problems with other tests.

I'm making a debug build to see if I get more logs, and will push a new pernosco recording.

Severity: -- → S3
OS: Unspecified → Linux
Priority: -- → P3
Hardware: Unspecified → x86_64

Pernosco's khuey said "I don't think you can fix this on your side", so it looks like you'll have to choose between rr and native stacks.

A debug build doesn't bring new interesting logs.

However, I could capture a pernosco session from the gtest crash:
https://pernos.co/debug/choiTdzb6D0p9oMB7J3r8Q/index.html#f{m[GiQ,AA_,t[Dw,Bd+3_,f{e[Gh0,Aic_,s{af72QlKAA,bAYk,uKiLhaw,oKjBBsw___

We clearly see the crashing memcpy.

(In reply to Gerald Squelart [:gerald] (he/him) from comment #1)

Pernosco's khuey said "I don't think you can fix this on your side", so it looks like you'll have to choose between rr and native stacks.

Do we have a way to detect that we'll have these mismatched adresses before crashing? And maybe disable native stack automatically in this case?

Looking at the code in DoLULBacktrace: https://searchfox.org/mozilla-central/rev/912ff8d38996365e8b843cd82ddce348d8b7f78b/tools/profiler/core/platform.cpp#2375-2400
end is the root-side end of the stack: 0x7FFDDAFEC770.
start is the leaf-side end of the stack, taken from the SP register: 0x681FFCD0
It's quite clear already that something is fishy!
nCopy is the number of bytes to copy, computed from end - start, it's more than 100 trillion! But then it's just clamped to lul::N_STACK_BYTES (160 KiB).
Instead we should see that end - start is impossibly big, and gracefully give up. There is already a check that start < end, otherwise nCopy is kept at zero; we could do the same if end - start > lul::N_STACK_BYTES.

Julien, could you please try something?
Change the following line in tools/profiler/core/platform.cpp:

if (nToCopy > lul::N_STACK_BYTES) nToCopy = lul::N_STACK_BYTES;

to:

if (nToCopy > lul::N_STACK_BYTES) nToCopy = 0;

If that still fails, try:

if (nToCopy > lul::N_STACK_BYTES) return;

(I'm not familiar with LUL, that's why I'm not 100% confident that this will "just" work.)

Good luck!

Flags: needinfo?(felash)

Both changes fix the crash, and I can capture a profile with them.
What should I look especially in these profiles?

Flags: needinfo?(felash)

(In reply to Julien Wajsberg [:julienw] from comment #4)

Both changes fix the crash, and I can capture a profile with them.
What should I look especially in these profiles?

Not crashing is a good outcome already. 😀
I'm wondering if you still see stacks in the profile? Are there empty samples?

I'll prepare a proper patch soon...

Assignee: nobody → gsquelart

Running the profiler while Firefox runs in rr sometimes crashes, and it seems the SP register is far from the known stack area (ending at StackTop), which then proceeds to copy some protected memory.

Note that there was already a start < end test, this patch adds an equivalent test on the other side, allowing for a 1MB stack size. Observed stacks during quick testing were up to around 520KB, and the crashing rr case was in the trillions(!), so 1MB should be an appropriate number that should catch "wrong" SPs while allowing real stacks.

Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2efa540983a1
On Linux, detect if SP is likely to be outside the known stack area - r=mstange
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: