Closed Bug 1779257 Opened 2 years ago Closed 2 years ago

We don't capture native stacks for markers anymore on Linux

Categories

(Core :: Gecko Profiler, defect, P2)

Unspecified
Linux
defect

Tracking

()

RESOLVED FIXED
112 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- wontfix
firefox102 --- wontfix
firefox103 --- wontfix
firefox104 --- wontfix
firefox105 --- wontfix
firefox110 --- wontfix
firefox111 --- wontfix
firefox112 --- fixed

People

(Reporter: julienw, Assigned: canova)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [fxp])

Attachments

(1 file)

STR:

  1. Capture a profile with any marker that captures a stack (eg: Reflow or DoFlushPendingNotifications)

The markers only have non-native stacks.

According to Florian this seems to be Linux only, at least he says this works on MacOS.

mozregression gives this pushlog https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=448f017b7781d8504c547599bddb50814e71e3c7&tochange=c8933e49a64bb79e25d333e7c832880048cb5712 which points to bug 1716959.

Gerald can you please have a look?

OS: Unspecified → Linux
Summary: We don't capture native stacks for markers anymore → We don't capture native stacks for markers anymore on Linux

Set release status flags based on info from the regressing bug 1716959

:gerald, since you are the author of the regressor, bug 1716959, could you take a look?
For more information, please visit auto_nag documentation.

Flags: needinfo?(gsquelart)
Severity: -- → S3
Flags: needinfo?(gsquelart)
Priority: -- → P2

Set release status flags based on info from the regressing bug 1716959

Hey Gerald, can you give some light about how this regression could have happened? Thanks :-)

Flags: needinfo?(mozbugz)

It is quite possible that bug 1716959 caused this, as it moved some things around, that are used for stack-walking.
Unfortunately I'm not a Linux guru, so I'm not sure what's happening.
I did try to revert parts of the patches in bug 1716959, but could not get stack-walking working again. Either I did something wrong, or that bug it not the real cause? (And I couldn't run mozregression either, to test that theory.)

I'll try again, but we may need to get real experts involved...

Flags: needinfo?(mozbugz)

Test various features, for example disable stack sampling, see if this makes a difference and possibly get a pernosco session about that.

Flags: needinfo?(felash)

Changing features doesn't change anything.
I'm trying to get a pernosco session but it's a bit difficult right now, but I'm still looking at this.
I did confirm again that the things that work on MacOS don't work on Linux.

Other markers easy to spot: ViewManagerFlush or NotifyObservers.

I think I managed to capture a pernosco session.
This specific view is when capturing a stack for a NotifyObservers text marker.

This other view is when capturing the cause for a ViewManagerFlush marker.

This is the profile recorded during this rr session => https://share.firefox.dev/3DlESO3

I hope somebody will be able to look at it, because this is quite out of my skills.

Flags: needinfo?(felash)

I spent some time to investigate the issue that we have and I found a fix to it.

We are using getcontext the get the user context when we need to do a sync backtrace capture. And we may use 2 different getcontexts depending on the platform and libc because it's not a part of the standard C library and no longer part of the Posix standard. If libc has it, then we use the libc provided one. If it doesn't, we fallback to breakpad_getcontext.

I'm not sure in which cases we don't have getcontext, but my ubuntu linux machine with the current mozilla-central clang has it. I tried both with libc and breakpad ones though to make sure.

It looks like breakpad's getcontext was working without a problem. But libc getcontext wasn't working properly (which seems like it's the mostly used one from the complaints).

Also I tested with different build flags, and found out that libc getcontext is working when built with no optimizations. When we build with optimization (starting from -O1), it stops working properly.

The next thing I tested was to call the getcontext in the parent right after initializing Registers here instead of calling them inside Registers::SyncPopulate. It appears that this indeed fixes the issue. When I searched for it, it's mentioned in a few places that the saved context is invalid once the function that called getcontext returns. We need to call the getcontext while the frame where we called it is still on the stack. Because stack pointer address will be invalid once the SyncPopulate returns. So that explains why it works when we call it inside the profiler_capture_backtrace_into function but not inside the Registers::SyncPopulate method.

I have a patch now to move this getcontext function out of that method and inline to its parent with a macro. That way the user context will be valid during the stack walking. I tested it manually and it seems like it's working well.

See my comment on here for more context of my investigation:
https://bugzilla.mozilla.org/show_bug.cgi?id=1779257#c9

The saved context is invalid once the function that called getcontext
returns. We need to call the getcontext while the frame where we called it is
still on the stack. That's why this patch is moving the call to getcontext to
parent function by inlining the SyncPopulate content by using a macro instead.
This has to be a macro instead of a function because stack pointer address will
be invalid once the Registers::SyncPopulate returns. I tried to change this
method to inline but that didn't help either.

Assignee: nobody → canaltinova
Status: NEW → ASSIGNED

(In reply to Nazım Can Altınova [:canova][:canaltinova on phabricator] from comment #9)

Because stack pointer address will be invalid once the SyncPopulate returns. So that explains why it works when we call it inside the profiler_capture_backtrace_into function but not inside the Registers::SyncPopulate method.

This reminds me of bug 1714501. Your patch may actually fix that bug as well! (But I'm just passing by, you may want to check for yourself -- Thank you for working on this.)

Hey Gerald! Good to see you around :)

(In reply to Gerald Squelart (he/him) from comment #11)

(In reply to Nazım Can Altınova [:canova][:canaltinova on phabricator] from comment #9)

Because stack pointer address will be invalid once the SyncPopulate returns. So that explains why it works when we call it inside the profiler_capture_backtrace_into function but not inside the Registers::SyncPopulate method.

This reminds me of bug 1714501. Your patch may actually fix that bug as well! (But I'm just passing by, you may want to check for yourself -- Thank you for working on this.)

Oh! That looks like the exact problem! I didn't know about this bug, thanks for pointing it. It answers why we see different results for non-opt and optimized builds. Probably it's because of the tail-call optimization happening on the optimized ones.

And it mentions that macOS get the values from the caller, which means that now after it's inlined to the parent, we might not see profiler_capture_backtrace_into frame anymore. This could be a good thing (I remember some people mentioning that it being useless as it's always obviously there), but probably we should be consistent with other platforms or decide intentionally. I will investigate that a bit more.

Update about macOS:
It looks like we don't see profiler_capture_backtrace_into frame on the optimized builds already. It appears that Registers::SyncPopulate is being inlined on macOS. So changing this function to macro doesn't have any visible affect after all. Here are before/after profiles:

macOS before the fix: https://share.firefox.dev/3xtITwz
macOS after the fix: https://share.firefox.dev/3XTFPoB

Whiteboard: [fxp]
Pushed by canaltinova@gmail.com: https://hg.mozilla.org/integration/autoland/rev/19c1023da36b Fix the stack walking for linux markers by calling getcontext in a valid stack frame r=mstange
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch
Duplicate of this bug: 1714501
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: