Open Bug 1686565 Opened 4 years ago Updated 1 year ago

Stackwalking sometimes missing in startup profiles

Categories

(Core :: Gecko Profiler, defect, P2)

Desktop
Windows 10
defect

Tracking

()

People

(Reporter: florian, Unassigned)

References

Details

I'm observing this bug frequently on my Acer reference laptop: https://share.firefox.dev/3ii9CnG

In the parent process, on every sample the only native frame I can see is the leaf. On the GPU Process, stackwalking worked normally.
I noticed that on processes where the stackwalking didn't work, it worked in the base profiler ("GeckoMain (pre-xul)" track). And on the process where it did work, then it was broken in the base profiler (see the GPU Process' GeckoMain pre-xul track).

Somehow, I can reproduce this pretty consistently on my Windows 10 laptop running the build 16299 of Windows 10, but I can't reproduce it on my other reference laptop that's running the build 19041.
I don't know if the different Windows 10 version is actually the reason for the different behavior.

I'll see if I can reproduce it...

This makes me want to implement bug 1676271 and/or bug 1683040 even more, to store more profiler diagnostic notes. In this case, if we could record stack-walking API failure reasons, it would certainly help to isolate the issue. 🤔

(In reply to Florian Quèze [:florian] from comment #0)

I noticed that on processes where the stackwalking didn't work, it worked in the base profiler ("GeckoMain (pre-xul)" track). And on the process where it did work, then it was broken in the base profiler (see the GPU Process' GeckoMain pre-xul track).

If it was only one way (base ok, gecko fail), it could point as some bad interaction between the two profilers, but since it happens the other way I doubt that's the case here.
If you'd like to verify that, please add MOZ_PROFILER_STARTUP_NO_BASE=1 in your environment to prevent the base profiler from running, and see if the gecko profiler ever fails.

Severity: -- → S2
Priority: -- → P2

(In reply to Gerald Squelart [:gerald] (he/him) from comment #1)

If it was only one way (base ok, gecko fail), it could point as some bad interaction between the two profilers, but since it happens the other way I doubt that's the case here.
If you'd like to verify that, please add MOZ_PROFILER_STARTUP_NO_BASE=1 in your environment to prevent the base profiler from running, and see if the gecko profiler ever fails.

I've just done some more testing and I'm increasingly confused. The first profile I got today was OK in both the base and gecko profilers, on all processes (this was likely just an accident, I checked later that rebooting doesn't help). Then I got a profile where I saw the same failures as yesterday. Then I added the MOZ_PROFILER_STARTUP_NO_BASE=1 environment variable, and got 3 profiles in a row where the gecko profiler was OK in all processes. Now I removed the environment variable, and at the next profile I saw failures again like yesterday. And I just took another profile that's failing similarly, but has 2 processes that are OK in both the base and gecko profilers.
I'm also sometimes seeing deadlocks during startup, either with a black screen, or with the skeleton UI. Sometimes it's content processes that fail to start (eg. I get a blank tab instead of about:home). If I encountered this on my devlopment machine I would attach Visual Studio to get a stack, but installing Visual Studio on my reference laptop isn't realistic.

(In reply to Florian Quèze [:florian] from comment #0)

Somehow, I can reproduce this pretty consistently on my Windows 10 laptop running the build 16299 of Windows 10, but I can't reproduce it on my other reference laptop that's running the build 19041.
I don't know if the different Windows 10 version is actually the reason for the different behavior.

The machine is currently installing the Windows 10 20H2 feature update, so we'll soon know if the old version of Windows 10 was the reason for these failures. And if it was, I guess I'll have to resolve this as worksforme, as we may no longer have a way to reproduce.

(In reply to Florian Quèze [:florian] from comment #3)

The machine is currently installing the Windows 10 20H2 feature update

After updating, I get correct stack walking most of the time, but relatively frequently I find one base profiler that failed to walk the stacks. For example in https://share.firefox.dev/2XIvlM6 stack walking failed in the base profiler part of the "Web Content" and "Privileged Content" process, and worked everywhere else.

(In reply to Florian Quèze [:florian] from comment #4)

(In reply to Florian Quèze [:florian] from comment #3)

The machine is currently installing the Windows 10 20H2 feature update

After updating, I get correct stack walking most of the time, but relatively frequently I find one base profiler that failed to walk the stacks. For example in https://share.firefox.dev/2XIvlM6 stack walking failed in the base profiler part of the "Web Content" and "Privileged Content" process, and worked everywhere else.

Or maybe it worked everywhere and I just saw the effect of https://searchfox.org/mozilla-central/rev/2a24205479519e70c0574929f45730d285141584/mozglue/dllservices/LoaderObserver.cpp#51 disabling stack walking. I'm starting to wonder if we should have label frames when stack walking is intentionally disabled, to reduce confusion.

Ok, https://share.firefox.dev/3ifhIgK is a better example of stack walking failure, on the Gecko Profiler for the Privileged Content process this time.

(In reply to Florian Quèze [:florian] from comment #5)

(In reply to Florian Quèze [:florian] from comment #4)

Or maybe it worked everywhere

Thank you for testing all this... So does it mean WorksForMe now? 😅

and I just saw the effect of https://searchfox.org/mozilla-central/rev/2a24205479519e70c0574929f45730d285141584/mozglue/dllservices/LoaderObserver.cpp#51 disabling stack walking. I'm starting to wonder if we should have label frames when stack walking is intentionally disabled, to reduce confusion.

Bug 1683040 would really help know about this, if we could insert samples with the reason why there's no stack.

(In reply to Gerald Squelart [:gerald] (he/him) from comment #7)

(In reply to Florian Quèze [:florian] from comment #5)

(In reply to Florian Quèze [:florian] from comment #4)

Or maybe it worked everywhere

Thank you for testing all this... So does it mean WorksForMe now? 😅

Unfortunately, no, it just means the profile in comment 4 was a bad example, see comment 6 for a better example.

I'm bumping this from S2 to S3 because it doesn't affect users.

Severity: S2 → S3
See Also: → 1836225
You need to log in before you can comment on or make changes to this bug.