Open Bug 1862275 Opened 7 months ago Updated 2 months ago

Crash in [@ stackoverflow | mozilla::profiler::PlatformData::ProfiledThread] during BackgroundHangThread::Notify()

Categories

(Core :: Gecko Profiler, defect, P3)

Other
Windows
defect


People

(Reporter: release-mgmt-account-bot, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, Whiteboard: [fxp])

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/919a857f-5a5c-48ea-a4fa-106710231026

Reason: EXCEPTION_STACK_OVERFLOW

Top 10 frames of crashing thread:

0  xul.dll  mozilla::profiler::PlatformData::ProfiledThread() const  tools/profiler/public/ProfilerThreadPlatformData.h:31
0  xul.dll  DoMozStackWalkBacktrace  tools/profiler/core/platform.cpp:2194
1  xul.dll  profiler_suspend_and_sample_thread::<lambda_120>::operator() const  tools/profiler/core/platform.cpp:7105
1  xul.dll  Sampler::SuspendAndSampleAndResumeThread  tools/profiler/core/platform-win32.cpp:286
1  xul.dll  profiler_suspend_and_sample_thread  tools/profiler/core/platform.cpp:7136
2  xul.dll  profiler_suspend_and_sample_thread::<lambda_19>::operator() const  tools/profiler/core/platform.cpp:7190
2  xul.dll  mozilla::profiler::ThreadRegistry::OffThreadRef::WithLockedRWFromAnyThread  tools/profiler/public/ProfilerThreadRegistry.h:188
2  xul.dll  profiler_suspend_and_sample_thread::<lambda_19>::operator() const  tools/profiler/core/platform.cpp:7186
2  xul.dll  mozilla::profiler::ThreadRegistry::WithOffThreadRef  tools/profiler/public/ProfilerThreadRegistry.h:259
2  xul.dll  profiler_suspend_and_sample_thread  tools/profiler/core/platform.cpp:7184

By querying Nightly crashes reported within the last 2 months, here are some insights about the signature:

  • First crash report: 2023-08-22
  • Process type: Multiple distinct types
  • Is startup crash: No
  • Has user comments: No
  • Is null crash: No
Component: General → Gecko Profiler
Duplicate of this bug: 1862278

I looked at about 10 crashes across the two signatures, and they are all happening on the BHMgr Monitor thread, with BackgroundHangThread::Notify() always on the stack. It looks like the main thread has been hanging long enough that the background hang monitor is collecting a stack for it. I think this also means it isn't people who have opted in to turning on the profiler, which would make this less of an issue.

Florian, do you know if there's anything odd about this thread, like a smaller stack size, that would cause these kinds of stack overflows when trying to report a hang from background hang monitoring? Thanks.

Flags: needinfo?(florian)
Summary: Crash in [@ stackoverflow | mozilla::profiler::PlatformData::ProfiledThread] → Crash in [@ stackoverflow | mozilla::profiler::PlatformData::ProfiledThread] during BackgroundHangThread::Notify()

Copying crash signatures from duplicate bugs.

Crash Signature: [@ stackoverflow | mozilla::profiler::PlatformData::ProfiledThread] → [@ stackoverflow | mozilla::profiler::PlatformData::ProfiledThread] [@ stackoverflow | DoMozStackWalkThread]

Some of these crashes are OOMs: there isn't enough commit space to enlarge the stacks. Others aren't, and it's unclear what might be causing them.


(In reply to Andrew McCreight [:mccr8] from comment #2)

> I think this also means it isn't people who have opted in to turning on the profiler, which would make this less of an issue.

Indeed, it's not people who have turned on the profiler; note however that BHR is only enabled on the Nightly channel.

> Florian, do you know if there's anything odd about this thread, like a smaller stack size, that would cause these kinds of stack overflows when trying to report a hang from background hang monitoring?

I don't know. If these are OOM crashes, maybe this thread should reserve enough memory to capture a profiler stack when the thread is created.
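
To illustrate what that could look like, here is a minimal Win32 sketch (not the actual Gecko thread-creation path; MonitorThreadProc and the 512 KiB figure are made-up placeholders). The point is that the stack size passed to CreateThread is either committed up front or only reserved, and it's the committed portion that consumes commit space, so committing it at creation time would avoid a commit failure in the middle of a stack walk.

#include <windows.h>

// Hypothetical entry point standing in for the BHMgr Monitor thread loop.
static DWORD WINAPI MonitorThreadProc(LPVOID) {
  // ... background hang monitoring work ...
  return 0;
}

// Sketch: request a larger stack at creation time. With dwCreationFlags == 0,
// dwStackSize is the amount of stack *committed* immediately (which is what
// uses up commit space / page file); with STACK_SIZE_PARAM_IS_A_RESERVATION
// it only sets the reserved address range.
HANDLE CreateMonitorThreadWithBiggerStack() {
  const SIZE_T kStackCommit = 512 * 1024;  // assumed headroom for a profiler
                                           // stack capture
  return CreateThread(nullptr,        // default security attributes
                      kStackCommit,   // commit this much stack up front
                      MonitorThreadProc,
                      nullptr,        // no argument
                      0,              // 0 => dwStackSize is the commit size
                      nullptr);       // thread id not needed
}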

Flags: needinfo?(florian)
Severity: -- → S3
Priority: -- → P3
Whiteboard: [fxp]

There was a spike on Feb 27 (24 reports), but it looks like they are all from the same machine, as they all happened within 7 seconds on the same hardware.
These reports say that there are 13GB of available physical memory, though. Gabriele, what makes you think that this is really an OOM?

Flags: needinfo?(gsvelto)

(In reply to Julien Wajsberg [:julienw] from comment #6)

> There was a spike on Feb 27 (24 reports), but it looks like they are all from the same machine, as they all happened within 7 seconds on the same hardware.
> These reports say that there are 13GB of available physical memory, though. Gabriele, what makes you think that this is really an OOM?

The ones from the 27th don't look like OOMs. To be sure whether a crash is an OOM or not, you need to check the available page file, because that's the hard limit that Windows uses for memory (see this crash for example, which still has some available physical memory but no page file left; also see my old article for a more in-depth explanation of how commit space works on Windows, in case you're curious).
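
For what it's worth, Windows reports the two numbers separately; a minimal sketch using the standard GlobalMemoryStatusEx query (just an illustration, not how the crash reporter necessarily gathers these annotations):

#include <windows.h>
#include <cstdio>

// A machine can report plenty of free RAM (ullAvailPhys) while the remaining
// commit limit (ullAvailPageFile) is exhausted; it's the latter running out
// that makes stack growth and other commits fail.
int main() {
  MEMORYSTATUSEX status;
  status.dwLength = sizeof(status);
  if (GlobalMemoryStatusEx(&status)) {
    printf("Available physical memory: %llu MiB\n",
           status.ullAvailPhys / (1024 * 1024));
    printf("Available commit (page file): %llu MiB\n",
           status.ullAvailPageFile / (1024 * 1024));
  }
  return 0;
}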

Something odd about the 27th crashes is that you can see the available page file increase in the crashes, as if the child processes dying were freeing memory. It could have been a user with a bazillion open tabs, who turned on the profiler and caused a cascade of crashes due to the increase in memory consumption, but this is just a theory.

Flags: needinfo?(gsvelto)