Closed Bug 1801819 Opened 1 year ago Closed 1 year ago

Crash in [@ shutdownhang | PlatformThread::Join]

Categories

(Core :: Graphics, defect, P2)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
109 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox107 --- wontfix
firefox108 + fixed
firefox109 --- fixed

People

(Reporter: aryx, Assigned: bradwerth)

References

(Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file, 1 obsolete file)

87 of 268 crash reports with this signature for the last week have mozilla::widget::WinWindowOcclusionTracker::ShutDown()` in their stack. The crash signature is new in Firefox 106 (no indication for it being a signature change due to crash-stats changes).

Brad, because you worked on bug 1798651 could you also investigate this one?
Jens, could you check if anything needs to be done for the crashes with this signature in general?

Crash report: https://crash-stats.mozilla.org/report/index/996b4243-4841-4b64-b585-bc0210221121

MOZ_CRASH Reason: Shutdown hanging at step XPCOMShutdown. Something is blocking the main-thread.

Top 10 frames of crashing thread:

0  ntdll.dll  ZwWaitForSingleObject  
1  KERNELBASE.dll  WaitForSingleObjectEx  
2  xul.dll  PlatformThread::Join  ipc/chromium/src/base/platform_thread_win.cc:76
2  xul.dll  base::Thread::Stop  ipc/chromium/src/base/thread.cc:124
3  xul.dll  mozilla::widget::WinWindowOcclusionTracker::ShutDown  widget/windows/WinWindowOcclusionTracker.cpp:388
4  xul.dll  gfxPlatform::ShutdownLayersIPC  gfx/thebes/gfxPlatform.cpp:1365
5  xul.dll  mozilla::ShutdownXPCOM  xpcom/build/XPCOMInit.cpp:589
6  xul.dll  ScopedXPCOMStartup::~ScopedXPCOMStartup  toolkit/xre/nsAppRunner.cpp:2066
7  xul.dll  mozilla::DefaultDelete<ScopedXPCOMStartup>::operator const  mfbt/UniquePtr.h:459
7  xul.dll  mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset  mfbt/UniquePtr.h:301
Flags: needinfo?(bwerth)

But the reports I looked at show WinWindowOcclusionCalc thread just waiting for events?

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #0)

Jens, could you check if anything needs to be done for the crashes with this signature in general?

It would be nice if we could bucket them by the first stack frame called by ShutdownXPCOM. For example here it would be nsComponentManagerImpl::Shutdown() (ok, we would not know which component, but still), here it'd be gfxPlatform::ShutdownLayersIPC() (would contain the WinWindowOcclusionTracker case) and so forth.

Hmm, or we could bucket by the frames that actually call base::Thread::Stop() ? That would give us probably the most details and a direct pointer to the thread in question. In the first example that would be D3DVsyncSource::Shutdown().

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on beta

:bhood, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(bhood)
Keywords: topcrash

BTW, what puzzles me is that we actually see any shutdown hang at step CCPostLastCycleCollection - with default setting 1 for toolkit.shutdown.fastShutdownStage we should never get past this line in ShutdownXPCOM(). Do people change this setting such often?

(In reply to Jens Stutte [:jstutte] from comment #3)

It would be nice if we could bucket them by the first stack frame called by ShutdownXPCOM. For example here it would be nsComponentManagerImpl::Shutdown() (ok, we would not know which component, but still), here it'd be gfxPlatform::ShutdownLayersIPC() (would contain the WinWindowOcclusionTracker case) and so forth.

Will, is this feasible?

Flags: needinfo?(willkg)

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #6)

(In reply to Jens Stutte [:jstutte] from comment #3)

It would be nice if we could bucket them by the first stack frame called by ShutdownXPCOM. For example here it would be nsComponentManagerImpl::Shutdown() (ok, we would not know which component, but still), here it'd be gfxPlatform::ShutdownLayersIPC() (would contain the WinWindowOcclusionTracker case) and so forth.

Will, is this feasible?

Signature generation can't do this currently--it iterates over the frames and doesn't have the ability to look ahead.

We could add more frames to the prefix list. For 712687a3-aaf0-4130-9638-474420221122, the frames are:

  • PlatformThread::Join
  • base::Thread::Stop
  • D3DVsyncSource::Shutdown
  • gfxPlatform::Shutdown
  • nsLayoutModuleDtor
  • nsComponentManagerImpl::Shutdown
  • mozilla::ShutdownXPCOM

So we could add PlatformThread::Join, base::Thread::Stop, D3DVsyncSource::Shutdown, gfxPlatform::Shutdown, and nsLayoutModuleDtor to the prefix list and that would get:

app@socorro:/app$ socorro-cmd signature 712687a3-aaf0-4130-9638-474420221122
Crash id: 712687a3-aaf0-4130-9638-474420221122
Original: shutdownhang | PlatformThread::Join
New:      shutdownhang | PlatformThread::Join | base::Thread::Stop | D3DVsyncSource::Shutdown | gfxPlatform::Shutdown | nsLayoutModuleDtor | nsComponentManagerImpl::Shutdown
Same?:    False

Does that work here?

If not, we should write up a bug for this in Socorro :: Signature and Gabriele and I can scheme. Maybe we turn shutdownhang into more of a special case with a different signature generation algorithm?

Flags: needinfo?(willkg)

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #7)

Does that work here?

If not, we should write up a bug for this in Socorro :: Signature and Gabriele and I can scheme. Maybe we turn shutdownhang into more of a special case with a different signature generation algorithm?

I think we can just prefix up to base::Thread::Stop and keep the remaining detail. Actually as I think of it it seems to be the better approach, pointing directly to the thread in question. Thanks!

Assignee: nobody → bwerth
Severity: -- → S2
Flags: needinfo?(bwerth)
Priority: -- → P2
Flags: needinfo?(bhood)
Regressed by: 1798651

(In reply to Jens Stutte [:jstutte] from comment #1)

Brad, looking at https://searchfox.org/mozilla-central/rev/650c19c96529eb28d081062c1ca274bc50ef3635/nsprpub/pr/src/pthreads/ptsynch.c#268 I wonder if we actually ever get a non NS_OK if we time out?

This is the root cause. ReentrantMonitor calls down to pt_TimedWait, which elides the return value. Unfortunately, this means the changes in D161195 need to be undone since the new timed wait function will return NS_OK even in timeout. One potential fix is to build an alternative to SynchronousTask which instead uses Monitor. I'll pursue that.

ReentrantMonitor (used by SynchronousTask) calls through to pt_TimedWait
which elides timeout information. This patch creates a workalike
NonReentrantTimedSynchronousTask which instead uses Monitor. It also
changes the Windows thread Shutdown methods to use this new task.

Attachment #9304785 - Attachment is obsolete: true
Pushed by bwerth@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/58bf4c34d5c2
Make SynchronousTask double-check its completion flag to identify timeouts. r=sotaro

Brad, is this something we might uplift to 108? If not, please update the tracking flag to wontfix. Thanks!

Flags: needinfo?(bwerth)

Comment on attachment 9304956 [details]
Bug 1801819: Make SynchronousTask double-check its completion flag to identify timeouts.

Beta/Release Uplift Approval Request

  • User impact if declined: In unusual conditions, Windows clients may crash on shutdown.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: Unknown conditions to reproduce.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This enables an already-landed less-risky path for thread shutdown that may leak memory but definitely won't crash.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9304956 - Flags: approval-mozilla-beta?
Flags: needinfo?(bwerth)
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch

Comment on attachment 9304956 [details]
Bug 1801819: Make SynchronousTask double-check its completion flag to identify timeouts.

Approved for 108.0b9

Attachment #9304956 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: