Closed Bug 1843384 Opened 2 years ago Closed 1 year ago

Crash in [@ shutdownhang | mozilla::net::nsSocketTransportService::SetOffline]

Categories

(Core :: Networking, defect, P2)

All
Windows
defect

Tracking

()

RESOLVED FIXED
119 Branch
Tracking Status
firefox-esr102 --- wontfix
firefox-esr115 - fixed
firefox117 --- wontfix
firefox118 --- wontfix
firefox119 --- fixed

People

(Reporter: RyanVM, Assigned: acreskey)

References

Details

(Keywords: crash, Whiteboard: [necko-triaged][necko-priority-queue])

Crash Data

Attachments

(2 files)

Crash report: https://crash-stats.mozilla.org/report/index/a1bf37d6-e50b-45ed-b277-a26cc0230713

MOZ_CRASH Reason: Shutdown hanging at step AppShutdownNetTeardown. Something is blocking the main-thread.

Top 10 frames of crashing thread:

0  ntdll.dll  ZwWaitForAlertByThreadId  
1  ntdll.dll  RtlAcquireSRWLockExclusive  
2  xul.dll  mozilla::OffTheBooksMutex::Lock  xpcom/threads/Mutex.h:65
2  xul.dll  mozilla::detail::BaseAutoLock<mozilla::Mutex&>::BaseAutoLock  xpcom/threads/Mutex.h:236
2  xul.dll  mozilla::net::nsSocketTransportService::SetOffline  netwerk/base/nsSocketTransportService2.cpp:894
3  xul.dll  mozilla::net::nsIOService::SetOfflineInternal  netwerk/base/nsIOService.cpp:1289
4  xul.dll  mozilla::net::nsIOService::Observe  netwerk/base/nsIOService.cpp:1635
5  xul.dll  nsObserverList::NotifyObservers  xpcom/ds/nsObserverList.cpp:70
5  xul.dll  nsObserverService::NotifyObservers  xpcom/ds/nsObserverService.cpp:291
6  xul.dll  mozilla::AppShutdown::AdvanceShutdownPhaseInternal  xpcom/base/AppShutdown.cpp:433

Seems like quite a few instances happening, I think exclusively, in the past week.
Affecting Nightly, release back to at least 109.0.1 , ESR back to at least 102.12.0esr.

https://crash-stats.mozilla.org/signature/?signature=shutdownhang%20%7C%20mozilla%3A%3Anet%3A%3AnsSocketTransportService%3A%3ASetOffline&date=%3E%3D2023-07-07T15%3A15%3A00.000Z&date=%3C2023-07-14T15%3A15%3A00.000Z&_sort=-date

I'm putting directly in necko-priority-queue for closer inspection.

Severity: -- → S2
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-queue]

Very roughly 15-30 shutdown crashes / day.
So maybe not the highest priority.

Looks like a deadlock in shutting down the socketTransportService, while trying to acquire this mutex
https://searchfox.org/mozilla-central/rev/2d06b7d5fbfa7a31776389da87c5f895b80df8d9/netwerk/base/nsSocketTransportService2.cpp#886
There are a few users of that mutex.

But possibly something in the 110.0.1 dot release, but I'm not positive
https://crash-stats.mozilla.org/signature/?signature=shutdownhang%20%7C%20mozilla%3A%3Anet%3A%3AnsSocketTransportService%3A%3ASetOffline&date=%3E%3D2023-01-01T15%3A15%3A00.000Z&date=%3C2023-07-14T15%3A15%3A00.000Z#graphs

Shutdown hangs rarely would be an S2, IMO

A quick look implies 111, but it could be 110.something

There were occasional shutdown hangs there for a long time, though

Severity: S2 → S3

The socket thread is stuck in select here. Definitely a deadlock.

3 ws2_32.dll select cfi
4 nss3.dll socket_io_wait(long long, int, unsigned int) nsprpub/pr/src/md/windows/w95sock.c:695 cfi
5 nss3.dll _MD_Accept(PRFileDesc*, PRNetAddr*, unsigned int*, unsigned int) nsprpub/pr/src/md/windows/w95sock.c:141 inlined
5 nss3.dll SocketAccept(PRFileDesc*, PRNetAddr*, unsigned int) nsprpub/pr/src/io/prsocket.c:445 cfi
6 nss3.dll PR_Accept(PRFileDesc*, PRNetAddr*, unsigned int) nsprpub/pr/src/io/priometh.c:167 inlined
6 nss3.dll PR_NewTCPSocketPair(PRFileDesc**) nsprpub/pr/src/io/prsocket.c:1708 cfi
7 xul.dll mozilla::net::PollableEvent::PollableEvent() netwerk/base/PollableEvent.cpp:179 cfi
8 xul.dll mozilla::net::nsSocketTransportService::Run() netwerk/base/nsSocketTransportService2.cpp:1086 cfi
9 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1193 cfi
10 xul.dll NS_ProcessNextEvent(nsIThread*, bool) xpcom/threads/nsThreadUtils.cpp:480 inlined
Assignee: nobody → acreskey

I had another short look at this. A few notes:

  • crashing during shutdown
  • mutex is already acquired on the socket thread in nsSocketTransportService::Run

https://searchfox.org/mozilla-central/rev/07548591c7125357ea27e7c2dbe9d610ea2d8f35/netwerk/base/nsSocketTransportService2.cpp#1057-1079

NS_IMETHODIMP
nsSocketTransportService::Run() {
  SOCKET_LOG(("STS thread init %d sockets\n", gMaxCount));

#if defined(XP_WIN)
  // see bug 1361495, gethostname() triggers winsock initialization.
  // so do it here (on parent and child) to protect against it being done first
  // accidentally on the main thread.. especially via PR_GetSystemInfo(). This
  // will also improve latency of first real winsock operation
  // ..
  // If STS-thread is no longer needed this should still be run before exiting

  char ignoredStackBuffer[255];
  Unused << gethostname(ignoredStackBuffer, 255);
#endif

  psm::InitializeSSLServerCertVerificationThreads();

  gSocketThread = PR_GetCurrentThread();

  {
    MutexAutoLock lock(mLock);
    mPollableEvent.reset(new PollableEvent()); <--- this line hangs

After doing a bit of code archeology, I got to this code:

https://searchfox.org/mozilla-central/rev/07548591c7125357ea27e7c2dbe9d610ea2d8f35/netwerk/base/PollableEvent.cpp#171-179

if (NewTCPSocketPair(fd, true)) {
  mReadFD = fd[0];
  mWriteFD = fd[1];
  // If the previous fails try without recv buffer increase (bug 1305436).
} else if (NewTCPSocketPair(fd, false)) {
  mReadFD = fd[0];
  mWriteFD = fd[1];
  // If both fail, try the old version.
} else if (PR_NewTCPSocketPair(fd) == PR_SUCCESS) {

So it seems we first try NewTCPSocketPair twice, and when that fails we call PR_NewTCPSocketPair which is what ends up hanging.
It seems to be related to bug 698882 and bug 1305436.

A quick thing we can do to avoid blocking the main thread is to allocate the pollableEvent outside the mutex. So instead of this:
https://searchfox.org/mozilla-central/rev/07548591c7125357ea27e7c2dbe9d610ea2d8f35/netwerk/base/nsSocketTransportService2.cpp#1078-1079

MutexAutoLock lock(mLock);
mPollableEvent.reset(new PollableEvent());

We'd try this:

PollableEvent* pollable = new PollableEvent();
MutexAutoLock lock(mLock);
mPollableEvent.reset(pollable);

This still hangs the socket thread, but maybe that's a bit better than also hanging the main thread?

We avoid blocking the main thread by allocating the PollableEvent outside the mutex. Still has the potential to hang the socket thread, but the main thread remains responsive.

Thanks for digging into this one Valentin.
And the call to nsSocketTransportService::Run() during shutdown -- some code is trying to restart the nsSocketTransportService during shutdown?

Flags: needinfo?(valentin.gosu)

That's a great question, Andrew. Presumably we should not attempt to reinitialize the socket thread if we are during shutdown here:
https://searchfox.org/mozilla-central/rev/64eef6b342bca86930caf3ba403c0f8ce621b474/netwerk/base/nsIOService.cpp#1288

InitializeSocketTransportService();

Maybe we could add a check to the beginning of InitializeSocketTransportService that checks:

  if (AppShutdown::IsInOrBeyond(ShutdownPhase::AppShutdownConfirmed)) {
    return NS_OK;
  }

I don't know if this explains why PR_NewTCPSocketPair is failing (I assume it was stuck from a previous call) but it's still a good way forward.

Flags: needinfo?(valentin.gosu)
See Also: → 1853010
Pushed by acreskey@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2533739a555b Crash in [@ shutdownhang | mozilla::net::nsSocketTransportService::SetOffline] r=valentin,necko-reviewers
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 119 Branch

I see crashes with this signature from Nightly in builds that contain this fix. Are they the same as this or something different?

Flags: needinfo?(acreskey)

Thanks Ryan,

Here's one such crash and I'm quite sure this is the same hang, with the socket thread locked up in the same spot

ws2_32.dll 	select
nss3.dll 	socket_io_wait(long long, int, unsigned int) 	nsprpub/pr/src/md/windows/w95sock.c:695 	cfi
nss3.dll 	_PR_MD_CONNECT(PRFileDesc*, PRNetAddr const*, unsigned int, unsigned int) 	nsprpub/pr/src/md/windows/w95sock.c:168 	inlined
nss3.dll 	SocketConnect(PRFileDesc*, PRNetAddr const*, unsigned int) 	nsprpub/pr/src/io/prsocket.c:274 	cfi
nss3.dll 	PR_Connect(PRFileDesc*, PRNetAddr const*, unsigned int) 	nsprpub/pr/src/io/priometh.c:155 	inlined
nss3.dll 	PR_NewTCPSocketPair(PRFileDesc**) 	nsprpub/pr/src/io/prsocket.c:1696 	cfi
xul.dll 	mozilla::net::PollableEvent::PollableEvent() 	netwerk/base/PollableEvent.cpp:179 	cfi
xul.dll 	mozilla::net::nsSocketTransportService::Run() 	netwerk/base/nsSocketTransportService2.cpp:1079 	cfi
xul.dll 	nsThread::ProcessNextEvent(bool, bool*)

We split this bug into two parts, preventing re-initializing of socket transport during shutdown in this one, Bug 1843384.
And then we made bug 1853010 to prevent the above hang (if it were still to occur) from locking up the main thread.

Evidently the first fix didn't capture the scenario (but I think the code is still valuable).
So I think we should land the code from bug 1853010.

Would it be cleaner to land it against this bug?
cc: valentin

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(ryanvm)
Flags: needinfo?(acreskey)

If the patch landed here didn't make an observable difference, I think reopening the bug is reasonable.

Flags: needinfo?(ryanvm)

Thanks, that sounds simplest and best to me too.

Status: RESOLVED → REOPENED
Flags: needinfo?(valentin.gosu)
Resolution: FIXED → ---
Target Milestone: 119 Branch → ---

We avoid blocking the main thread by allocating the PollableEvent outside the mutex. Still has the potential to hang the socket thread, but the main thread remains responsive.

Pushed by acreskey@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/fa27440232f8 Crash in [@ shutdownhang | mozilla::net::nsSocketTransportService::SetOffline] r=valentin,necko-reviewers
Status: REOPENED → RESOLVED
Closed: 1 year ago1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 119 Branch

The patch landed in nightly and beta is affected.
:acreskey, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox118 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(acreskey)

Set status-firefox118 to wontfix as this is a shutdown hang and also because we'd like to observe the impact of this fix.

Flags: needinfo?(acreskey)
Regressions: 1853968
Duplicate of this bug: 1853010
See Also: 1853010
No longer duplicate of this bug: 1853010

No crashes on Beta 119 - did you want to nominate this for ESR115 uplift? Both patches graft cleanly.

Flags: needinfo?(acreskey)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #24)

No crashes on Beta 119 - did you want to nominate this for ESR115 uplift? Both patches graft cleanly.

I don't have the best understanding of the tradeoffs considered when nominating for an ESR uplift.
But we believe this to be a safer fix that only affects shutdown.

Flags: needinfo?(acreskey)

Go ahead and nominate for uplift.

Flags: needinfo?(acreskey)

[Tracking Requested - why for this release]:

We believe that these patches are very safe (only affect shutdown) and we've seen that they stop the crash.

Flags: needinfo?(acreskey)

Approval requests are done on the patch (setting approval-mozilla-esr115 to ? and filling out the form). Tracking is something else :)

Flags: needinfo?(acreskey)

Comment on attachment 9352531 [details]
Bug 1843384 - Crash in [@ shutdownhang | mozilla::net::nsSocketTransportService::SetOffline] r=#necko-reviewers,valentin!

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: The two patches for this bug have resolved a longer running (although relatively infrequent) shutdown hang.
  • User impact if declined: A very small percentage of users will see the parent process hang on shutdown.
  • Fix Landed on Version: 119
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch is low risk because it only runs when app shutdown is confirmed.
Flags: needinfo?(acreskey)
Attachment #9352531 - Flags: approval-mozilla-esr115?
Attachment #9353766 - Flags: approval-mozilla-esr115?

(In reply to Ryan VanderMeulen [:RyanVM] from comment #28)

Approval requests are done on the patch (setting approval-mozilla-esr115 to ? and filling out the form). Tracking is something else :)

Thank you!

Comment on attachment 9352531 [details]
Bug 1843384 - Crash in [@ shutdownhang | mozilla::net::nsSocketTransportService::SetOffline] r=#necko-reviewers,valentin!

Approved for 115.5esr.

Attachment #9352531 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+
Attachment #9353766 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+
See Also: → 1719046
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: