Crash in [@ shutdownhang | mozilla::net::nsSocketTransportService::SetOffline]
Categories
(Core :: Networking, defect, P2)
Tracking
()
People
(Reporter: RyanVM, Assigned: acreskey)
References
Details
(Keywords: crash, Whiteboard: [necko-triaged][necko-priority-queue])
Crash Data
Attachments
(2 files)
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-esr115+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-esr115+
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/a1bf37d6-e50b-45ed-b277-a26cc0230713
MOZ_CRASH Reason: Shutdown hanging at step AppShutdownNetTeardown. Something is blocking the main-thread.
Top 10 frames of crashing thread:
0 ntdll.dll ZwWaitForAlertByThreadId
1 ntdll.dll RtlAcquireSRWLockExclusive
2 xul.dll mozilla::OffTheBooksMutex::Lock xpcom/threads/Mutex.h:65
2 xul.dll mozilla::detail::BaseAutoLock<mozilla::Mutex&>::BaseAutoLock xpcom/threads/Mutex.h:236
2 xul.dll mozilla::net::nsSocketTransportService::SetOffline netwerk/base/nsSocketTransportService2.cpp:894
3 xul.dll mozilla::net::nsIOService::SetOfflineInternal netwerk/base/nsIOService.cpp:1289
4 xul.dll mozilla::net::nsIOService::Observe netwerk/base/nsIOService.cpp:1635
5 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:70
5 xul.dll nsObserverService::NotifyObservers xpcom/ds/nsObserverService.cpp:291
6 xul.dll mozilla::AppShutdown::AdvanceShutdownPhaseInternal xpcom/base/AppShutdown.cpp:433
Seems like quite a few instances happening, I think exclusively, in the past week.
Affecting Nightly, release back to at least 109.0.1 , ESR back to at least 102.12.0esr.
I'm putting directly in necko-priority-queue for closer inspection.
My bad, my filter was set for a week. This bug goes back to early march.
https://crash-stats.mozilla.org/signature/?signature=shutdownhang%20%7C%20mozilla%3A%3Anet%3A%3AnsSocketTransportService%3A%3ASetOffline&date=%3E%3D2023-01-14T18%3A34%3A00.000Z&date=%3C2023-07-14T18%3A34%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=53#graphs
Assignee | ||
Comment 3•2 years ago
|
||
Very roughly 15-30 shutdown crashes / day.
So maybe not the highest priority.
Looks like a deadlock in shutting down the socketTransportService, while trying to acquire this mutex
https://searchfox.org/mozilla-central/rev/2d06b7d5fbfa7a31776389da87c5f895b80df8d9/netwerk/base/nsSocketTransportService2.cpp#886
There are a few users of that mutex.
But possibly something in the 110.0.1 dot release, but I'm not positive
https://crash-stats.mozilla.org/signature/?signature=shutdownhang%20%7C%20mozilla%3A%3Anet%3A%3AnsSocketTransportService%3A%3ASetOffline&date=%3E%3D2023-01-01T15%3A15%3A00.000Z&date=%3C2023-07-14T15%3A15%3A00.000Z#graphs
Comment 4•2 years ago
|
||
Shutdown hangs rarely would be an S2, IMO
A quick look implies 111, but it could be 110.something
There were occasional shutdown hangs there for a long time, though
Comment 5•2 years ago
|
||
Comment 7•2 years ago
|
||
The socket thread is stuck in select here. Definitely a deadlock.
3 | ws2_32.dll | select | cfi | |
---|---|---|---|---|
4 | nss3.dll | socket_io_wait(long long, int, unsigned int) | nsprpub/pr/src/md/windows/w95sock.c:695 | cfi |
5 | nss3.dll | _MD_Accept(PRFileDesc*, PRNetAddr*, unsigned int*, unsigned int) | nsprpub/pr/src/md/windows/w95sock.c:141 | inlined |
5 | nss3.dll | SocketAccept(PRFileDesc*, PRNetAddr*, unsigned int) | nsprpub/pr/src/io/prsocket.c:445 | cfi |
6 | nss3.dll | PR_Accept(PRFileDesc*, PRNetAddr*, unsigned int) | nsprpub/pr/src/io/priometh.c:167 | inlined |
6 | nss3.dll | PR_NewTCPSocketPair(PRFileDesc**) | nsprpub/pr/src/io/prsocket.c:1708 | cfi |
7 | xul.dll | mozilla::net::PollableEvent::PollableEvent() | netwerk/base/PollableEvent.cpp:179 | cfi |
8 | xul.dll | mozilla::net::nsSocketTransportService::Run() | netwerk/base/nsSocketTransportService2.cpp:1086 | cfi |
9 | xul.dll | nsThread::ProcessNextEvent(bool, bool*) | xpcom/threads/nsThread.cpp:1193 | cfi |
10 | xul.dll | NS_ProcessNextEvent(nsIThread*, bool) | xpcom/threads/nsThreadUtils.cpp:480 | inlined |
Assignee | ||
Updated•1 years ago
|
Comment 8•1 years ago
|
||
I had another short look at this. A few notes:
- crashing during shutdown
- mutex is already acquired on the socket thread in nsSocketTransportService::Run
NS_IMETHODIMP
nsSocketTransportService::Run() {
SOCKET_LOG(("STS thread init %d sockets\n", gMaxCount));
#if defined(XP_WIN)
// see bug 1361495, gethostname() triggers winsock initialization.
// so do it here (on parent and child) to protect against it being done first
// accidentally on the main thread.. especially via PR_GetSystemInfo(). This
// will also improve latency of first real winsock operation
// ..
// If STS-thread is no longer needed this should still be run before exiting
char ignoredStackBuffer[255];
Unused << gethostname(ignoredStackBuffer, 255);
#endif
psm::InitializeSSLServerCertVerificationThreads();
gSocketThread = PR_GetCurrentThread();
{
MutexAutoLock lock(mLock);
mPollableEvent.reset(new PollableEvent()); <--- this line hangs
After doing a bit of code archeology, I got to this code:
if (NewTCPSocketPair(fd, true)) {
mReadFD = fd[0];
mWriteFD = fd[1];
// If the previous fails try without recv buffer increase (bug 1305436).
} else if (NewTCPSocketPair(fd, false)) {
mReadFD = fd[0];
mWriteFD = fd[1];
// If both fail, try the old version.
} else if (PR_NewTCPSocketPair(fd) == PR_SUCCESS) {
So it seems we first try NewTCPSocketPair twice, and when that fails we call PR_NewTCPSocketPair which is what ends up hanging.
It seems to be related to bug 698882 and bug 1305436.
A quick thing we can do to avoid blocking the main thread is to allocate the pollableEvent outside the mutex. So instead of this:
https://searchfox.org/mozilla-central/rev/07548591c7125357ea27e7c2dbe9d610ea2d8f35/netwerk/base/nsSocketTransportService2.cpp#1078-1079
MutexAutoLock lock(mLock);
mPollableEvent.reset(new PollableEvent());
We'd try this:
PollableEvent* pollable = new PollableEvent();
MutexAutoLock lock(mLock);
mPollableEvent.reset(pollable);
This still hangs the socket thread, but maybe that's a bit better than also hanging the main thread?
Assignee | ||
Comment 9•1 year ago
|
||
We avoid blocking the main thread by allocating the PollableEvent outside the mutex. Still has the potential to hang the socket thread, but the main thread remains responsive.
Assignee | ||
Comment 10•1 year ago
|
||
Thanks for digging into this one Valentin.
And the call to nsSocketTransportService::Run()
during shutdown -- some code is trying to restart the nsSocketTransportService during shutdown?
Comment 11•1 year ago
•
|
||
That's a great question, Andrew. Presumably we should not attempt to reinitialize the socket thread if we are during shutdown here:
https://searchfox.org/mozilla-central/rev/64eef6b342bca86930caf3ba403c0f8ce621b474/netwerk/base/nsIOService.cpp#1288
InitializeSocketTransportService();
Maybe we could add a check to the beginning of InitializeSocketTransportService that checks:
if (AppShutdown::IsInOrBeyond(ShutdownPhase::AppShutdownConfirmed)) {
return NS_OK;
}
I don't know if this explains why PR_NewTCPSocketPair is failing (I assume it was stuck from a previous call) but it's still a good way forward.
Comment 12•1 year ago
|
||
Comment 13•1 year ago
|
||
bugherder |
Reporter | ||
Updated•1 year ago
|
Reporter | ||
Comment 14•1 year ago
|
||
I see crashes with this signature from Nightly in builds that contain this fix. Are they the same as this or something different?
Assignee | ||
Comment 15•1 year ago
•
|
||
Thanks Ryan,
Here's one such crash and I'm quite sure this is the same hang, with the socket thread locked up in the same spot
ws2_32.dll select
nss3.dll socket_io_wait(long long, int, unsigned int) nsprpub/pr/src/md/windows/w95sock.c:695 cfi
nss3.dll _PR_MD_CONNECT(PRFileDesc*, PRNetAddr const*, unsigned int, unsigned int) nsprpub/pr/src/md/windows/w95sock.c:168 inlined
nss3.dll SocketConnect(PRFileDesc*, PRNetAddr const*, unsigned int) nsprpub/pr/src/io/prsocket.c:274 cfi
nss3.dll PR_Connect(PRFileDesc*, PRNetAddr const*, unsigned int) nsprpub/pr/src/io/priometh.c:155 inlined
nss3.dll PR_NewTCPSocketPair(PRFileDesc**) nsprpub/pr/src/io/prsocket.c:1696 cfi
xul.dll mozilla::net::PollableEvent::PollableEvent() netwerk/base/PollableEvent.cpp:179 cfi
xul.dll mozilla::net::nsSocketTransportService::Run() netwerk/base/nsSocketTransportService2.cpp:1079 cfi
xul.dll nsThread::ProcessNextEvent(bool, bool*)
We split this bug into two parts, preventing re-initializing of socket transport during shutdown in this one, Bug 1843384.
And then we made bug 1853010 to prevent the above hang (if it were still to occur) from locking up the main thread.
Evidently the first fix didn't capture the scenario (but I think the code is still valuable).
So I think we should land the code from bug 1853010.
Would it be cleaner to land it against this bug?
cc: valentin
Reporter | ||
Comment 16•1 year ago
|
||
If the patch landed here didn't make an observable difference, I think reopening the bug is reasonable.
Assignee | ||
Comment 17•1 year ago
|
||
Thanks, that sounds simplest and best to me too.
Reporter | ||
Updated•1 year ago
|
Assignee | ||
Comment 18•1 year ago
|
||
We avoid blocking the main thread by allocating the PollableEvent outside the mutex. Still has the potential to hang the socket thread, but the main thread remains responsive.
Comment 19•1 year ago
|
||
Comment 20•1 year ago
|
||
bugherder |
Comment 21•1 year ago
|
||
The patch landed in nightly and beta is affected.
:acreskey, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox118
towontfix
.
For more information, please visit BugBot documentation.
Assignee | ||
Comment 22•1 year ago
|
||
Set status-firefox118
to wontfix
as this is a shutdown hang and also because we'd like to observe the impact of this fix.
Reporter | ||
Comment 24•1 year ago
|
||
No crashes on Beta 119 - did you want to nominate this for ESR115 uplift? Both patches graft cleanly.
Assignee | ||
Comment 25•1 year ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #24)
No crashes on Beta 119 - did you want to nominate this for ESR115 uplift? Both patches graft cleanly.
I don't have the best understanding of the tradeoffs considered when nominating for an ESR uplift.
But we believe this to be a safer fix that only affects shutdown.
Assignee | ||
Comment 27•1 year ago
|
||
[Tracking Requested - why for this release]:
We believe that these patches are very safe (only affect shutdown) and we've seen that they stop the crash.
Reporter | ||
Comment 28•1 year ago
•
|
||
Approval requests are done on the patch (setting approval-mozilla-esr115
to ?
and filling out the form). Tracking is something else :)
Assignee | ||
Comment 29•1 year ago
|
||
Comment on attachment 9352531 [details]
Bug 1843384 - Crash in [@ shutdownhang | mozilla::net::nsSocketTransportService::SetOffline] r=#necko-reviewers,valentin!
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: The two patches for this bug have resolved a longer running (although relatively infrequent) shutdown hang.
- User impact if declined: A very small percentage of users will see the parent process hang on shutdown.
- Fix Landed on Version: 119
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This patch is low risk because it only runs when app shutdown is confirmed.
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 30•1 year ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #28)
Approval requests are done on the patch (setting
approval-mozilla-esr115
to?
and filling out the form). Tracking is something else :)
Thank you!
Reporter | ||
Comment 31•1 year ago
|
||
Comment on attachment 9352531 [details]
Bug 1843384 - Crash in [@ shutdownhang | mozilla::net::nsSocketTransportService::SetOffline] r=#necko-reviewers,valentin!
Approved for 115.5esr.
Reporter | ||
Updated•1 year ago
|
Comment 32•1 year ago
|
||
uplift |
Reporter | ||
Updated•1 year ago
|
Description
•