Closed Bug 1375344 Opened 7 years ago Closed 6 years ago

Crash in shutdownhang | kernelbase.dll@0xcaf18

Component: Core :: General (defect, P3)
Version: 54 Branch
Platform: x86, Windows 10
Status: RESOLVED WONTFIX
Tracking Status: firefox54 affected
Reporter: gchang
Assignee: Unassigned
Keywords: crash

This bug was filed from the Socorro interface and is report bp-6f6e1887-4868-468a-9fc5-c10ec0170621.
=============================================================
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtWaitForSingleObject 	
Ø 1 	kernelbase.dll 	kernelbase.dll@0xcaf18 	
Ø 2 	kernelbase.dll 	kernelbase.dll@0xcae71 	
3 	winscard.dll 	CSCardSubcontext::WaitForAvailable() 	
4 	winscard.dll 	CSCardSubcontext::ReleaseContext() 	
5 	winscard.dll 	CSCardUserContext::ReleaseContext() 	
6 	winscard.dll 	SCardReleaseContext 	
Ø 7 	tokenmgr.dll 	tokenmgr.dll@0x37d7 	
8 	kernel32.dll 	LoadEnclaveData 	
Ø 9 	tokenmgr.dll 	tokenmgr.dll@0x4d14 	
Ø 10 	tokenmgr.dll 	tokenmgr.dll@0x2dd1 	
Ø 11 	wdpkcs.dll 	wdpkcs.dll@0x351af 	
Ø 12 	wdpkcs.dll 	wdpkcs.dll@0x35caf 	
Ø 13 	wdpkcs.dll 	wdpkcs.dll@0x28328 	
14 	nss3.dll 	SECMOD_CancelWait 	security/nss/lib/pk11wrap/pk11util.c:1222
15 	xul.dll 	SmartCardMonitoringThread::~SmartCardMonitoringThread() 	security/manager/ssl/nsSmartCardMonitor.cpp:184
16 	xul.dll 	SmartCardThreadEntry::~SmartCardThreadEntry() 	security/manager/ssl/nsSmartCardMonitor.cpp:109
17 	xul.dll 	SmartCardThreadEntry::`scalar deleting destructor'(unsigned int) 	
18 	xul.dll 	nsNSSComponent::ShutdownNSS() 	security/manager/ssl/nsNSSComponent.cpp:2070
19 	xul.dll 	nsNSSComponent::DoProfileBeforeChange() 	security/manager/ssl/nsNSSComponent.cpp:2314
20 	xul.dll 	nsNSSComponent::Observe(nsISupports*, char const*, char16_t const*) 	security/manager/ssl/nsNSSComponent.cpp:2149
21 	xul.dll 	nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) 	xpcom/ds/nsObserverList.cpp:112
22 	xul.dll 	nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) 	xpcom/ds/nsObserverService.cpp:281
23 	xul.dll 	nsXREDirProvider::DoShutdown() 	toolkit/xre/nsXREDirProvider.cpp:1248
24 	xul.dll 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp:1417
25 	xul.dll 	mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset(ScopedXPCOMStartup*) 	obj-firefox/dist/include/mozilla/UniquePtr.h:345
26 	xul.dll 	mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::operator=(std::nullptr_t) 	obj-firefox/dist/include/mozilla/UniquePtr.h:313
27 	xul.dll 	XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4705
28 	xul.dll 	XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4768
29 	xul.dll 	mozilla::BootstrapImpl::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/Bootstrap.cpp:45
30 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:115
31 	firefox.exe 	__scrt_common_main_seh 	f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
32 	kernel32.dll 	BaseThreadInitThunk 	
33 	ntdll.dll 	__RtlUserThreadStart 	
34 	ntdll.dll 	_RtlUserThreadStart

This is the #7 top crash, and there has been a spike over the last 3 days.
Hi Nathan,
Can you help find someone to look at this?
Flags: needinfo?(nfroyd)
The crash in comment 0 comes from something NSS-y; there's another thread stuck doing smart card things:

Thread 26
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtWaitForAlertByThreadId 	
1 	ntdll.dll 	RtlpWaitOnAddressWithTimeout 	
2 	ntdll.dll 	RtlpWaitOnAddress 	
3 	ntdll.dll 	RtlpWaitOnCriticalSection 	
4 	ntdll.dll 	RtlpEnterCriticalSectionContended 	
5 	ntdll.dll 	RtlEnterCriticalSection 	
6 	nss3.dll 	PR_Lock 	nsprpub/pr/src/threads/combined/prulock.c:213
7 	nss3.dll 	SECMOD_WaitForAnyTokenEvent 	security/nss/lib/pk11wrap/pk11util.c:1148
8 	xul.dll 	SmartCardMonitoringThread::Execute() 	security/manager/ssl/nsSmartCardMonitor.cpp:344
9 	xul.dll 	SmartCardMonitoringThread::LaunchExecute(void*) 	security/manager/ssl/nsSmartCardMonitor.cpp:397
10 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397

which looks like we might have deadlocked?  ni keeler to evaluate.
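If this is a deadlock, the shape would be roughly the following: one thread holds a lock while it is blocked inside an external call, and a second thread then waits on that lock indefinitely. The sketch below is a portable C++ illustration only, not the NSS code. The names `module_lock`, `monitor_can_take_lock`, and `simulate` are hypothetical, and a timed lock attempt stands in for the unbounded `PR_Lock` wait so that the example terminates.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>

// Hypothetical stand-in for the monitoring thread: try to take the lock,
// but give up after `patience` instead of waiting forever.
bool monitor_can_take_lock(std::timed_mutex& lock,
                           std::chrono::milliseconds patience) {
  assert(patience.count() >= 0);
  if (lock.try_lock_for(patience)) {
    lock.unlock();
    return true;
  }
  return false;
}

// Simulates the suspected situation. When `main_holds_lock` is true, the
// calling thread keeps the lock for the whole attempt, the way a thread stuck
// in a third-party call would. Returns whether the second thread got the lock.
bool simulate(bool main_holds_lock) {
  std::timed_mutex module_lock;  // hypothetical; not the real NSS lock
  if (main_holds_lock) {
    module_lock.lock();
  }
  std::future<bool> attempt = std::async(std::launch::async, [&module_lock] {
    return monitor_can_take_lock(module_lock, std::chrono::milliseconds(50));
  });
  bool acquired = attempt.get();
  if (main_holds_lock) {
    module_lock.unlock();
  }
  return acquired;
}
```

With the lock free, `simulate(false)` lets the second thread through. With the lock held, `simulate(true)` reports that the second thread could not acquire it within its patience window, which is the state Thread 26 appears to be in, except that the real wait has no timeout.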
Flags: needinfo?(nfroyd) → needinfo?(dkeeler)
However, there are a few other crashes that I looked at with this signature that look more like network cache crashes.  For instance:

https://crash-stats.mozilla.com/report/index/b7c4b6cb-5e1b-43d4-b2b0-868420170623
https://crash-stats.mozilla.com/report/index/e3e93836-348b-4908-bccb-1cd280170623
https://crash-stats.mozilla.com/report/index/ca9dbe9c-21da-4d49-a85f-8b8bd0170623

The first one says the main thread is stuck:

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtWaitForSingleObject 	
Ø 1 	kernelbase.dll 	kernelbase.dll@0xcaf18 	
Ø 2 	kernelbase.dll 	kernelbase.dll@0xcae71 	
3 	xul.dll 	mozilla::net::detail::BlockingIOWatcher::WatchAndCancel(mozilla::Monitor&) 	netwerk/cache2/CacheIOThread.cpp:189
4 	xul.dll 	mozilla::net::CacheIOThread::CancelBlockingIO() 	netwerk/cache2/CacheIOThread.cpp:417
5 	xul.dll 	mozilla::net::ShutdownEvent::PostAndWait() 	netwerk/cache2/CacheFileIOManager.cpp:587
6 	xul.dll 	mozilla::net::CacheFileIOManager::Shutdown() 	netwerk/cache2/CacheFileIOManager.cpp:1160
7 	xul.dll 	mozilla::net::CacheObserver::Observe(nsISupports*, char const*, char16_t const*) 	netwerk/cache2/CacheObserver.cpp:542
8 	xul.dll 	nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) 	xpcom/ds/nsObserverList.cpp:112
9 	xul.dll 	nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) 	xpcom/ds/nsObserverService.cpp:281

and it looks like another thread is off doing cache I/O:

Thread 18
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtClose 	
Ø 1 	KERNELBASE.dll 	KERNELBASE.dll@0xcadc9 	
2 	nss3.dll 	_MD_CloseFile 	nsprpub/pr/src/md/windows/w95io.c:403
3 	nss3.dll 	FileClose 	nsprpub/pr/src/io/prfile.c:207
4 	nss3.dll 	PR_Close 	nsprpub/pr/src/io/priometh.c:104
5 	xul.dll 	mozilla::net::CacheFileIOManager::MaybeReleaseNSPRHandleInternal(mozilla::net::CacheFileHandle*, bool) 	netwerk/cache2/CacheFileIOManager.cpp:2323
6 	xul.dll 	mozilla::net::ReleaseNSPRHandleEvent::Run() 	netwerk/cache2/CacheFileIOManager.cpp:857
7 	xul.dll 	mozilla::net::CacheIOThread::LoopOneLevel(unsigned int) 	netwerk/cache2/CacheIOThread.cpp:565
8 	xul.dll 	mozilla::net::CacheIOThread::ThreadFunc() 	netwerk/cache2/CacheIOThread.cpp:503
9 	xul.dll 	mozilla::net::CacheIOThread::ThreadFunc(void*) 	netwerk/cache2/CacheIOThread.cpp:446
10 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397

Or the second one, where the main thread is hanging:

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtWaitForSingleObject 	
Ø 1 	kernelbase.dll 	kernelbase.dll@0xcaf18 	
Ø 2 	kernelbase.dll 	kernelbase.dll@0xcae71 	
3 	nss3.dll 	_PR_MD_WAIT_CV 	nsprpub/pr/src/md/windows/w95cv.c:248
4 	nss3.dll 	_PR_WaitCondVar 	nsprpub/pr/src/threads/combined/prucv.c:172
5 	nss3.dll 	PR_WaitCondVar 	nsprpub/pr/src/threads/combined/prucv.c:525
6 	xul.dll 	mozilla::CondVar::Wait(unsigned int) 	obj-firefox/dist/include/mozilla/CondVar.h:79
7 	xul.dll 	mozilla::net::ShutdownEvent::PostAndWait() 	netwerk/cache2/CacheFileIOManager.cpp:582
8 	xul.dll 	mozilla::net::CacheFileIOManager::Shutdown() 	netwerk/cache2/CacheFileIOManager.cpp:1160
9 	xul.dll 	mozilla::net::CacheObserver::Observe(nsISupports*, char const*, char16_t const*) 	netwerk/cache2/CacheObserver.cpp:542
10 	xul.dll 	nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) 	xpcom/ds/nsObserverList.cpp:112

and another thread is off doing things, slightly different from the previous:

Thread 20
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtSetInformationFile 	
Ø 1 	KERNELBASE.dll 	KERNELBASE.dll@0xdc945 	
Ø 2 	iNetSafe.dll 	iNetSafe.dll@0x5630 	
Ø 3 	KERNELBASE.dll 	KERNELBASE.dll@0xdc75a 	
Ø 4 	KERNELBASE.dll 	KERNELBASE.dll@0xdc736 	
5 	xul.dll 	nsLocalFile::CopySingleFile(nsIFile*, nsIFile*, nsAString_internal const&, unsigned int) 	xpcom/io/nsLocalFileWin.cpp:1982
6 	xul.dll 	nsLocalFile::CopyMove(nsIFile*, nsAString_internal const&, unsigned int) 	xpcom/io/nsLocalFileWin.cpp:2103
7 	xul.dll 	nsLocalFile::MoveToNative(nsIFile*, nsACString_internal const&) 	xpcom/io/nsLocalFileWin.cpp:3628
8 	xul.dll 	mozilla::net::CacheFileIOManager::DoomFileInternal(mozilla::net::CacheFileHandle*, mozilla::net::CacheFileIOManager::PinningDoomRestriction) 	netwerk/cache2/CacheFileIOManager.cpp:2144
9 	xul.dll 	mozilla::net::DoomFileEvent::Run() 	netwerk/cache2/CacheFileIOManager.cpp:782
10 	xul.dll 	mozilla::net::CacheIOThread::LoopOneLevel(unsigned int) 	netwerk/cache2/CacheIOThread.cpp:565
11 	xul.dll 	mozilla::net::CacheIOThread::ThreadFunc() 	netwerk/cache2/CacheIOThread.cpp:503
12 	xul.dll 	mozilla::net::CacheIOThread::ThreadFunc(void*) 	netwerk/cache2/CacheIOThread.cpp:446
13 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397

ni to mayhemer for cache knowledge, and ni back to gchang to see if it's possible to have these stacks processed differently so the different crashes show up as, well, different crashes.
Flags: needinfo?(honzab.moz)
Flags: needinfo?(gchang)
There's not much we can do about stacks like the one in comment 0: frames 0 through 13 are completely out of our control. That's a third-party PKCS#11 module (i.e. external code the user loaded into Firefox's memory space). For a long time I've been thinking about writing a PKCS#11 module that would load another given module in a child process, which would prevent things like this from hanging or crashing Firefox. Since only a small percentage of our users actually use PKCS#11 modules, though, it's been hard to justify the effort.
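The child-process idea can be illustrated with a minimal sketch: run the untrusted call in a separate process and stop waiting for it after a deadline, so a hang in the external code cannot block the parent. This is an assumption-laden illustration, not a PKCS#11 proxy. It uses POSIX `fork`/`waitpid` for brevity (a Windows version would need `CreateProcess` and an IPC channel), and `run_isolated` and `CallResult` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class CallResult { Completed, TimedOut, Failed };

// Runs `risky_call` in a child process and waits at most `timeout` for it.
// A child that is still running at the deadline is killed, so the parent
// never blocks indefinitely on code it does not control.
CallResult run_isolated(const std::function<void()>& risky_call,
                        std::chrono::milliseconds timeout) {
  assert(timeout.count() >= 0);
  pid_t child = fork();
  if (child < 0) {
    return CallResult::Failed;
  }
  if (child == 0) {
    risky_call();  // stand-in for the call into the third-party module
    _exit(0);
  }
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  for (;;) {
    int status = 0;
    pid_t finished = waitpid(child, &status, WNOHANG);
    if (finished == child) {
      bool ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
      return ok ? CallResult::Completed : CallResult::Failed;
    }
    if (finished < 0) {
      return CallResult::Failed;
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      kill(child, SIGKILL);
      waitpid(child, nullptr, 0);  // reap the killed child
      return CallResult::TimedOut;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  }
}
```

A call that returns promptly yields `CallResult::Completed`, while one that never returns yields `CallResult::TimedOut` once the deadline passes. A real proxy module would also have to forward every PKCS#11 call and its results across the process boundary, which is where most of the effort mentioned above would go.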
Flags: needinfo?(dkeeler)
See Also: → 1248818
(In reply to Nathan Froyd [:froydnj] from comment #3)
> However, there are a few other crashes that I looked at with this signature
> that look more like network cache crashes.  For instance:

All of these are known.  The code in the first pair of stacks is already trying to handle the case where the IO thread that the main thread is waiting on does not shut down, by telling it to cancel its current sync IO operation.  Regardless of what the Windows documentation says about the function used, in most cases it doesn't work anyway.  It's not worse than what we had before, though.  There is also a switch to turn this "cancel sync IO" feature off, I think, but we may get back to an even worse state.  Note that after early shutdown we forbid most if not all of the cache background IO and just leak the opened file handles (only in release builds without leak checking!).

I've spent a huge amount of time on this already, and I'm not sure what better we could do besides using mmap or fully async IO on the versions of Windows that support that.  Note that sync IO on Windows often gets stuck, probably for extremely long times; we don't know the cause.  The crash rate has become lower recently, so we decided not to invest more time here.
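The wait-then-cancel pattern described here can be sketched in portable C++. This is not the `ShutdownEvent::PostAndWait` or `BlockingIOWatcher::WatchAndCancel` code from the stacks; `ShutdownWaiter`, `WaitWithCancel`, and `simulate_shutdown` are hypothetical names. The `cancel_io` callback stands in for the Windows cancellation call, presumably something like `CancelSynchronousIo`, which is an assumption since the comment does not name the function.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

class ShutdownWaiter {
 public:
  // Called by the IO thread once it has finished shutting down.
  void NotifyDone() {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      done_ = true;
    }
    cond_.notify_all();
  }

  // Waits for NotifyDone(). Each time `interval` elapses without it, asks the
  // IO thread to cancel its blocking operation, up to `max_attempts` times.
  // Returns the number of cancel attempts that were made.
  int WaitWithCancel(std::chrono::milliseconds interval, int max_attempts,
                     const std::function<void()>& cancel_io) {
    assert(max_attempts >= 0);
    std::unique_lock<std::mutex> lock(mutex_);
    int attempts = 0;
    while (!cond_.wait_for(lock, interval, [this] { return done_; })) {
      if (attempts == max_attempts) {
        break;  // give up; the caller decides what to do about the hang
      }
      ++attempts;
      lock.unlock();
      cancel_io();  // hypothetical hook for the platform cancel call
      lock.lock();
    }
    return attempts;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  bool done_ = false;
};

// Simulated IO thread that stays "blocked" until it is cancelled.
int simulate_shutdown() {
  ShutdownWaiter waiter;
  std::atomic<bool> cancelled{false};
  std::thread io_thread([&waiter, &cancelled] {
    while (!cancelled.load()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    waiter.NotifyDone();
  });
  int attempts = waiter.WaitWithCancel(
      std::chrono::milliseconds(20), 5, [&cancelled] { cancelled.store(true); });
  io_thread.join();
  return attempts;
}
```

In the simulation the cancel request always takes effect, so the wait ends after at least one attempt. The point made above is that the real cancel call frequently has no effect, in which case a loop like this exhausts its attempts and the main thread is still stuck.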
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #5)
Thanks for the detailed response, very helpful!  Do you have links to the other crashes/hangs?
Flags: needinfo?(honzab.moz)
Flags: needinfo?(gchang)
Moving this to P3 given the low number of crashes and the lack of resources to tackle it.
Priority: -- → P3
Closing because no crashes have been reported in the last 12 weeks.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX