If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Crash in shutdownhang | PR_MD_WAIT_CV | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsThread::nsChainedEventQueue::GetEvent

RESOLVED WORKSFORME

Status

()

Core
Networking: HTTP
--
critical
RESOLVED WORKSFORME
9 months ago
3 months ago

People

(Reporter: njn, Unassigned)

Tracking

({crash})

unspecified
Unspecified
Windows 10
crash
Points:
---

Firefox Tracking Flags

(firefox52 wontfix, firefox53 affected)

Details

(Whiteboard: [necko-next], crash signature)

(Reporter)

Description

9 months ago
This bug was filed from the Socorro interface and is 
report bp-d4fc7a65-cd4e-4db2-95d1-8cc152161210.
=============================================================

First occurrence of this was in Nightly 20161119030204. Since then it has occurred 57 times. Over the past week it's the #19 topcrash on Nightly; no sign of it on Aurora, which makes sense because the last merge date was November 14.

This is the regression window for that Nightly:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=28e2a6dde76ab6ad4464a3662df1bd57af04398a&tochange=f09e137ead39230eaa94f47988ccce2cfcda4195

But this crash is infrequent enough that it doesn't happen every day, so it's possible the true cause landed a day or two before that.

Honza, any ideas? The stack trace for the hanging thread contains network shutdown stuff -- though it varies a bit from crash to crash -- so it looks like it's related to networking.
Flags: needinfo?(honzab.moz)
Flags: needinfo?(honzab.moz) → needinfo?(mcmanus)
this is a pr_poll shutdiown hang - which is certainly not new (but is usually not win 10) .. maybe the signature is different or something about it is different.

dragana is the expert
Flags: needinfo?(mcmanus) → needinfo?(dd.mozilla)
We still do not have an easy way to analyze this shutdown hangs. The hangs here have may different causes, I had look at 20 crashes:


https://crash-stats.mozilla.com/report/index/ba47ab2d-0034-4b0c-bf05-d93c22161209 and https://crash-stats.mozilla.com/report/index/b14d7053-c702-4486-82e7-012212161212 and https://crash-stats.mozilla.com/report/index/df190ed9-44e4-4ffd-afcf-0fc272161212 - hang in mozilla::dom::workers::RuntimeService::Cleanup(); the socketThread shutdown have not started yet. In the first and the third crash, there is a thread that is in mozilla::dom::workers::WorkerPrivate::DoRunLoop(JSContext*) 	dom/workers/WorkerPrivate.cpp:4573, but I do not know what it is waiting for. In the second crash:
8 	xul.dll 	js::gc::GCRuntime::gc(JSGCInvocationKind, JS::gcreason::Reason) 	js/src/jsgc.cpp:6398
9 	xul.dll 	mozilla::dom::workers::WorkerPrivate::GarbageCollectInternal(JSContext*, bool, bool) 	dom/workers/WorkerPrivate.cpp:6267
10 	xul.dll 	`anonymous namespace'::GarbageCollectRunnable::WorkerRun 	dom/workers/WorkerPrivate.cpp:1394
11 	xul.dll 	mozilla::dom::workers::WorkerRunnable::Run()



https://crash-stats.mozilla.com/report/index/025116c3-a3d0-4f58-8142-7ef942161210 and https://crash-stats.mozilla.com/report/index/fa032da1-a037-4107-9f44-dce8c2161210 and https://crash-stats.mozilla.com/report/index/ebc28ddb-f9bf-4965-a9a3-af2442161212 - hangs in mozilla::dom::XMLHttpRequestMainThread::SendInternal(mozilla::dom::XMLHttpRequestMainThread::RequestBodyBase const*) 	dom/xhr/XMLHttpRequestMainThread.cpp:3011, but I am not sure what it is waiting for, maybe the request to finish, timer to expire etc.


https://crash-stats.mozilla.com/report/index/186dcd92-8c38-4f54-b4df-7837a2161210 - js runs on the main thread 


https://crash-stats.mozilla.com/report/index/e5083587-2a6e-48d7-931e-463962161210 and https://crash-stats.mozilla.com/report/index/d4fc7a65-cd4e-4db2-95d1-8cc152161210 and https://crash-stats.mozilla.com/report/index/47dc423a-34ae-44b9-a936-104162161212 and https://crash-stats.mozilla.com/report/index/4ed8dcf3-d5ba-4102-a193-a5e1e2161212 - our normal poll hangs



https://crash-stats.mozilla.com/report/index/fae718bb-0f61-4a77-b0fc-6c24a2161210 - This is waiting on socket thread to shutdown and socket thread is doing:
 	Module 	Signature 	Source
0 	ntdll.dll 	NtWaitForSingleObject 	
1 	KERNELBASE.dll 	WaitForSingleObjectEx 	
2 	nss3.dll 	PR_MD_WAIT_CV 	nsprpub/pr/src/md/windows/w95cv.c:248
3 	nss3.dll 	PR_WaitCondVar 	nsprpub/pr/src/threads/combined/prucv.c:172
4 	nss3.dll 	PR_WaitCondVar 	nsprpub/pr/src/threads/combined/prucv.c:525
5 	nss3.dll 	PR_Wait 	nsprpub/pr/src/threads/prmon.c:298
6 	xul.dll 	mozilla::ReentrantMonitor::Wait(unsigned int) 	obj-firefox/dist/include/mozilla/ReentrantMonitor.h:91
7 	xul.dll 	nsThread::Init() 	xpcom/threads/nsThread.cpp:660
8 	xul.dll 	nsThreadManager::NewThread(unsigned int, unsigned int, nsIThread**) 	xpcom/threads/nsThreadManager.cpp:260
9 	xul.dll 	nsThreadPool::PutEvent(already_AddRefed<nsIRunnable>, unsigned int) 	xpcom/threads/nsThreadPool.cpp:107
10 	xul.dll 	nsThreadPool::Dispatch(already_AddRefed<nsIRunnable>, unsigned int) 	xpcom/threads/nsThreadPool.cpp:275
11 	xul.dll 	mozilla::net::nsStreamTransportService::Dispatch(already_AddRefed<nsIRunnable>, unsigned int) 	netwerk/base/nsStreamTransportService.cpp:490
12 	xul.dll 	nsIEventTarget::Dispatch(nsIRunnable*, unsigned int) 	obj-firefox/dist/include/nsIEventTarget.h:37
13 	xul.dll 	nsAStreamCopier::PostContinuationEvent_Locked() 	xpcom/io/nsStreamUtils.cpp:476
14 	xul.dll 	nsAStreamCopier::PostContinuationEvent() 	xpcom/io/nsStreamUtils.cpp:467
15 	xul.dll 	nsAStreamCopier::OnOutputStreamReady(nsIAsyncOutputStream*) 	xpcom/io/nsStreamUtils.cpp:435
16 	xul.dll 	mozilla::net::nsSocketOutputStream::OnSocketReady(nsresult) 	netwerk/base/nsSocketTransport2.cpp:552
17 	xul.dll 	mozilla::net::nsSocketTransport::OnSocketEvent(unsigned int, nsresult, nsISupports*) 	netwerk/base/nsSocketTransport2.cpp:1845
18 	xul.dll 	mozilla::net::nsSocketEvent::Run() 	netwerk/base/nsSocketTransport2.cpp:85
19 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1213
20 	xul.dll 	mozilla::net::nsSocketTransportService::Run() 	netwerk/base/nsSocketTransportService2.cpp:939

https://crash-stats.mozilla.com/report/index/806797aa-be7f-4758-84f2-de4e82161211 and https://crash-stats.mozilla.com/report/index/aa60fdd0-7eb3-4522-96f8-0e8a62161211 - socketThread is already shutdown, We are waiting for the main thread to shutdown and I am not sure what the main thread is waiting for.

https://crash-stats.mozilla.com/report/index/1ccd079f-dcf5-4272-8f03-1de1e2161212 - not sure where it hangs

https://crash-stats.mozilla.com/report/index/769c61f2-b5d3-4878-96f9-2d5802161212 - hangs in trying to shutdown thread pool in 10 	xul.dll 	nsThread::Shutdown() 	xpcom/threads/nsThread.cpp:983
11 	xul.dll 	nsThreadPool::Shutdown() 	xpcom/threads/nsThreadPool.cpp:328
12 	xul.dll 	mozilla::net::nsStreamTransportService::Observe(nsISupports*, char const*, char16_t const*) 	netwerk/base/nsStreamTransportService.cpp:556


ttps://crash-stats.mozilla.com/report/index/4241b3dc-fbaa-4611-9c21-248d32161212 - socket thread is waiting for:
10 	xul.dll 	nsThread::Shutdown() 	xpcom/threads/nsThread.cpp:983
11 	xul.dll 	nsThreadPool::Shutdown() 	xpcom/threads/nsThreadPool.cpp:328
12 	xul.dll 	mozilla::psm::StopSSLServerCertVerificationThreads() 	security/manager/ssl/SSLServerCertVerification.cpp:207



https://crash-stats.mozilla.com/report/index/a94cf788-8027-49dc-868b-6dab32161212 -  I am not sure, but it is not socketThread, it is already closed.


https://crash-stats.mozilla.com/report/index/b5fe58f4-71bf-4a23-8420-173312161212 and https://crash-stats.mozilla.com/report/index/0cce1bcb-ce6d-493a-9039-817f32161212  - hang in SocketClose
Flags: needinfo?(dd.mozilla)
6 crashes in comment #2 are know shutdown hangs cause by a hang in poll and socketClose. the others are different. I assume that the number of the other crashes rose because of a regression.
I have only looked at Windows 10 crashes.
Dragana, seems like you've got this. Feel free to re-triage as appropriate.
Whiteboard: [necko-next]
Ben, can you look at workers crash? I am not sure your are the right person.

Nathan, did something change in the thread pool shutdown, we are hanging in it from a couple of different code paths?

Thanks!
Flags: needinfo?(nfroyd)
Flags: needinfo?(bkelly)
(In reply to Dragana Damjanovic [:dragana] from comment #6)
> Nathan, did something change in the thread pool shutdown, we are hanging in
> it from a couple of different code paths?

We didn't change anything in thread pool shutdown recently AFAIK, but...

(In reply to Nicholas Nethercote [:njn] from comment #0)
> First occurrence of this was in Nightly 20161119030204. Since then it has
> occurred 57 times. Over the past week it's the #19 topcrash on Nightly; no
> sign of it on Aurora, which makes sense because the last merge date was
> November 14.

...we did change some code around the internal event queues used by threads in bug 1306591, which landed on 2016-11-08.  Crashes showing up a week and a half later seem kind of unlikely to be correlated with that, but it's possible, I suppose?  We also changed some internal nsThread bits around for requestIdle callback (bug 1198381, landed 2016-10-30) and various followup bugs, though I don't think any of the followup bugs were that significant.

I'm not sure if that makes things any clearer.
Flags: needinfo?(nfroyd)

Comment 8

9 months ago

(In reply to Dragana Damjanovic [:dragana] from comment #2)
> https://crash-stats.mozilla.com/report/index/ba47ab2d-0034-4b0c-bf05-
> d93c22161209 and
> https://crash-stats.mozilla.com/report/index/b14d7053-c702-4486-82e7-
> 012212161212 and
> https://crash-stats.mozilla.com/report/index/df190ed9-44e4-4ffd-afcf-
> 0fc272161212 - hang in mozilla::dom::workers::RuntimeService::Cleanup(); the
> socketThread shutdown have not started yet. In the first and the third
> crash, there is a thread that is in
> mozilla::dom::workers::WorkerPrivate::DoRunLoop(JSContext*) 
> dom/workers/WorkerPrivate.cpp:4573, but I do not know what it is waiting
> for. In the second crash:
> 8 	xul.dll 	js::gc::GCRuntime::gc(JSGCInvocationKind, JS::gcreason::Reason) 
> js/src/jsgc.cpp:6398
> 9 	xul.dll 
> mozilla::dom::workers::WorkerPrivate::GarbageCollectInternal(JSContext*,
> bool, bool) 	dom/workers/WorkerPrivate.cpp:6267
> 10 	xul.dll 	`anonymous namespace'::GarbageCollectRunnable::WorkerRun 
> dom/workers/WorkerPrivate.cpp:1394
> 11 	xul.dll 	mozilla::dom::workers::WorkerRunnable::Run()
> 

Andrea do you have any thoughts on the worker specifics in comment 2?  It seems we can get into a case where the worker thread is stuck waiting for control runnables, but since the browser is shutting down we don't end up sending them the control runnable.
Flags: needinfo?(bkelly) → needinfo?(amarchesini)
See Also: → bug 1324856
https://crash-stats.mozilla.com/signature/?build_id=20170102004004&product=Firefox&release_channel=aurora&_sort=-date&platform=Windows&signature=shutdownhang%20%7C%20mozilla%3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent%20%7C%20nsThread%3A%3AnsChainedEventQueue%3A%3AGetEvent&date=%3E%3D2017-01-02T00%3A00%3A00.000Z&date=%3C2017-01-04T13%3A23%3A00.000Z#aggregations on aurora looks similar to this one.
See Also: bug 1324856
Aurora has what appears to be a very similar version of this?
https://crash-stats.mozilla.com/signature/?product=Firefox&signature=shutdownhang%20%7C%20mozilla%3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent%20%7C%20nsThread%3A%3AnsChainedEventQueue%3A%3AGetEvent
status-firefox52: --- → affected
status-firefox53: --- → affected
It seems that the worker is waiting for something. The runnable is blocked, and the worker cannot be released.

3 	xul.dll 	js::ConditionVariable::wait(js::LockGuard<js::Mutex>&) 	js/src/threading/windows/ConditionVariable.cpp:364
4 	xul.dll 	js::GCHelperState::waitBackgroundSweepEnd() 	js/src/jsgc.cpp:3356
5 	xul.dll 	js::gc::GCRuntime::incrementalCollectSlice(js::SliceBudget&, JS::gcreason::Reason, js::AutoLockForExclusiveAccess&) 	js/src/jsgc.cpp:5927
6 	xul.dll 	js::gc::GCRuntime::gcCycle(bool, js::SliceBudget&, JS::gcreason::Reason) 	js/src/jsgc.cpp:6183
7 	xul.dll 	js::gc::GCRuntime::collect(bool, js::SliceBudget, JS::gcreason::Reason) 	js/src/jsgc.cpp:6337
8 	xul.dll 	js::gc::GCRuntime::gc(JSGCInvocationKind, JS::gcreason::Reason) 	js/src/jsgc.cpp:6398
9 	xul.dll 	mozilla::dom::workers::WorkerPrivate::GarbageCollectInternal(JSContext*, bool, bool) 	dom/workers/WorkerPrivate.cpp:6267
Flags: needinfo?(amarchesini)
The similar signature mentioned in comment 10 was filed as bug 1329966.
Too late for firefox 52, mass-wontfix.
status-firefox52: affected → wontfix

Comment 14

3 months ago
The crash no longer happens after build id 20170228030203.  Shall we close it by same reason as bug 1329966?
Flags: needinfo?(madperson)

Comment 15

3 months ago
probably yes - those shifting shutdownhang signatures are really getting confusing. the crash in comment #0 seems to be the same one as discussed in bug 1158189.
Status: NEW → RESOLVED
Last Resolved: 3 months ago
Flags: needinfo?(madperson)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.