Closed Bug 1405290 Opened 7 years ago Closed 6 years ago

Shutdownhang in mozilla::dom::workers::RuntimeService::Cleanup rising in Firefox 56

Categories

(Core :: DOM: Workers, defect, P2)

56 Branch
All
Windows
defect

Tracking


RESOLVED DUPLICATE of bug 1437575
Tracking Status
firefox-esr52 --- wontfix
firefox56 --- wontfix
firefox57 + wontfix
firefox58 --- wontfix
firefox59 + wontfix
firefox60 + fixed
firefox61 + fixed

People

(Reporter: philipp, Assigned: baku)

References

(Blocks 3 open bugs)

Details

(4 keywords, Whiteboard: [tbird crash])

Crash Data

Attachments

(4 files)

This bug was filed from the Socorro interface and is 
report bp-6e92b73c-bbec-4833-9b6e-524ec0170929.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtWaitForAlertByThreadId 	
1 	ntdll.dll 	RtlSleepConditionVariableCS 	
2 	kernelbase.dll 	SleepConditionVariableCS 	
3 	mozglue.dll 	mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_windows.cpp:58
4 	xul.dll 	mozilla::CondVar::Wait(unsigned int) 	obj-firefox/dist/include/mozilla/CondVar.h:68
5 	xul.dll 	nsEventQueue::GetEvent(bool, nsIRunnable**, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsEventQueue.cpp:76
6 	xul.dll 	nsThread::nsChainedEventQueue::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsThread.cpp:839
7 	xul.dll 	nsThread::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsThread.cpp:1297
8 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1380
9 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:480
10 	xul.dll 	mozilla::dom::workers::RuntimeService::Cleanup() 	dom/workers/RuntimeService.cpp:2239
11 	xul.dll 	mozilla::dom::workers::RuntimeService::Observe(nsISupports*, char const*, char16_t const*) 	dom/workers/RuntimeService.cpp:2761
12 	xul.dll 	nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) 	xpcom/ds/nsObserverList.cpp:112
13 	xul.dll 	nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) 	xpcom/ds/nsObserverService.cpp:295
14 	xul.dll 	mozilla::ShutdownXPCOM(nsIServiceManager*) 	xpcom/build/XPCOMInit.cpp:894
15 	xul.dll 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp:1475
16 	xul.dll 	mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset(ScopedXPCOMStartup*) 	obj-firefox/dist/include/mozilla/UniquePtr.h:345
17 	xul.dll 	XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4812
18 	xul.dll 	XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4879
19 	xul.dll 	mozilla::BootstrapImpl::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/Bootstrap.cpp:45
20 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:115
21 	firefox.exe 	__scrt_common_main_seh 	f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
22 	kernel32.dll 	BaseThreadInitThunk 	
23 	ntdll.dll 	__RtlUserThreadStart 	
24 	ntdll.dll 	_RtlUserThreadStart
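The crashing thread above is not itself buggy: frames 5-10 show RuntimeService::Cleanup() draining events on the main thread until every worker has gone away, sleeping in CondVar::Wait() between events. If a worker never signals completion, that loop never exits and the shutdown terminator kills the process. A minimal sketch of that shape, with hypothetical names (std::this_thread::yield() stands in for NS_ProcessNextEvent()):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Simulated count of still-alive workers; Cleanup() drains events until it
// reaches zero. In the real hang, nothing ever drives it to zero.
std::atomic<int> gActiveWorkers{1};

void DrainWorkers() {
  while (gActiveWorkers.load() > 0) {
    std::this_thread::yield();  // stands in for processing one main-thread event
  }
}
```

In this sketch another thread must eventually decrement the count, mirroring the worker threads notifying the main thread that they have shut down.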

these shutdownhangs have been around for a long time, but their frequency has noticeably increased in firefox 56. in version 56.0, shutdownhangs with mozilla::dom::workers::RuntimeService::Cleanup in their proto signature account for 5.7% of all browser crashes, while they were just 0.7% in 55.0.3 ...

the volume of those crashes during the 56.0b cycle seems to have jumped up with 56.0b12 - the changes in b12 were: https://mzl.la/2vVyxo4

Correlations for Firefox Release
(97.17% in signature vs 07.61% overall) Module "pcacli.dll" = true ("Program Compatibility Assistant Client Module")
(96.83% in signature vs 09.87% overall) Module "sfc_os.dll" = true

Correlations for Firefox Beta
(99.01% in signature vs 01.98% overall) Module "twinapi.dll" = true
(96.06% in signature vs 10.48% overall) Module "mpr.dll" = true [100.0% vs 04.88% if platform_version = 6.3.9600]
(100.0% in signature vs 07.35% overall) Module "explorerframe.dll" = true
(98.52% in signature vs 13.07% overall) Module "urlmon.dll" = true [100.0% vs 06.54% if platform_version = 6.3.9600]
(99.01% in signature vs 07.54% overall) Module "FWPUCLNT.DLL" = true
(96.06% in signature vs 04.56% overall) Module "ntshrui.dll" = true
(96.06% in signature vs 04.57% overall) Module "cscapi.dll" = true
(93.10% in signature vs 01.20% overall) Module "pcacli.dll" = true
See Also: → 1267692
Assignee: nobody → amarchesini
Priority: -- → P1
I don't think we can do anything for 56. I'm planning to work on shutdown crashes and related issues as my next task. We are probably talking about 58.
OK, marking wontfix for 56 and 57. Since I'm not even sure we'll get to this in 58, I've marked it as fix-optional.
This seems to be spiking recently. Is there anything we can do for 57?
Flags: needinfo?(amarchesini)
I added one similar signature I have seen in Mac crash stats that is affecting 57/58, but not 56: http://bit.ly/2kyerii
Crash Signature: [@ shutdownhang | NtWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] → [@ shutdownhang | NtWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] [@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::dom::workers::RuntimeService::Cl…
Many user comments on these crashes. I think we should reassess this issue for 57  (and maybe even 56) as it is high volume and seems to annoy users. Is there anything we can do here?
Since we are tracking this for 57, it would be great if :baku could weigh in on Comment 3.
Can we get some common URLs people are hitting this with? I know it's a shutdown issue but without an STR or sites to reproduce it's hard to get anywhere here :)
Flags: needinfo?(amarchesini) → needinfo?(lhenry)
The timing of the spike here looks similar to bug 1404105 (stemming from bug 1047098).
Maybe something around http://searchfox.org/mozilla-central/source/browser/base/content/sanitize.js#299. Andrew, you were involved with bug 1047098; any thoughts related to this worker-related shutdown crash?
Flags: needinfo?(bugmail)
The urls seem to mostly be FB, gmail, youtube and maybe I could pick out weather pages, but nothing that seems obviously reproducible. Many comments do mention having shut down firefox and walked away, then come back to the crash report, but I think even more comments are about Firefox actually crashing.
Flags: needinfo?(lhenry)
This needs more investigation, but the short answer is this does not appear to have anything to do with sanitize.js changes.
- For release and trunk builds, we're ending up with 1 or 2 workers that are hanging out in their run loops at "xpcom-shutdown-threads", not having noticed that they should be dead.
  - My best guess is that an edge case is resulting in WorkerHolders releasing themselves on the wrong thread.  Although we have NS_ASSERT_OWNINGTHREAD and AssertIsOnWorkerThread() checks, they both effectively only happen on DEBUG builds.  (NS_ASSERT_OWNINGTHREAD is enabled on nightlies, but only if !MOZ_PROFILING[1], and it so happens that every nightly config does "--enable-profiling", so that's not helping.  AssertIsOnWorkerThread is explicitly DEBUG-only.)
- For esr52 builds, I'm seeing more cases where the workers are actively trying to GC.  In some cases, the GC appears to be hung waiting on joining GlobalHelperThreadState locks.  In at least one case, 8 worker threads were all busy in GC.  I'm really hoping that's not bitcoin mining related.

Options for further diagnosis are:
- Ensure that we trigger a crash on non-DEBUG nightly builds if WorkerHolder::ReleaseWorkerInternal occurs off the worker thread.
- Give the RuntimeService "xpcom-shutdown-threads" Cleanup() function some number of seconds of grace before it triggers a crash that identifies whether HasActiveHolders() is returning true for a given worker, indicating if it's due to mChildWorkers, mTimeouts, or mHolders.  If mHolders, identify the holder that betrayed us.  The simplest approach for all of these is just to have distinct MOZ_CRASH calls, adding a virtual WorkerHolder::BadWorkerHolderCrashYourself to help.

Options for blind mitigation:
- Have RuntimeService::Cleanup dispatch a ControlRunnable at all still-alive workers.  The arrival of such a runnable would address the wrong-thread RemoveHolder call, if that's what's happening.


1: https://searchfox.org/mozilla-central/rev/1285ae3e810e2dbf2652c49ee539e3199fcfe820/xpcom/base/nsDebug.h#361
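The first diagnosis option above — an owning-thread check that stays fatal in release builds — can be sketched standalone. This uses std::thread::id in place of Gecko's GetCurrentVirtualThread() and abort() in place of MOZ_CRASH; all names here are hypothetical, and IsOnOwningThread() is split out so the check can be exercised without crashing:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <thread>

// Owning-thread check active even in release builds (unlike
// NS_ASSERT_OWNINGTHREAD, which is compiled out as described above).
class WorkerHolderSketch {
 public:
  WorkerHolderSketch() : mOwningThread(std::this_thread::get_id()) {}

  bool IsOnOwningThread() const {
    return std::this_thread::get_id() == mOwningThread;
  }

  void AssertIsOnOwningThread() const {
    if (!IsOnOwningThread()) {
      std::fprintf(stderr, "WorkerHolder on the wrong thread.\n");
      std::abort();  // the real patch would MOZ_CRASH here
    }
  }

 private:
  std::thread::id mOwningThread;  // captured at construction time
};
```

The point of capturing the thread id at construction is that any later release from a different thread is detectable, which is exactly the suspected edge case.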
Flags: needinfo?(bugmail)
(In reply to Andrew Sutherland [:asuth] from comment #11)
> This needs more investigation

baku knows the worker code so maybe he can help (or a thought will be triggered by comment 11?).

> Options for blind mitigation:
> - Have RuntimeService::Cleanup dispatch a ControlRunnable at all still-alive
> workers.  The arrival of such a runnable would address the wrong-thread
> RemoveHolder call if that's what happening.

Is this something we'd want to do in 57?
Flags: needinfo?(bugmail)
Flags: needinfo?(amarchesini)
re: comment 12, off-bug we determined I'd try and look for likely WorkerHolder changes that might be involved.

From revision spelunking, changes to FetchConsumer in https://bugzilla.mozilla.org/show_bug.cgi?id=1381748 got uplifted for beta 11.  There was a notable change in behavior in part 3 of the patch at https://hg.mozilla.org/releases/mozilla-beta/rev/335fe07a77ed where the WorkerHolder's Notify method changed from synchronously removing the WorkerHolder via ContinueConsumeBody to asynchronously releasing them; the cases are described at https://bugzilla.mozilla.org/show_bug.cgi?id=1381748#c7.  It's quite possible those cases are sound, but something like bug 1395837 violates the assumed invariant (which should hopefully now be fixed by bug 1391693).

Note that this is just informed guesswork.  In particular, comment 0 identifies b12 as the likely build where things got worse, but I'm finding it really difficult to contextualize the crash rates using the crash-stats UI beyond the absolute crash numbers.  :philipp, do you have any tips or documentation links that could help me better understand your b12 conclusion and help me be better able to do such analyses myself?  Thanks!  (It's okay if the answer is "hard-won experience" ;)
Flags: needinfo?(madperson)
(In reply to Andrew Sutherland [:asuth] from comment #13)
> In particular, comment 0
> identifies b12 as the likely build where things got worse, but I'm finding
> it really difficult to contextualize the crash rates using the crash-stats
> UI beyond the absolute crash numbers.  :philipp, do you have any tips or
> documentation links that could help me better understand your b12 conclusion
> and help me be better able to do such analyses myself?  Thanks!  (It's okay
> if the answer is "hard-won experience" ;)

investigating these shutdownhang crashes is a bit odd since their signatures have often shifted around in the past - so for such an edge case i don't think there's any existing documentation, but i can try to explain how i came to the conclusion in this particular case:

i was also just starting off with the absolute number of shutdownhang reports on the beta channel with the particular proto signature & faceted by their version number:
https://crash-stats.mozilla.com/search/?proto_signature=~mozilla%3A%3Adom%3A%3Aworkers%3A%3ARuntimeService%3A%3ACleanup&release_channel=beta&submitted_from_infobar=%21__true__&product=Firefox&date=%3E%3D2017-01-01T01%3A00%3A00.000Z&_facets=signature&_facets=version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version
in there it's apparent that the number of those crashes ranged between 500-700 per version in the 56.0b cycle (56.0b6 was higher because we skipped shipping b7 for a different reason), which is already a fair bit higher than in prior cycles. then it jumped up to over 2000 in 56.0b12 & the subsequent rc builds - incidentally, revisiting this query now shows that things have much improved in 57.0b again.

the most common crash signature for the issue at the time of 56.0b was [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] - it's also possible to graph out how it developed during the 56.0b cycle and it's apparent that it started spiking up after 2017-09-17 as well: 
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=beta&proto_signature=~mozilla%3A%3Adom%3A%3Aworkers%3A%3ARuntimeService%3A%3ACleanup&submitted_from_infobar=%21__true__&signature=shutdownhang%20%7C%20NtWaitForAlertByThreadId%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS&date=%3E%3D2017-06-01T02%3A00%3A00.000Z#graphs - after bug 1402037 landed on 2017-09-27 those signatures should have mostly morphed into [@shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup]
Flags: needinfo?(madperson)
Any updates on this crash? It is the #1 top crash signature in 56, even higher than "OOM | small". It only seems to affect Windows.

The crash report URLs are the usual collection of facebook.com and youtube.com URLs.
Crash Signature: mozilla::dom::workers::RuntimeService::Cleanup ] → mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup]
I agree with asuth. I would like to fix this bug in 2 steps: I'm writing a couple of patches in order to have more data about what is keeping the workers alive. After that, we can understand which API is blocking worker shutdown.

The first step is going to happen today. My patches are almost ready.
Flags: needinfo?(amarchesini)
These 2 patches are temporary. We need something like this to find the reason for this crash. Then we can revert them, if we want.
Attachment #8927819 - Flags: review?(bugmail)
Attachment #8927822 - Flags: review?(bugmail)
Comment on attachment 8927819 [details] [diff] [review]
part 1 - WorkerHolder must assert in release

Review of attachment 8927819 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/WorkerHolder.cpp
@@ +14,5 @@
> +void
> +AssertOnOwningThread(void* aThread)
> +{
> +  if (MOZ_UNLIKELY(aThread != GetCurrentVirtualThread())) {
> +    MOZ_CRASH_UNSAFE_OOL("WorkerHolder on the wrong thread.");

Can't this just be a normal MOZ_CRASH()?
Attachment #8927819 - Flags: review?(bugmail) → review+
Attachment #8927822 - Flags: review?(bugmail) → review+
Comment on attachment 8927821 [details] [diff] [review]
part 2 - RuntimeService::CrashIfHanging()

Review of attachment 8927821 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/RuntimeService.cpp
@@ +2198,5 @@
> +  for (uint32_t i = 0; i < workers.Length(); ++i) {
> +    WorkerPrivate* workerPrivate = workers[i];
> +
> +    // BC: Busy Count
> +    msg.AppendPrintf("-BC:%d", workerPrivate->BusyCount());

nit: maybe add some whitespace?  Like " -BC:%d "?

@@ +2208,5 @@
> +    msg.Append(runnable->MsgData());
> +  }
> +
> +  // This string will be leaked.
> +  MOZ_CRASH_UNSAFE_OOL(strdup(msg.BeginReading()));

Data review request is here:  This string is a computed list of WorkerHolder types (from part 3, all of which are string literals like "XMLHttpRequestWorker") plus the above worker counts *for things that failed to shut down* properly.

We'd expect this to look like the following as currently implemented in the patches:
"Workers Hanging - A:2|S:0|Q:0-BC:1XMLHttpRequestWorker-BC2:BroadcastChannelWorkerHolder|WebSocketWorkerHolder"

No origins will be explicitly revealed.  This is unlikely to aid in fingerprinting from crash stats unless it turns out there's a single website with a very distinctive set of operations that trigger this phenomenon.  Based on the URL's we've seen in the crash reports, we do not believe this to be the case, with this being a somewhat widespread problem, origin-wise.
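The annotation format described above can be approximated standalone. This is a sketch only: the real patch builds an nsCString with AppendPrintf inside RuntimeService::CrashIfHanging and hands it to MOZ_CRASH_UNSAFE_OOL, while ComposeHangMessage and WorkerInfo here are hypothetical names:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// One entry per worker that failed to shut down: its busy count plus the
// names of the WorkerHolders still keeping it alive.
struct WorkerInfo {
  int busyCount;
  std::vector<std::string> holderNames;
};

// Compose "Workers Hanging - A:<active>|S:<serviceWorkers>|Q:<queued>"
// followed by "-BC:<n><holder>|<holder>..." per hanging worker, matching
// the example string quoted above.
std::string ComposeHangMessage(int aActive, int aServiceWorkers, int aQueued,
                               const std::vector<WorkerInfo>& aWorkers) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "Workers Hanging - A:%d|S:%d|Q:%d",
                aActive, aServiceWorkers, aQueued);
  std::string msg = buf;
  for (const WorkerInfo& w : aWorkers) {
    std::snprintf(buf, sizeof(buf), "-BC:%d", w.busyCount);
    msg += buf;
    for (size_t i = 0; i < w.holderNames.size(); ++i) {
      if (i) msg += "|";
      msg += w.holderNames[i];
    }
  }
  return msg;
}
```

Because only literal holder type names and counts are concatenated, the resulting annotation carries no origin or user data, which is the basis of the data-review argument above.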
Attachment #8927821 - Flags: review?(francois)
Attachment #8927821 - Flags: review?(bugmail)
Attachment #8927821 - Flags: review+
Comment on attachment 8927821 [details] [diff] [review]
part 2 - RuntimeService::CrashIfHanging()

Review of attachment 8927821 [details] [diff] [review]:
-----------------------------------------------------------------

datareview+

::: dom/workers/RuntimeService.cpp
@@ +2208,5 @@
> +    msg.Append(runnable->MsgData());
> +  }
> +
> +  // This string will be leaked.
> +  MOZ_CRASH_UNSAFE_OOL(strdup(msg.BeginReading()));

Based on this description, this looks like Category 1 data.
Attachment #8927821 - Flags: review?(francois) → review+
I think we're all good to land these other than the part 1 nit.  Setting NI to land or indicate next step.
Flags: needinfo?(bugmail) → needinfo?(amarchesini)
Last push to try before landing... it should be in m-i tonight, tomorrow morning at the latest.
Flags: needinfo?(amarchesini)
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3d7896bf89f7
Improve logging of workers when shutting down - part 1 - WorkerHolder must assert in release, r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb508b270fab
Improve logging of workers when shutting down - part 2 - Implement RuntimeService::CrashIfHanging(), r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/dec3b1ceaa9b
Improve logging of workers when shutting down - part 3 - WorkerHolder with names, r=asuth
baku suggested we uplift these patches to 52 to get diagnostic messages from there, too, since it's different enough from the majority-e10s-multi users in later releases.
Comment on attachment 8927819 [details] [diff] [review]
part 1 - WorkerHolder must assert in release

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
User impact if declined: This set of patches helps with debugging workers that hang at shutdown. But they are not really useful for e10s in production because the crash happens only if the hang is in the parent process. 52 is still non-e10s, and it would be nice to collect data there in order to fix the 'real' bug.
Fix Landed on Version: 59
Risk to taking this patch (and alternatives if risky): low. We collect data and we use it to compose a string for the crash report.
String or UUID changes made by this patch:  none
Attachment #8927819 - Flags: approval-mozilla-esr52?
We've gotten 3 crashes so far, all Windows:
* bp-b8699baf-24a4-4e98-a24e-a94900171116
  Workers Hanging - A:2|S:0|Q:0-BC:0-BC:0
  Both active workers in WaitForWorkerEvents().  There's also one "DOM Worker" thread (thread 40) whose WorkerThreadPrimaryRunnable::Run() would seem to have successfully exited and is waiting in the normal thread event loop.
* bp-3e8a8d0c-c816-4606-bbdc-5915d0171116
  Workers Hanging - A:3|S:0|Q:0-BC:0-BC:0-BC:0
* bp-83ec8bb9-241b-4f88-832d-02ef40171116
  Workers Hanging - A:3|S:0|Q:0-BC:1-BC:0-BC:0
  The busy thread appears to be a ChromeWorker in js::ctypes code performing a CreateFileW call that's still pending on the underlying NtCreateFile call.
Crash Signature: mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] → mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [mozilla::dom::workers::RuntimeService::CrashIfHanging]
Crash Signature: mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [mozilla::dom::workers::RuntimeService::CrashIfHanging] → mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging]
Should this have been marked leave-open?
Flags: needinfo?(amarchesini)
The signature changed. But yes, let's keep it open.
Flags: needinfo?(amarchesini)
https://crash-stats.mozilla.com/report/index/b8699baf-24a4-4e98-a24e-a94900171116 is interesting:

1. The main-thread is spinning the event loop here: https://hg.mozilla.org/mozilla-central/annotate/f41930a869a8/dom/storage/StorageObserver.cpp#l405 - it's waiting for PBackground to nullify a couple of pointers.
2. But the PBackground thread is not processing runnables because it's blocked here: https://hg.mozilla.org/mozilla-central/annotate/f41930a869a8/dom/media/systemservices/CamerasParent.cpp#l200
3. The VideoCapture thread is not processing the runnable because it is blocked here: https://hg.mozilla.org/mozilla-central/annotate/f41930a869a8/media/webrtc/trunk/webrtc/modules/video_capture/device_info_impl.cc#l34

jesup, can you take a look?

Note: I don't think this is _the_ reason why we have a Worker hanging, but it's important to investigate this deadlock. If you prefer I can file a separate bug.
Flags: needinfo?(rjesup)
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Target Milestone: mozilla59 → ---
Also https://crash-stats.mozilla.com/report/index/83ec8bb9-241b-4f88-832d-02ef40171116 shows an interesting behavior.
Here PBackground is blocked waiting for the QuotaManager IO thread to shut down, and the QuotaManager IO thread is busy doing... IO.
Janv, can you check whether this is expected to happen during shutdown?
Flags: needinfo?(jvarga)
Munro, jib: this is relevant to munro's work on CamerasParent.  Looks like some form of deadlock.

I see no webrtc.org threads alive; the library seems shut down.

Looking at the code, device_info_impl.cc appears to have at least one hole, though probably not the cause: the `return -1;` at line 146 should have ReleaseLockExclusive()/AcquireLockShared() before it (like line 101).

That's an easy fix, and CreateCapabilityMap certainly can fail.  Worth fixing, and probably uplifting to 58.
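The hole described above can be sketched with std::shared_mutex standing in for webrtc's API lock (all names here are hypothetical; the real code uses a scoped read lock, so the real fix restores the shared lock before returning, while this flattened sketch just releases the exclusive lock). The bug shape: an early `return -1;` while the exclusive lock is held leaks the lock, so every later acquisition deadlocks:

```cpp
#include <cassert>
#include <shared_mutex>

struct DeviceInfoSketch {
  std::shared_mutex mApiLock;  // stands in for the webrtc RW lock
  bool mMapValid = false;

  int NumberOfCapabilities(bool aCreateMapFails) {
    mApiLock.lock_shared();        // AcquireLockShared()
    if (!mMapValid) {
      // Upgrade to exclusive to (re)build the capability map.
      mApiLock.unlock_shared();    // ReleaseLockShared()
      mApiLock.lock();             // AcquireLockExclusive()
      if (aCreateMapFails) {
        mApiLock.unlock();         // the fix: release before the early return
        return -1;                 // (the bug returned with the lock held)
      }
      mMapValid = true;
      mApiLock.unlock();           // ReleaseLockExclusive()
      mApiLock.lock_shared();      // AcquireLockShared()
    }
    mApiLock.unlock_shared();
    return 1;
  }
};
```

With the lock leaked on the failure path, a subsequent caller would block forever, which is exactly the kind of silent shutdown deadlock this bug is hunting.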
Flags: needinfo?(rjesup)
Flags: needinfo?(mchiang)
Flags: needinfo?(jib)
Munro, can you take a look?
Flags: needinfo?(jib)
Created separate bug 1418871 to track comment 32.
Flags: needinfo?(mchiang)
See Also: → 1408629
I think this bug is related to the bugs generating crash signatures of [@ IPCError-browser | ShutDownKill ] with processor notes "Signature replaced with an IPC Channel Error, was ______.". 

I had the following crash reports generated while updating Firefox Nightly. 

Therefore, Bug 1405290 blocks Bug 1279293 (IPCError_ShutDownKill)
(Or could be changed to "See Also".)

And See Also Bug 1424451.

2017-12-10 2138pm Firefox Nightly crashed while updating.

Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Firefox 59.0a1 Crash Report [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ]

MOZ_CRASH Reason 	Workers Hanging - A:3|S:0|Q:0-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken

These crash reports were generated at the same time, but were manually submitted a minute apart from each other.

bp-f25eb0e9-d35f-44c8-90f9-e5b7f0171211
	12/10/2017	8:46 PM
bp-40531424-889a-42b4-a8f2-7010a0171211
	12/10/2017	8:45 PM
bp-e236ffa2-0ab1-453b-bd87-8c2f80171211
	12/10/2017	8:44 PM
bp-0000c897-5a75-4397-9cd3-5edd10171211
	12/10/2017	8:43 PM
bp-1a1b6cf1-0022-4dd6-b8df-66f400171211
	12/10/2017	8:42 PM
bp-ab36ebe6-53f4-4fe0-ba77-b41ee0171211
	12/10/2017	8:41 PM
bp-4981d798-cfa1-45a8-8493-38e390171211
	12/10/2017	8:38 PM
	
bp-f25eb0e9-d35f-44c8-90f9-e5b7f0171211
	12/10/2017	8:46 PM
Firefox 59.0a1 Crash Report [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ]
Crash Reason 	EXCEPTION_BREAKPOINT

MOZ_CRASH Reason 	Workers Hanging - A:3|S:0|Q:0-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken

Crashing Thread (64), Name: Shutdown Hang Terminator
Frame 	Module 	Signature 	Source
0 	mozglue.dll 	MOZ_CrashOOL 	mfbt/Assertions.cpp:33
1 	xul.dll 	mozilla::dom::workers::RuntimeService::CrashIfHanging() 	dom/workers/RuntimeService.cpp:2191
2 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog 	toolkit/components/terminator/nsTerminator.cpp:162
3 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397	
	
	
bp-40531424-889a-42b4-a8f2-7010a0171211
	12/10/2017	8:45 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason 	EXCEPTION_BREAKPOINT

Signature replaced with an IPC Channel Error, was: "style::rule_tree::StrongRuleNode::gc"

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	style::rule_tree::StrongRuleNode::gc 	servo/components/style/rule_tree/mod.rs:1061
1 	xul.dll 	geckoservo::glue::Servo_StyleSet_Drop 	servo/ports/geckolib/glue.rs:2415
2 	xul.dll 	mozilla::ServoStyleSet::Shutdown() 	layout/style/ServoStyleSet.cpp:180
3 	xul.dll 	mozilla::PresShell::Destroy() 	layout/base/PresShell.cpp:1360	
	
	
bp-e236ffa2-0ab1-453b-bd87-8c2f80171211
	12/10/2017	8:44 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason 	EXCEPTION_BREAKPOINT

Signature replaced with an IPC Channel Error, was: "RequestBehaviour::GetProgressTracker"

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	RequestBehaviour::GetProgressTracker() 	image/imgRequestProxy.cpp:95
1 	xul.dll 	RequestBehaviour::GetImage() 	image/imgRequestProxy.cpp:82
2 	xul.dll 	imgRequestProxy::RequestDiscard() 	image/imgRequestProxy.cpp:604
3 	xul.dll 	mozilla::dom::ImageTracker::RequestDiscardAll() 	dom/base/ImageTracker.cpp:157
	
	
bp-0000c897-5a75-4397-9cd3-5edd10171211
	12/10/2017	8:43 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason 	EXCEPTION_BREAKPOINT

Signature replaced with an IPC Channel Error, was: "RtlSafeRemoveEntryList | RtlDeleteCriticalSection | mozilla::detail::MutexImpl::~MutexImpl | mozilla::gfx::SourceSurfaceVolatileData::~SourceSurfaceVolatileData"

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	RtlSafeRemoveEntryList 	
1 	ntdll.dll 	RtlDeleteCriticalSection 	
2 	mozglue.dll 	mozilla::detail::MutexImpl::~MutexImpl() 	mozglue/misc/Mutex_windows.cpp:36
3 	xul.dll 	mozilla::gfx::SourceSurfaceVolatileData::~SourceSurfaceVolatileData() 	gfx/layers/SourceSurfaceVolatileData.h:101 
	
	
bp-1a1b6cf1-0022-4dd6-b8df-66f400171211
	12/10/2017	8:42 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason 	EXCEPTION_BREAKPOINT

Signature replaced with an IPC Channel Error, was: "servo_arc::Arc<T>::drop_slow<T>"

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	servo_arc::Arc<style::gecko_properties::ComputedValues>::drop_slow<style::gecko_properties::ComputedValues> 	servo/components/servo_arc/lib.rs:254
1 	xul.dll 	geckoservo::glue::Servo_Element_ClearData 	servo/ports/geckolib/glue.rs:943
2 	xul.dll 	mozilla::dom::Element::ClearServoData(nsIDocument*) 	dom/base/Element.cpp:4418	
	
	
	
bp-ab36ebe6-53f4-4fe0-ba77-b41ee0171211
	12/10/2017	8:41 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason 	EXCEPTION_BREAKPOINT

Signature replaced with an IPC Channel Error, was: "mozilla::widget::WinUtils::WaitForMessage | nsAppShell::ProcessNextNativeEvent"

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	NtWaitForMultipleObjects 	
1 	kernelbase.dll 	WaitForMultipleObjectsEx 	
2 	kernel32.dll 	WaitForMultipleObjectsExImplementation 	
3 	user32.dll 	RealMsgWaitForMultipleObjectsEx 	
	
	
bp-4981d798-cfa1-45a8-8493-38e390171211
	12/10/2017	8:38 PM	
Firefox 59.0a1 Crash Report [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ]
Crash Reason 	EXCEPTION_BREAKPOINT

User Comments: Firefox Nightly crashed while updating.

MOZ_CRASH Reason 	Workers Hanging - A:3|S:0|Q:0-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken

Crashing Thread (64), Name: Shutdown Hang Terminator
Frame 	Module 	Signature 	Source
0 	mozglue.dll 	MOZ_CrashOOL 	mfbt/Assertions.cpp:33
1 	xul.dll 	mozilla::dom::workers::RuntimeService::CrashIfHanging() 	dom/workers/RuntimeService.cpp:2191
2 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog 	toolkit/components/terminator/nsTerminator.cpp:162
See Also: → 1424451
Hi Baku, should we still consider uplifting this to ESR52? I am just reviewing the uplift nom which you made in mid-Nov and I noticed that this helped, but that there may be other crash signatures that still need fixing. Also, do all 3 patches need to land in ESR52 or just the one you nom'd? Thanks!
Flags: needinfo?(amarchesini)
(In reply to Ritu Kothari (:ritu) from comment #38)
> Hi Baku, should we still consider uplifting this to ESR52?

You are right. We don't need it. We have enough reports.
Flags: needinfo?(amarchesini)
Attachment #8927819 - Flags: review+
Attachment #8927819 - Flags: approval-mozilla-esr52?
[@ mozilla::dom::workers::RuntimeService::CrashIfHanging] accounts for 25% of browser crashes on 59.0b2.
baku, can you take a look at the crashes in 59.0b ? 
This is showing up in early (Developer Edition) 59 releases as the top crash by far (as philipp points out). 
I'll also email Jan.
Flags: needinfo?(amarchesini)
Keywords: topcrash
Ok, I'll take a look too.
Flags: needinfo?(jvarga)
hm, by looking closer into crashes during shutdown progress and comparing the situation with 58.0b, it appears that the [@ mozilla::dom::workers::RuntimeService::CrashIfHanging] signature may not represent new crashes, but may just be "swallowing up" nearly all signatures that would have shown up as [@ shutdownhang...] in prior cycles.

this query would be comparing signatures during shutdown in 58.0b99 and 59.0b3:
https://mozilla.github.io/stab-crashes/scomp.html?limit=100&common=product%3DFirefox%26submitted_from_infobar%3D!__true__%26shutdown_progress%3D!__null__%26process_type%3Dbrowser&p1=version%3D58.0b99%26date%3D%3E2018-01-01&p2=version%3D59.0b3%26date%3D%3E2018-01-10

the share of crashes during shutdown compared to all other browser crashes doesn't seem to be worse in 59.0b compared to 58.0b. 

i'm not sure if summing up most shutdownhangs under one signature was the intended consequence of the patch here though...
OK, I'm not considering this a blocker for 59 betas or release, then.
I've been debugging this issue a lot lately. Checking the crash reports, I would say that Firefox often crashes because some component blocks the main thread, so the worker shutdown is not able to complete.

Some important improvements happened with the introduction of nsIThreadManager::spinEventLoopUntilOrShutdown, but there is a lot more to do. For instance:

https://crash-stats.mozilla.com/report/index/8e9a5c50-93c3-4718-971c-3a1290180131
Here https://hg.mozilla.org/releases/mozilla-beta/annotate/64737c752ac4/gfx/layers/ipc/ImageBridgeChild.cpp#l643 is blocking the main-thread. The workers are probably already gone, but RuntimeService doesn't receive the notification.

https://crash-stats.mozilla.com/report/index/95de2dc0-f706-4d34-b1a3-5fab10180131#allthreads
Also here we have an event loop spinning while shutting down the transport thread: https://hg.mozilla.org/releases/mozilla-beta/annotate/64737c752ac4/netwerk/base/nsIOService.cpp#l1122

https://crash-stats.mozilla.com/report/index/93e63bad-dfd4-4830-b523-770200180131#allthreads
Same here: https://hg.mozilla.org/releases/mozilla-beta/annotate/64737c752ac4/dom/media/gmp/GMPServiceParent.cpp#l302

I wonder if the solution is to move the shutdown of workers to a separate thread.
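The three crash reports above share one shape: a thread waits unboundedly on a condition that can never become true. The spirit of bounding such waits (as spinEventLoopUntilOrShutdown does for the event loop) can be sketched with a condition variable in place of a real event loop; WaitUntilOrDeadline is a hypothetical name, not a Gecko API:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Wait for a predicate, but give up at a deadline instead of hanging
// forever. Returns true iff the predicate held before the deadline.
bool WaitUntilOrDeadline(const std::function<bool()>& aPredicate,
                         std::chrono::milliseconds aTimeout) {
  std::mutex m;
  std::condition_variable cv;
  std::unique_lock<std::mutex> lock(m);
  // Nothing notifies cv here: wait_until simply bounds how long we block,
  // the way the shutdown terminator bounds a real shutdown.
  return cv.wait_until(lock, std::chrono::steady_clock::now() + aTimeout,
                       aPredicate);
}
```

A bounded wait converts a silent shutdownhang into either forward progress or a diagnosable timeout, which is the trade made by CrashIfHanging.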
Flags: needinfo?(amarchesini)
Note that these 3 crash-reports are just the 3 most recent ones.
Attached patch will.patchSplinter Review
There are several issues here, but one of them is that worker shutdown doesn't start because a previous observer spins the event loop and hangs there.

Here is a patch where worker shutdown starts on NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID.
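The idea behind that patch can be sketched with a toy observer service (ObserverServiceSketch and the topic strings here are illustrative, not Gecko's real implementation): an observer registered for the earlier notification is guaranteed to run before any later-topic observer gets a chance to spin the event loop.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy observer service: observers are registered per topic and notified in
// topic order by the caller.
struct ObserverServiceSketch {
  std::map<std::string, std::vector<std::function<void()>>> mObservers;

  void AddObserver(const std::string& aTopic, std::function<void()> aFn) {
    mObservers[aTopic].push_back(std::move(aFn));
  }

  void NotifyObservers(const std::string& aTopic) {
    for (auto& fn : mObservers[aTopic]) {
      fn();
    }
  }
};
```

Starting worker shutdown on the earlier "will-shutdown" topic means a later observer that blocks the main thread can no longer prevent the shutdown from ever beginning; it can only delay its completion.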
Attachment #8947091 - Flags: review?(bugs)
I'm lost here. Something landed two months ago, and now something else is about to land too?
Can we please open a new bug?
Comment on attachment 8947091 [details] [diff] [review]
will.patch

This patch as such should be fine. But please don't land in this bug.
It is really confusing to keep landing patches from the same bug over time.
Attachment #8947091 - Flags: review?(bugs) → review+
Depends on: 1434618
Just a side note: there used to be a problem with quota manager shutdown, and we solved it by introducing the "profile-before-change-qm" notification (this one is fired after "profile-before-change"). So any code that uses storage should listen for "profile-before-change", and then the quota manager shuts down at "profile-before-change-qm", when nothing should be initiating new storage requests.
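The ordering described here can be sketched with a toy observer service in standalone C++. This is illustrative only; the real nsIObserverService works differently, and the consumer names are made up:

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

using Topic = std::string;

// Minimal observer service: callbacks registered per topic, fired in order
// when the topic is notified.
struct ObserverService {
    std::multimap<Topic, std::function<void()>> mObservers;
    void AddObserver(const Topic& t, std::function<void()> cb) {
        mObservers.emplace(t, std::move(cb));
    }
    void Notify(const Topic& t) {
        auto range = mObservers.equal_range(t);
        for (auto it = range.first; it != range.second; ++it) it->second();
    }
};

std::vector<std::string> RunShutdown() {
    std::vector<std::string> log;
    ObserverService obs;
    // Anything using storage stops issuing requests at profile-before-change...
    obs.AddObserver("profile-before-change",
                    [&] { log.push_back("storage consumer: stop issuing requests"); });
    // ...and the quota manager tears down afterwards, when no new requests
    // can arrive.
    obs.AddObserver("profile-before-change-qm",
                    [&] { log.push_back("QuotaManager: shutdown"); });

    obs.Notify("profile-before-change");     // fired first
    obs.Notify("profile-before-change-qm");  // fired after
    return log;
}
```

The design point is simply that the second phase gives the quota manager a window in which no well-behaved consumer can start new work.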
(In reply to Jan Varga [:janv] from comment #52)
> Just a side note, that there use to be a problem with quota manager shutdown
> and we solved it by introducing "profile-before-change-qm" notification
> (this one is fired after "profile-before-change". So any stuff that uses
> storage should listen for "profile-before-change" and then quota manager
> shutdowns at "profile-before-change-qm" when nothing should be initiating
> new storage requests.

This is basically what I'm doing here. But this is not the final fix, because if some code blocks the main thread, the worker runnables dispatched to the main thread are not processed.
Depends on: 1422036
Here is a report on the state of this crash:

There are mainly 2 reasons why we have this crash signature:
a. there is a bug in how worker code is used;
b. there is something blocking the worker shutdown: I often see components receiving the xpcom-shutdown notification and spinning the event loop. This event loop never ends, and workerinternals::RuntimeService doesn't receive the xpcom-shutdown notification. This means that the shutdown of workers simply doesn't start.

Related to the first point, I made some improvements. It seems that the main crash reports are related to 'fetch'. In bug 1422036 there is an important patch that should fix them.

About something blocking the worker shutdown, I'm working on bug 1434618, where the shutdown of workers starts a bit before xpcom-shutdown. This should fix the spinning-event-loop issue that we see often. See bug 1435343.

The signature has now changed because I recently removed the 'workers' namespace. We should continue the discussion in bug 1435343.
See Also: → 1434189
Depends on: 1436995
Blocks: 1436995
No longer depends on: 1436995
Blocks: 1437575
Whiteboard: [tbird crash]
(In reply to Andrea Marchesini [:baku] from comment #54)
> Here a report about the state of this crash:
> 
> There are mainly 2 reasons why we have this crash signature:
> a. there is a bug in how worker code is used;
> b...
> 
> Related to the first point, I did some improvements. It seems that the main
> crash reports are related to 'fetch'. In bug 1422036 there is a important
> patch that should fix them.

bug 1422036 got uplifted to 59.0b8, but there wasn't any apparent decline in the [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ] crash signature.
We don't have any additional reports where the Fetch WorkerHolder keeps workers alive. Unfortunately, that was only 10-15% of the hangs. The rest is related to the bug dependencies of bug 1435343.
I recently landed bug 1437575 on nightly, which makes this crash report basically disappear.
This doesn't mean that the issue is fixed! What is behind this crash report is a component blocking the main thread. With the main thread blocked, workers are not able to complete the shutdown procedure and Firefox hangs/crashes.
Before bug 1437575 we didn't have a way to detect this issue, and it showed up as a worker hanging.

I filed bugs for each component blocking the main thread, following the crash reports. Here is the list: 1435958, 1435960, 1435961, 1435962, 1435966, 1435963, 1435964
Several (such as the URLClassifier) appear to be due to using SyncRunnables/DISPATCH_SYNC from the main thread, probably because they need to wait for some other thread to do something during shutdown, and if they return from Observe() the shutdown sequence will continue.

Instead, anything which needs to do other-thread actions or wait should use an AsyncShutdown blocker (nsIAsyncShutdownClient, etc.) - you can see this in action in C++ code in MediaManager.cpp. Then you don't need to use a sync message from the shutdown observer/etc.
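The blocker pattern recommended here can be sketched with standard C++ primitives. This is only the shape of the idea, not the real nsIAsyncShutdownClient API; the ShutdownBarrier class and all names are invented for illustration:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy shutdown barrier: components add a blocker when shutdown begins and
// remove it when their async work completes; shutdown proceeds only once
// every blocker is gone.
class ShutdownBarrier {
    std::mutex mMutex;
    std::condition_variable mCond;
    int mBlockers = 0;
public:
    void AddBlocker() {
        std::lock_guard<std::mutex> lock(mMutex);
        ++mBlockers;
    }
    void RemoveBlocker() {
        std::lock_guard<std::mutex> lock(mMutex);
        --mBlockers;
        mCond.notify_all();
    }
    void Wait() {
        std::unique_lock<std::mutex> lock(mMutex);
        mCond.wait(lock, [this] { return mBlockers == 0; });
    }
};

bool RunExample() {
    ShutdownBarrier barrier;
    bool cleanedUp = false;
    barrier.AddBlocker();                 // registered when shutdown begins
    std::thread worker([&] {
        cleanedUp = true;                 // e.g. flush state on another thread
        barrier.RemoveBlocker();          // signal: safe to continue shutdown
    });
    barrier.Wait();                       // no SyncRunnable/DISPATCH_SYNC
    worker.join();
    return cleanedUp;
}
```

The difference from the sync-dispatch pattern is that the component never makes the main thread synchronously wait inside Observe(); completion is reported back asynchronously, and the shutdown service owns the waiting.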
Crash numbers are very high for beta - our top crash. So this may cause problems on release (based on the user comments on these crash signatures)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #59)
> Crash numbers are very high for beta - our top crash. So this may cause
> problems on release (based on the user comments on these crash signatures)

This signature is going down in 60 for 2 reasons:
. it has been renamed to mozilla::dom::workerinternals::RuntimeService::CrashIfHanging, and that signature is also going down quickly (see the graph in bug 1435343);
. bug 1437575 landed to give more information about what is happening during shutdown.

Maybe we want to uplift bug 1437575 to know what is going on in beta.
Flags: needinfo?(lhenry)
Blocks: 1439893
Flags: needinfo?(lhenry)
(In reply to Andrew Overholt [:overholt] from comment #7)
> Can we get some common URLs people are hitting this with? I know it's a
> shutdown issue but without an STR or sites to reproduce it's hard to get
> anywhere here :)

2018-03-03> Firefox crashed while shutting down. Menu, Alt+F X (Exit). It took forever to shut down. A few minutes. The firefox process was not responding, and then in the suspended state (as per windows resource monitor). It eventually closed and generated this crash report.

---

During shutdown, these crash reports span a 4-minute period.

Unsubmitted Crash Reports
Report ID 	Date Crashed
e18ea154-fab4-4b76-95c3-a93d5d444724 	03/03/18 04:14
d634690f-dfce-4ee8-87d1-e5d56b23e723 	03/03/18 04:13
90c0b448-1a94-4cb3-b713-e90bdbcc7c8c 	03/03/18 04:12
da0e0ec9-8a7c-454b-bbf8-f2d7fd29258f 	03/03/18 04:12
dc8c503d-ade1-48a1-97e4-c6ea38ff6231 	03/03/18 04:12
ed32682c-8948-4be8-9493-27cc4093764b 	03/03/18 04:11

Report ID 	Date Submitted
bp-5f249501-2f81-4923-8bc2-baa3b0180303 	03/03/18 04:51
bp-527a809c-9311-40a9-ace0-9c8880180303 	03/03/18 04:50
bp-5f0bf39d-b594-432d-a07f-cad250180303 	03/03/18 04:47
bp-ca5c3e00-f062-41cb-b7fb-e515d0180303 	03/03/18 04:46
bp-7b606d9f-5861-46c7-8c63-249340180303 	03/03/18 04:45
bp-de84a4f6-7020-490b-aaf7-2fab30180303 	03/03/18 04:35
---	

AvailablePageFile: 4170256384
AvailablePhysicalMemory: 2098941952
AvailableVirtualMemory: 8792240664576

Comments: Firefox crashed while shutting down. Menu, Alt+F X (Exit). It took forever to shut down. A few minutes. The firefox process was not responding, and then in the suspended state (as per windows resource monitor). It eventually closed and generated this crash report.

MozCrashReason: Workers Hanging - A:1|S:0|Q:0-BC:2Dispatch Error

ProductName: Firefox
ReleaseChannel: beta
SafeMode: 0
SecondsSinceLastCrash: 1759336
ShutdownProgress: xpcom-shutdown

TotalPageFile: 8273096704
TotalPhysicalMemory: 4080713728
TotalVirtualMemory: 8796092891136
URL: https://flightaware.com/live/airport_status_bigmap.rvt?airport=CYWG

Version: 59.0
(In reply to [:philipp] from comment #43)
> hm, by looking closer into crashes during shutdown progress and comparing
> the situation with 58.0b it appears that the [@
> mozilla::dom::workers::RuntimeService::CrashIfHanging] signature may not be
> new crashes, but just "swallowing up" nearly all signatures that would have
> shown up as [@ shutdownhang...] in prior cycles.


I agree. And I think this has been a problem for a few years now. 

It would be great to see it resolved. I think the users affected find it very annoying. I certainly do. 

Blocks: 1333409, 1368983, 1308629, 1401481, 1186276, 1419108, 1248837
See Also: 1346860, 1233757

Blocks:

Bug 1333409
Bad interaction between the shutdownhang kill timer and the ContentParent kill timer
NEW Unassigned

Bug 1368983
Increased parent process shutdown crash rate in Firefox 53.0.3. Crash in [@ shutdownhang | ____]
NEW Assigned to baku

Bug 1308629
Firefox goes Not Responding (or crashes) on Quit
NEW Unassigned

Bug 1401481
[Crash in shutdownhang | js::Thread::join] MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.)
NEW Unassigned

Bug 1186276
Firefox keeps crashing: shutdown hang in the storage service
NEW Unassigned

Bug 1419108
Crash in shutdownhang | RtlAnsiStringToUnicodeString | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn

Bug 1248837
shutdownhang in nsThread::Shutdown() | Firefox crashes, sometimes the restart fails
REOPENED Unassigned


See Also:

Bug 1346860
Different signatures for same shutdownhang issues on release & beta
NEW Unassigned

Bug 1233757
Make shutdownhang crashes less annoying
NEW Unassigned
Blocks: 1415837
Blocks: 1424451
See Also: 1424451
I don't think we have anything actionable here for 59, which will release next week. 
Maybe the diagnostic patch will give us results that are useful for 60.
Nothing to do for 56. The fix landed in 60.
Assignee: amarchesini → nobody
Too late for Fx59 at this point, but definitely want to keep this on the radar for 60/61 still.
Interestingly, we're not seeing any hits for this on 60 on Beta so far. Any idea what might have fixed it?
Flags: needinfo?(amarchesini)
If this is fixed in 60, close this bug as resolved then.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #66)
> Interestingly, we're not seeing any hits for this on 60 on Beta so far. Any
> idea what might have fixed it?

On beta/nightly this issue is not fixed, just correctly reported. Let me explain it a bit more:

There are some components that spin the event loop on the main thread during shutdown. This happens when the shutdown notifications are emitted (xpcom-shutdown, xpcom-will-shutdown, xpcom-shutdown-threads and so on).
Sometimes some of these components never exit this spinning, which means that other components do not receive the shutdown notifications.

Our hang monitor has a specific check for web workers, to be sure that all the web workers are correctly terminated on shutdown. But if a component blocks the main thread, spinning the event loop forever, the web worker shutdown doesn't start because the notifications are never received.

So, the hang monitor reports this as a web worker error.

In bug 1437575 I improved the hang monitor to report unterminated shutdown notifications. This showed that often it was not a web worker issue but something else: 1434618, 1435958, 1435960, 1435961, 1435962, 1435966, 1418871, 1422036, 1435963, 1435964.
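A standalone sketch of the kind of reporting just described: track the last shutdown notification actually observed, so a hang annotates the phase that got stuck instead of blaming the worker shutdown that never started. The class, the annotation strings, and the phase list here are illustrative, not the real bug 1437575 implementation:

```cpp
#include <string>
#include <vector>

// Records each shutdown notification as it is observed; the hang monitor
// can then annotate a crash with the last phase actually reached.
struct ShutdownTracker {
    std::vector<std::string> mExpected{
        "xpcom-will-shutdown", "xpcom-shutdown", "xpcom-shutdown-threads"};
    std::string mLastObserved;

    void Observe(const std::string& aTopic) { mLastObserved = aTopic; }

    // Illustrative annotation: only blame workers if every notification
    // arrived, i.e. the worker shutdown really got its turn.
    std::string CrashAnnotation() const {
        if (mLastObserved == mExpected.back())
            return "Workers Hanging";
        return "Shutdown stuck after: " + mLastObserved;
    }
};
```

With this in place, a component spinning the event loop at xpcom-shutdown produces an annotation naming that phase, which is how the component bugs listed above were identified.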

I filed, as you see, several bugs and I alerted the owners/peers of those components.
I also fixed a couple of web worker issues.

The situation is better now, but we can do more. In particular, bug 1435958.

I'm marking this bug as a duplicate of bug 1437575.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Flags: needinfo?(amarchesini)
Resolution: --- → DUPLICATE
Thanks for the explanation. Doesn't sound like this bug is going to be actionable for ESR52 either, which is unfortunate. Over 1000 crash reports in the last week there still too :(
Assignee: nobody → amarchesini
Flags: needinfo?(amarchesini)
I could try to port some of the patches to ESR52, but note that those patches do not fix the problem; they just expose it in a better way.
Flags: needinfo?(amarchesini)
No longer depends on: 1435958
Depends on: 1356853
Blocks: 1633342
No longer blocks: 1633342