Open Bug 1435343 Opened 7 years ago Updated 6 days ago

Crash in [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging]. Shutdown problem in workers.

Categories

(Core :: DOM: Workers, defect, P3)

defect

Tracking

()

ASSIGNED
Tracking Status
firefox-esr52 --- wontfix
firefox-esr60 --- wontfix
firefox-esr78 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 - wontfix
firefox82 --- wontfix
firefox83 --- wontfix
firefox84 --- wontfix
firefox85 --- wontfix
firefox86 --- fix-optional

People

(Reporter: mccr8, Assigned: jstutte)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(4 keywords, Whiteboard: [DWS_NEXT][stockwell unknown][tbird topcrash],qa-not-actionable)

Crash Data

Attachments

(1 obsolete file)

This bug was filed from the Socorro interface and is about
report bp-5a9b14a4-456a-4502-ae80-f61c10180202.
=============================================================

Top 8 frames of crashing thread:

0 mozglue.dll MOZ_CrashOOL mfbt/Assertions.cpp:33
1 xul.dll mozilla::dom::workerinternals::RuntimeService::CrashIfHanging dom/workers/RuntimeService.cpp:2014
2 xul.dll mozilla::`anonymous namespace'::RunWatchdog toolkit/components/terminator/nsTerminator.cpp:162
3 nss3.dll PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c:397
4 nss3.dll pr_root nsprpub/pr/src/md/windows/w95thred.c:137
5 ucrtbase.dll __crt_stdio_output::crop_zeroes 
6 kernel32.dll BaseThreadInitThunk 
7 ntdll.dll RtlUserThreadStart 

=============================================================

Number 12 Windows top crash on the 2/1 build, from a number of different installations.
baku, any ideas?
Flags: needinfo?(amarchesini)
This is a shutdown problem in workers. It's my top priority for this week.
Flags: needinfo?(amarchesini)
In this particular crash, the bug is not in workers. This is the crash message:

Workers Hanging - 0|A:3|S:0|Q:0-BC:0|...
                  ^
                  this 0 means that RuntimeService has not received the xpcom-shutdown notification yet.

This is happening because netwerk/cache2/CacheFileIOManager.cpp:4156 is blocking the main-thread doing some I/O.

jduell, can this operation be done on an I/O thread instead of the main one?
Flags: needinfo?(jduell.mcbugs)
Just to be clear, the worker shutdown happens when xpcom-shutdown is received, but the crash report shows that this operation has not started yet.

https://dxr.mozilla.org/mozilla-central/source/dom/workers/RuntimeService.cpp#1996

mShuttingDown is set here: https://dxr.mozilla.org/mozilla-central/source/dom/workers/RuntimeService.cpp#1852

by:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/RuntimeService.cpp#2611-2614
Assignee: nobody → amarchesini
Priority: -- → P1
Hoping Michal or Honza can answer comment #3
Flags: needinfo?(jduell.mcbugs) → needinfo?(michal.novotny)
Flags: needinfo?(honzab.moz)
There are other related crashes. Here are a few:

https://crash-stats.mozilla.com/report/index/b924b1db-a013-4b0a-a137-98f910180206#allthreads
here we crash because the main thread is blocked by mozilla::net::nsSocketTransportService::ShutdownThread(), which spins the event loop and never returns.

https://crash-stats.mozilla.com/report/index/7cd7e0ae-bba3-43ef-ae44-0ba890180206#allthreads
here netwerk/cache2/CacheFileIOManager.cpp:583 calls: mozilla::net::ShutdownEvent::PostAndWait()

https://crash-stats.mozilla.com/report/index/68ae56fb-016f-4ac1-be5c-d0acd0180206#allthreads
blocks main-thread with mozilla::net::CacheFileIOManager::SyncRemoveDir(nsIFile*, char const*)

https://crash-stats.mozilla.com/report/index/d8397a6e-923f-494b-a6f6-48f350180206#allthreads
maybe unrelated, but still necko: netwerk/protocol/http/nsHttpHandler.cpp:2766 spins the event loop and it doesn't return.

https://crash-stats.mozilla.com/report/index/83089e4b-77e7-4920-ae56-16d960180206#allthreads and
https://crash-stats.mozilla.com/report/index/a910bdd8-a605-4444-a2a6-d4d4f0180206#allthreads and
https://crash-stats.mozilla.com/report/index/73db45ac-e05c-4441-8a8f-1b5b00180206#allthreads
mozilla::net::nsHttpConnectionMgr::Shutdown() spins the event loop.

I recently landed a patch that starts the worker shutdown in xpcom-will-shutdown. This will improve the situation, but spinning the event loop while the xpcom-shutdown notification is being delivered definitely blocks other components from receiving the same notification.
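To make the blocking effect concrete, here is a minimal simulation (plain C++, not Gecko code; all names and the budget value are invented for illustration): observers of a shutdown notification are invoked synchronously, one after the other, so a single observer that spins the event loop on a condition that never becomes true exhausts the watchdog budget before later observers, like the worker RuntimeService, ever receive the notification.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model: xpcom-shutdown observers run synchronously in sequence.
// mSpinBudget stands in for the terminator's timeout; once it is gone,
// the watchdog would crash the process and remaining observers never run.
struct ShutdownSim {
  std::vector<std::function<void()>> mObservers;
  int mSpinBudget = 1000;

  void Notify() {
    for (auto& obs : mObservers) {
      obs();
      if (mSpinBudget <= 0) return;  // watchdog fired; stop notifying
    }
  }
};

// Returns whether the second (worker-like) observer got the notification.
bool RunSim(bool aFirstObserverHangs) {
  ShutdownSim sim;
  bool workersNotified = false;
  sim.mObservers.push_back([&] {
    // Stands in for e.g. nsHttpConnectionMgr::Shutdown spinning the loop
    // on a condition that never becomes true.
    while (aFirstObserverHangs && sim.mSpinBudget > 0) --sim.mSpinBudget;
  });
  sim.mObservers.push_back([&] { workersNotified = true; });
  sim.Notify();
  return workersNotified;
}
```

This is exactly the shape of the crash reports above: the worker code is not at fault; it simply never got its turn.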
Depends on: 1435958
Depends on: 1435960
Depends on: 1435961
I'm filing separate bugs for each component blocking the main-thread on shutdown. Canceling the NIs here.
Flags: needinfo?(michal.novotny)
Flags: needinfo?(honzab.moz)
Depends on: 1435962
Depends on: 1435963
Depends on: 1435964
Depends on: 1435966
I also found:
https://treeherder.mozilla.org/logviewer.html#?job_id=160567328&repo=autoland&lineNumber=48584-48596

Which is related to bug 1411908. It also looks like it could be related to this problem. Andrea, could you please check?
Flags: needinfo?(amarchesini)
> Which is related to bug 1411908. It also looks like it could be related to
> this problem. Andrea, could you please check?

You are right. This is related to bug 1435958. QuotaManager is blocking the main-thread.
Flags: needinfo?(amarchesini)
(In reply to Henrik Skupin (:whimboo) from comment #9)
> https://treeherder.mozilla.org/logviewer.
> html#?repo=autoland&job_id=160035538&lineNumber=60357

I assume a new bug needs to be filed for this case, which is mozilla::dom::workerinternals::RuntimeService::Cleanup
See Also: → 1434189
Firefox 60.0a1 Crash Report [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ]
ID: a5168015-8edb-4e09-ab5d-ada430180211

Date Processed 	2018-02-11 04:55:59
Uptime 	9,984 seconds (2 hours, 46 minutes and 24 seconds)
Last Crash 	611,637 seconds before submission (1 week, 1 hour and 53 minutes)
Install Age 	400,199 seconds since version was first installed (4 days, 15 hours and 9 minutes)
Install Time 	2018-02-06 10:22:39

Release Channel 	nightly
Version 	60.0a1
Build ID 	20180205220102
OS 	Windows 7

MOZ_CRASH Reason 	Workers Hanging - 0|A:3|S:0|Q:0-BC:0|WorkerHolderToken|PerformanceStorageWorkerHolder-BC:0|WorkerHolderToken|PerformanceStorageWorkerHolder-BC:0|WorkerHolderToken|PerformanceStorageWorkerHolder

Total Virtual Memory 	8,796,092,891,136 bytes (8.8 TB)
Available Virtual Memory 	8,793,084,715,008 bytes (8.79 TB)
Available Page File 	5,065,678,848 bytes (5.07 GB)
Available Physical Memory 	2,159,378,432 bytes (2.16 GB)

Crashing Thread (62), Name: Shutdown Hang Terminator
Frame 	Module 	Signature 	Source
0 	mozglue.dll 	MOZ_CrashOOL 	mfbt/Assertions.cpp:33
1 	xul.dll 	mozilla::dom::workerinternals::RuntimeService::CrashIfHanging() 	dom/workers/RuntimeService.cpp:2014
2 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog 	toolkit/components/terminator/nsTerminator.cpp:162
3 	nss3.dll 	PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
4 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:137
5 	ucrtbase.dll 	__crt_stdio_output::crop_zeroes(char*, __crt_locale_pointers* const) 	
6 		@0x1400a2 	
7 	ntdll.dll 	RtlUserThreadStart
(In reply to Trevor Skywalker from comment #12)
> Firefox 60.0a1 Crash Report [@
> mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ]
> ID: a5168015-8edb-4e09-ab5d-ada430180211

Here is another example of something blocking shutdown:

MOZ_CRASH Reason 	Workers Hanging - 0|A:3|S:0|Q:0-BC:0...
                                          ^

0 means that the xpcom-shutdown notification has not been received by RuntimeService yet.
The main thread seems busy doing some JS stuff.
We should see a decrease in these crash reports because of bug 1437575.
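For reference, the leading fields of the "Workers Hanging" status message can be read mechanically. The sketch below is a hypothetical decoder: the field meanings (shutdown-notified flag, then A/S/Q counts) are my reading of the comments in this bug, not the actual RuntimeService format, and the struct and function names are invented.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical decoder for the leading fields of a "Workers Hanging"
// status string such as "0|A:3|S:0|Q:0". Per the comments above, the
// first digit tells whether RuntimeService already received
// xpcom-shutdown; A/S/Q appear to be worker counts.
struct WorkerHangStatus {
  bool shutdownNotified;  // leading 0/1
  int active;             // A:
  int suspended;          // S:
  int queued;             // Q:
};

bool ParseWorkerHangStatus(const std::string& aMsg, WorkerHangStatus* aOut) {
  int notified = 0;
  if (std::sscanf(aMsg.c_str(), "%d|A:%d|S:%d|Q:%d", &notified,
                  &aOut->active, &aOut->suspended, &aOut->queued) != 4) {
    return false;
  }
  aOut->shutdownNotified = notified != 0;
  return true;
}
```

So "0|A:3|S:0|Q:0" reads as: shutdown not yet notified, three active workers, none suspended, none queued.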
Depends on: 1437575
(In reply to Andrea Marchesini [:baku] from comment #15)
> We should see a decrease in these crash reports because of bug 1437575.

Out of interest, could you explain why improved logging (as you mentioned on this other bug in the initial comment) makes it so that we do not see that many crashes anymore?
Flags: needinfo?(amarchesini)
In bug 1437575 I introduced a new crash message that is shown when the shutdown steps are not yet completed after the internal timeout. If this happens, it means that a component is blocking the main thread.

Because of this new crash message, we are not going to see the mozilla::dom::workerinternals::RuntimeService::CrashIfHanging signature except when the hang really happens because of workers.
Flags: needinfo?(amarchesini)
Ok, so that just changes the crash message and wouldn't reduce the total number of crash reports; it's just that this specific crash, as covered by this bug, won't happen that often anymore.

Thanks, and I will keep an eye out for it.
With the latest changes, these crash reports dropped. Can we reduce the priority to P2, maybe?
Flags: needinfo?(afarre)
Flags: needinfo?(afarre)
Priority: P1 → P2
Depends on: 1445020
No longer depends on: 1435958
Depends on: 1356853
Crash Signature: [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] → [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] [@ shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread ] [@ shutdownhang | mozilla::net::ShutdownEvent::PostAndWait] [@ shutdownhang | mozilla::Spi…
FF44-58
[@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup]
Showing results from 7 days ago - 8,354 Results

FF59
[@ mozilla::dom::workers::RuntimeService::CrashIfHanging ] 
Showing results from 7 days ago - 2,318 Results

FF60/61
[@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging]
Showing results from 7 days ago - 221 Results 

-------------------------------

Top Crashers for Firefox 52.7.3esr
17  0.67% 	0.09% 	shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup	804 	804 	0 	0 	648 	0 	2015-10-31

Top Crashers for Firefox 58.0.2
24 	0.56% 	0.15% 	shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup	59 	59 	0 	0 	59 	0 	2015-10-31

Top Crashers for Firefox 59.0.2
8 	1.19% 	-0.06% 	mozilla::dom::workers::RuntimeService::CrashIfHanging	1933 	1827 	97 	9 	1951 	0 	2017-11-16 

Top Crashers for Firefox 60.0b
17 	0.58% 	-0.09% 	mozilla::dom::workerinternals::RuntimeService::CrashIfHanging	205 	186 	17 	2 	199 	0 	2018-02-01

Top Crashers for Firefox 61.0a1
52 	0.22% 	-0.07% 	mozilla::dom::workerinternals::RuntimeService::CrashIfHanging	15 	15 	0 	0 	15 	0 	2018-02-01
Crash Signature: mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ mozilla::net::CacheFileIOManager::SyncRemoveDir ] → mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ mozilla::net::CacheFileIOManager::SyncRemoveDir ] [@ mozilla::dom::workers::RuntimeService::CrashIfHangin…
OS: Windows 10 → All
Hardware: Unspecified → All
Summary: Crash in mozilla::dom::workerinternals::RuntimeService::CrashIfHanging → Crash in [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging]. Shutdown problem in workers.
Version: unspecified → 44 Branch
Trevor, what about this bug suggests it's a service worker problem? Just trying to figure out why you attached it to bug 1328631.
Flags: needinfo?(skywalker333)
(In reply to Ben Kelly [:bkelly] from comment #23)
> Trevor, what about this bug suggests it's a service worker problem? Just
> trying to figure out why you attached it to bug 1328631.

Sorry, it's not related to service workers.
Blocks: 988872
No longer blocks: ServiceWorkers-stability
Flags: needinfo?(skywalker333)
See Also: → 705178
Crash Signature: mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ mozilla::net::CacheFileIOManager::SyncRemoveDir ] [@ mozilla::dom::workers::RuntimeService::CrashIfHangin… → mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown ] …
update on status of this?
Flags: needinfo?(amarchesini)
Sort of. I'm waiting for a try-push result for bug 1434618. If it doesn't break any test, moving the shutdown of workers to xpcom-will-shutdown should improve this crash.
Flags: needinfo?(amarchesini)
Still the #3 top crash on beta 62.0b16. It's been in the top 10 for many releases so I'm marking it fix-optional for 62.
Andrew, while Baku is away can you give this a look and see if anything immediately apparent jumps out at you?
Flags: needinfo?(mdaly) → needinfo?(bugmail)
Per Marion, not tracking for 63.
Quick update, more to come as I investigate this over the next few days intermixed with other work:
- I'm taking over the bug since :baku is now primarily working on privacy engineering and tracking protection.
- The special crash reporting that tries to help us identify what's going on with the workers is reporting dispatch errors. This wants to be investigated and fixed.
- Manual sampling of the crash reports involving mozilla::net::nsHttpConnectionMgr::Shutdown suggests that many of them don't actually have anything to do with worker shutdown.  However, some do seem to have workers around, so I wanted to script grabbing some tallies to provide directly actionable info to the necko team instead of just sweeping the dirt under a bunch of other, smaller rugs.
Assignee: amarchesini → bugmail
Status: NEW → ASSIGNED
removing NI and moving to DWS_NEXT
Flags: needinfo?(bugmail)
Whiteboard: DWS_NEXT
Assignee: bugmail → nobody
Status: ASSIGNED → NEW
Crash Signature: ] [@ mozilla::net::CacheFileIOManager::SyncRemoveDir ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] → ] [@ mozilla::net::CacheFileIOManager::SyncRemoveDir ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()]
Crash Signature: ] [@ mozilla::net::CacheFileIOManager::SyncRemoveDir ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()] → ] [@ mozilla::net::CacheFileIOManager::SyncRemoveDir ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()] [@…
Whiteboard: DWS_NEXT[stockwell unknown][topcrash-thunderbird] → [DWS_NEXT][stockwell unknown][tbird topcrash]

I want to understand better what all these signatures are about, so NI to myself.

Flags: needinfo?(jstutte)

[@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ] is the number one crash signature for the April 15 Linux Nightlies. I don't know if somebody was just having a really bad day or what.

Crash Signature: ] [@ mozilla::net::CacheFileIOManager::SyncRemoveDir ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()] … → ] [@ mozilla::net::CacheFileIOManager::SyncRemoveDir ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging]

Digging a bit into the signatures, I see three main buckets of signatures:

  1. Crashes at arbitrary places caused by MOZ_CRASH("Shutdown hanging before starting."); or by MOZ_CRASH("Shutdown too long, probably frozen, causing a crash.");

    1.1 shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread
    Happens mostly on 68.x but has some occurrences also on 75.0.

    1.2 shutdownhang | mozilla::net::ShutdownEvent::PostAndWait
    Here 75.0 and 68.x are dominant, but the total volume is one order of magnitude lower than 1.1.

    1.3 shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
    Here we have only versions up to 68.x, mostly ESR. Volume is about half of 1.1.

    1.4 shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown
    Here 75 and 52.9.0esr dominate the ranks. I assume this to be just a variant of 1.3.

    1.5 shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup
    Happens only on versions up to 52.9.0esr and can be safely ignored.

    None of these crashes seem really worker related to me (or I am overlooking something not evident); at least we do not know what caused the hang.

  2. Crashes with worker specific "Workers Hanging ..." MOZ_CRASH messages

    2.1. mozilla::dom::workerinternals::RuntimeService::CrashIfHanging
    (This signature has been added twice, it seems.)
    Here we have a collection of many different (but similar) MOZ_CRASH reasons. I assume they reflect the evolution of those messages through the different versions (as we can see versions back to 60.2.0esr here).
    The 4 top scorers (making together more than 60%) are:

1 	Workers Hanging - 1|A:1|S:0|Q:0-BC:1|WorkerDebuggeeRunnable::mSender 	                 393 	21.30 %
2 	Workers Hanging - 1|A:1|S:0|Q:0-BC:0Dispatch Error 	                                 283 	15.34 %
3 	Workers Hanging - 1|A:3|S:0|Q:0-BC:0Dispatch Error-BC:0Dispatch Error-BC:0Dispatch Error 282 	15.28 %
4 	Workers Hanging - 1|A:1|S:0|Q:0-BC:1|IDBOpenDBRequest 	                                 165 	 8.94 %
5 	Workers Hanging - 1|A:2|S:0|Q:0-BC:0Dispatch Error-BC:0Dispatch Error 	                 141 	 7.64 %

These are the cases to care (most) about in this bug, I think. It would be interesting to relate the different messages to the versions we see in order to narrow down similar causes.

    2.2 mozilla::dom::workers::RuntimeService::CrashIfHanging
    This signature happens only on very old, unsupported versions and can be safely ignored.

  3. Signatures without any MOZ_CRASH message at all

    3.1 mozilla::net::CacheFileIOManager::SyncRemoveDir
    Has a very low but recent volume. I think this signature deserves a bug on its own.

    3.2 shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
    Has no occurrences at all in our data and can be removed from the signatures.

:mccr8, am I reading the crash data correctly, and can we adjust the signatures relevant for this bug a bit?

(edit: it seems I am unable to format this well - hope it works anyway)

Flags: needinfo?(jstutte) → needinfo?(continuation)

Looking into case 2.1. for "Dispatch Error" messages:

The "Dispatch Error" message is constructed here if and only if the Dispatch() returns false.

The first opportunity to fail is the call to PreDispatch(mWorkerPrivate), which is a virtual function with many overrides.

Most implementations of that function just return true (some of them calling AssertIsOnMainThread(), some not), but there are four that do more:

EventRunnable::PreDispatch
Despite its length, it always returns true (if it does not crash). So probably not relevant here.

WorkerDebuggeeRunnable::PreDispatch
Has special behavior in case of ParentThreadUnchangedBusyCount, which can lead to false responses. This smells, as the busy count might be involved in determining pending workers? Interestingly, WorkerDebuggeeRunnable has its own shutdown hang messages, too.

WorkerRunnable::PreDispatch
Here we have special behavior in case of WorkerThreadModifyBusyCount, which returns the result of aWorkerPrivate->ModifyBusyCount(true);. Again, this smells.

NotifyRunnable::PreDispatch
Here we always return the result of aWorkerPrivate->ModifyBusyCount(true);. This smells even more than the previous case?

Andrew, it might be just a gut feeling (ignoring the details of that code), but my impression is that a PreDispatch returning false might provoke a shutdown hang by not manipulating the busy count (BC) correctly (in some cases)?
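The suspected failure mode can be illustrated with a toy model (not the actual WorkerRunnable code; all names are invented for illustration): if the PreDispatch step bumps a busy count and a later dispatch failure does not roll it back, the worker looks permanently busy at shutdown even though no runnable will ever run on it.

```cpp
#include <cassert>

// Toy worker whose busy count gates shutdown: a non-zero count would
// make shutdown consider the worker "pending".
struct FakeWorker {
  int mBusyCount = 0;
  bool mAcceptsRunnables = true;

  bool ModifyBusyCount(bool aIncrease) {
    mBusyCount += aIncrease ? 1 : -1;
    return true;
  }
};

// Models the dispatch path: PreDispatch bumps the count, then the actual
// dispatch may fail. aRollbackOnFailure models the behavior one would
// expect on failure.
bool DispatchWithBusyCount(FakeWorker& aWorker, bool aRollbackOnFailure) {
  if (!aWorker.ModifyBusyCount(true)) return false;  // "PreDispatch"
  if (!aWorker.mAcceptsRunnables) {                  // "Dispatch" fails
    if (aRollbackOnFailure) aWorker.ModifyBusyCount(false);
    return false;
  }
  return true;
}
```

Without the rollback, the leaked count is exactly the kind of state that could make CrashIfHanging report a busy worker that is in fact doing nothing.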

Flags: needinfo?(bugmail)

BTW, I made a sheet with the "Workers Hanging ..." messages for 75.0. Note the very long messages for some WorkerDebuggeeRunnable::mSender which are caused by many WorkerRefs in WorkerPrivate. This looks suspicious, too.

I don't know anything about worker shutdown, but your analysis sounds reasonable to me. It looks like baku fixed a bunch of issues back in 2018 when this was first filed, so it would make sense that some of the signatures might not be happening in recent versions. I'm not sure why he added the HTTP connection manager signatures to this bug.

Flags: needinfo?(continuation)
Blocks: 1633342
No longer blocks: 1633342

Removed all signatures but case 2.1 from this bug, created bug 1633342 to collect the other (probably net related) signatures (and dropped the signatures with cases for unsupported versions only).

Crash Signature: [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] [@ shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread ] [@ shutdownhang | mozilla::net::ShutdownEvent::PostAndWait] [@ shutdownhang | mozilla::Spi… → [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging]
Blocks: 1633342

Moved the single dependencies to bug 1633342. Still I am not sure, if all these dependencies are real.

No longer blocks: 1633342
Depends on: 1633342
No longer depends on: 1356853, 1435961, 1435962, 1445020, 1594572
Blocks: 1633342
No longer depends on: 1633342

Not sure why bugzilla switched those dependencies.

No longer blocks: 1633342
Depends on: 1633342
See Also: → 1633469

Expanding on comment 71, the relevant Runnable is a CrashIfHangingRunnable, whose PreDispatch always returns true, so that can't be the source of the failure. Going one step further into the relevant DispatchInternal, WorkerPrivate::DispatchControlRunnable is getting called. This function fails only if the worker's status is Dead, so that seems to be the source of the problem. This failure is possible because there's a race condition between a worker's status turning Dead and its removal from RuntimeService::mDomainMap. If we consider the simple case with a single worker, its removal happens here, and gets scheduled from here.
Long story short, it looks like the worker isn't really hanging; rather, the main thread never runs the Runnable that removes the record of the worker's existence. If we take a look at one of the relevant reports, we can see that the DOM Worker thread is indeed idle. It might be possible to reduce the number of reports that falsely attribute the hang to workers by removing the worker's entry from RuntimeService::mDomainMap at the same time its status changes to Dead; I have to look into it.
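A toy model of this window (illustrative names only, not the real RuntimeService types): the status flips to Dead immediately, but the map entry only disappears once the cleanup runnable runs on the main thread; until then, a CrashIfHanging-style scan still finds the worker, fails to dispatch to it, and reports a Dispatch Error even though nothing is hanging.

```cpp
#include <cassert>
#include <map>
#include <string>

enum class WorkerStatus { Running, Dead };

// Stand-in for the domain map that CrashIfHanging walks.
struct Registry {
  std::map<std::string, WorkerStatus> mDomainMap;

  // Mirrors the described behavior: dispatching a control runnable fails
  // exactly when the worker's status is already Dead.
  bool DispatchControlRunnable(const std::string& aKey) {
    auto it = mDomainMap.find(aKey);
    return it != mDomainMap.end() && it->second != WorkerStatus::Dead;
  }

  // A CrashIfHanging-style scan: count workers we fail to reach.
  int CountDispatchErrors() {
    int errors = 0;
    for (auto& entry : mDomainMap) {
      if (!DispatchControlRunnable(entry.first)) ++errors;
    }
    return errors;
  }
};
```

The window between marking the worker Dead and erasing its map entry is where the spurious "Dispatch Error" reports come from; erasing at the same time the status changes would close it.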

Assignee: nobody → ytausky
Depends on: 1636147
Flags: needinfo?(bugmail)

Should we then expect that, with the fix from bug 1636147, all the crashes with Dispatch Error in the message (around 70%) go away, leaving only the ones with WorkerDebuggeeRunnable::mSender? That would be a great reduction!

Flags: needinfo?(ytausky)

Yes, that's the idea. Those messages indicate that the main thread is hanging, not the workers.

Flags: needinfo?(ytausky)

(In reply to Jens Stutte [:jstutte] from comment #72)

BTW, I made a sheet with the "Workers Hanging ..." messages for 75.0. Note the very long messages for some WorkerDebuggeeRunnable::mSender which are caused by many WorkerRefs in WorkerPrivate. This looks suspicious, too.

(In reply to Jens Stutte [:jstutte] from comment #78)

Should we then expect that, with the fix from bug 1636147, all the crashes with Dispatch Error in the message (around 70%) go away, leaving only the ones with WorkerDebuggeeRunnable::mSender? That would be a great reduction!

It seems that the cases remaining so far on 78 all carry WorkerDebuggeeRunnable::mSender messages, which did not go away as predicted by Yaron, and for which we do not yet have a clear understanding of what is causing them.

Assignee: ytausky → perry
See Also: → 1566718
Assignee: perry → nobody
See Also: → 1660950
Depends on: 1664386

Looking at the first new crashes coming in from beta 83 with enhanced reporting as of bug 1664386 I see:

Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender

It may be too early to say this definitively, but it seems that the original assumption that chrome workers are blocking us is false (if we can trust the result of IsChromeWorker()).

As of bug 1664386 comment 1, this means that:

a) the (single) shutdown timeout has been reached by the RunWatchdog (active only in the parent process)
b) the shutdown steps were completed (sShutdownNotified == true)
c) there is a worker associated with some domain which is still able to receive runnables (and to respond!)
d) the blocking worker is not (necessarily) a chrome worker

Asuth, Yaron, are we aware of any non-chrome worker that may run in the parent process?

Flags: needinfo?(ytausky)
Flags: needinfo?(bugmail)

I crashed with Thunderbird 90.0b3 on Mac during shutdown - not password related
bp-3ca8e432-e5b3-4c41-b9e3-2ba500210701
0 XUL mozilla::dom::workerinternals::RuntimeService::CrashIfHanging() dom/workers/RuntimeService.cpp:1708 context
1 XUL mozilla::(anonymous namespace)::RunWatchdog(void*) toolkit/components/terminator/nsTerminator.cpp:230 scan
2 libnss3.dylib _pt_root nsprpub/pr/src/pthreads/ptthread.c:201 scan
3 libsystem_pthread.dylib _pthread_start scan
4 libsystem_pthread.dylib thread_start scan

Whiteboard: [DWS_NEXT][stockwell unknown][tbird topcrash] → [DWS_NEXT][stockwell unknown][tbird topcrash],qa-not-actionable
Crash Signature: [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] → [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()]

The variant with () is not happening anymore.

Crash Signature: [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()] → [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging]

FWIW, the () now get removed on crash stats, so they'll never show up in signatures.

Crash Signature: [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] → [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()]

I filed bugbug issues on it reverting a change and on it generating junk signatures.
https://github.com/mozilla/bugbug/issues/2540
https://github.com/mozilla/bugbug/issues/2541

Crash Signature: [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()] → [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()]

IIRC the bot is just grabbing the signatures from the duplicate bugs.

I adjusted the signature in the other bug accordingly, too.

Crash Signature: [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging()] → [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging]

In this case it's not a bugbug-based change, so I moved the issues to the relman-auto-nag repository.

(In reply to Julien Cristau [:jcristau] from comment #93)

IIRC the bot is just grabbing the signatures from the duplicate bugs.

Yes, exactly. Let's discuss it in the issues.

I removed the () from the dupes, and filed a bug on the TreeHerder intermittent filer, as I think that's actually where the () in signatures are from. Sorry for my confusion! I forgot about the duplicate bug signature thing.

FWIW, in the most frequent case we still see all variations of repetition depth for:

Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Flags: needinfo?(bugmail)
Flags: needinfo?(ytausky) → needinfo?(jstutte)

Adjusting severity/priority based on the frequency.

Severity: critical → S3
Priority: P2 → P3

The severity field for this bug is set to S3. However, the bug has the topcrash keyword.
:jstutte, could you consider increasing the severity of this top-crash bug? If the crash isn't "top" anymore, could you drop the topcrash keyword?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

This seems to be a top crash only for Thunderbird, where it spiked up from version 91.7.

I am not sure what this would mean for our severity/priority here.

Flags: needinfo?(jstutte)
Keywords: topcrash

So looking a bit at some crashes, we (still) mostly seem to have a problem with the WorkerDebuggeeRunnable here.

Looking at https://searchfox.org/mozilla-central/rev/da6a85e615827d353e5ca0e05770d8d346b761a9/dom/workers/WorkerPrivate.h#1245 I am wondering if we just never get a chance to execute the debuggee runnable, since this is a throttled event queue that targets the main thread (which is probably busy all the time during shutdown).

I am wondering if it is really a good idea to hold a ThreadSafeWorkerRef here (which is a StrongWorkerRef) and if it wouldn't be better to downgrade this to a WeakWorkerRef?
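The suspected starvation can be sketched with a toy model (not the real ThrottledEventQueue; the scheduling rule here is deliberately simplified to "the throttled queue only gets a slot when the target thread is otherwise idle"): a main thread that keeps feeding itself events during shutdown never drains the debuggee runnables.

```cpp
#include <cassert>
#include <deque>
#include <functional>

// Toy model of a throttled queue targeting a busy main thread.
struct ThrottledQueueSim {
  std::deque<std::function<void()>> mMainQueue;  // main-thread events
  std::deque<std::function<void()>> mThrottled;  // debuggee runnables

  // Run aTicks main-thread iterations; the throttled queue only gets a
  // slot when the main queue is empty.
  void Run(int aTicks) {
    for (int i = 0; i < aTicks; ++i) {
      if (!mMainQueue.empty()) {
        auto task = mMainQueue.front();
        mMainQueue.pop_front();
        task();
      } else if (!mThrottled.empty()) {
        auto task = mThrottled.front();
        mThrottled.pop_front();
        task();
      }
    }
  }
};
```

In this model a main thread that re-posts work to itself starves the throttled runnable indefinitely, which matches the picture of mSender's runnable never executing before the watchdog fires.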

Flags: needinfo?(echuang)

(In reply to Jens Stutte [:jstutte] from comment #100)

This seems to be a top crash only for Thunderbird, where it spiked up from version 91.7.

Indeed it is #3 crash for Thunderbird. I had marked up bug 1435961 for this.

The spike is false - it is the result of Thunderbird not having crash reporting on crash-stats from Nov 2021 to April 2022.

I am not sure what this would mean for our severity/priority here.

I am wondering if it is really a good idea to carry away a ThreadSafeWorkerRef here (which is a StrongWorkerRef) and if it wouldn't be better to downgrade this to a WeakWorkerRef ?

Any idea if this would help Thunderbird crashes?

Flags: needinfo?(jstutte)

The sole purpose of this mSender worker ref seems to be to keep the worker alive, as there is no real use of that variable. Downgrading it to a weak worker ref would be equivalent to removing it, at this point.

What is not clear to me is why we think we need to keep the worker alive. I assume we want to be sure the worker is still alive when we execute the runnable on the main thread (if we are on the worker thread, we are surely alive).

Looking at the sub-classes of WorkerDebuggeeRunnable it seems:

Instead:

I assume if we move to a weak worker ref, those runnables should check the worker ref before doing anything with the WorkerPrivate* ?

(In reply to Wayne Mery (:wsmwk) from comment #102)

Any idea if this would help Thunderbird crashes?

Well, the huge difference in numbers here seems to indicate that Thunderbird's main thread loop is too busy to ever let the RefPtr<ThrottledEventQueue> mMainThreadDebuggeeEventTarget event queue execute its events on the main thread. So apparently there is an issue on Thunderbird's side with being too busy on the main thread with whatever it is doing during worker shutdown.

But if we ensure that the worker can go away without harm before our WorkerDebuggeeRunnable ever executes on the main thread, that could definitely help both Firefox and Thunderbird. But I need some more expertise from Eden here to make sure this is not going to break other things.

Flags: needinfo?(jstutte)
Assignee: nobody → jstutte
Attachment #9276716 - Attachment description: WIP: Bug 1435343: Use a weak worker reference for WorkerDebuggeeRunnable. → Bug 1435343: Use a weak worker reference for WorkerDebuggeeRunnable. r?#dom-worker-reviewers
Status: NEW → ASSIGNED

(In reply to Jens Stutte [:jstutte] from comment #104)

(In reply to Wayne Mery (:wsmwk) from comment #102)

Any idea if this would help Thunderbird crashes?

Well, the huge difference in numbers here seems to indicate that Thunderbird's main thread loop is too busy to ever let the RefPtr<ThrottledEventQueue> mMainThreadDebuggeeEventTarget event queue execute its events on the main thread. So apparently there is an issue on Thunderbird's side with being too busy on the main thread with whatever it is doing during worker shutdown.

But if we ensure that the worker can go away without harm before our WorkerDebuggeeRunnable ever executes on the main thread, that could definitely help, both Firefox & Thunderbird.

To be clear: the patch here would no longer let those runnables block the worker shutdown. Still, on Thunderbird something frequently seems to prevent those runnables from ever being executed in time. Whatever this is, it might still cause a different flavor of hang after this patch lands. I would thus not be too optimistic that this patch fixes those hangs, but it might help us get better diagnostics.

Canceling the ni? as I asked for review on the patch.

Flags: needinfo?(echuang)

(In reply to Jens Stutte [:jstutte] from comment #103)

The sole purpose of this mSender worker ref seems to be to keep the worker alive, as there is no real use of that variable. Downgrading it to a weak worker ref would be equivalent to removing it, at this point.

What is not clear to me is why we think we need to keep the worker alive. I assume we want to be sure the worker is still alive when we execute the runnable on the main thread (if we are on the worker thread, we are surely alive).

Looking at the sub-classes of WorkerDebuggeeRunnable it seems:

Instead:

I assume if we move to a weak worker ref, those runnables should check the worker ref before doing anything with the WorkerPrivate* ?

So I fear things are a bit more complicated, at least for the MessageEventRunnable. If a worker dispatches this kind of event, we must make sure it arrives somewhere. It seems as if keeping the worker alive was kind of a trick to ensure this. Actually, I think the MessageEventRunnable should extract all needed information from the worker on dispatch, in order to be able to deliver the event even after the worker has ended.

Flags: needinfo?(echuang)
Depends on: 1769913

Comment on attachment 9276716 [details]
Bug 1435343: Use a weak worker reference for WorkerDebuggeeRunnable. r?#dom-worker-reviewers

Revision D146447 was moved to bug 1769913. Setting attachment 9276716 [details] to obsolete.

Attachment #9276716 - Attachment is obsolete: true

Moved investigation to bug 1769913.

Flags: needinfo?(echuang)

(In reply to Jens Stutte [:jstutte] from comment #107)

(In reply to Jens Stutte [:jstutte] from comment #103)

The sole purpose of this mSender worker ref seems to be to keep the worker alive, as there is no real use of that variable. Downgrading it to a weak worker ref would be equivalent to removing it, at this point.

What is not clear to me is why we think we need to keep the worker alive. I assume we want to be sure the worker is still alive when we execute the runnable on the main thread (if we are on the worker thread, we are surely alive).

Looking at the sub-classes of WorkerDebuggeeRunnable it seems:

Instead:

I assume if we move to a weak worker ref, those runnables should check the worker ref before doing anything with the WorkerPrivate* ?

So I fear things are a bit more complicated, at least for the MessageEventRunnable. If a worker dispatches this kind of event, we must make sure it arrives somewhere. It seems as if keeping the worker alive was kind of a trick to ensure this. Actually, I think the MessageEventRunnable should extract all needed information from the worker on dispatch, in order to be able to deliver the event even after the worker has ended.

Maybe having some sort of interim "check-in" thread between those two would work. Just thinking out loud here. It seems there are a lot of things happening, and they need to happen in a more organized manner overall.

(In reply to Worcester12345 from comment #110)

Maybe having some sort of interim "check-in" thread between those two would work. Just thinking out loud here. It seems there are a lot of things happening, and they need to happen in a more organized manner overall.

Not sure I get the idea here.

(In reply to Intermittent Failures Robot from comment #113)

For more details, see:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1435343&startday=2022-09-05&endday=2022-09-11&tree=all

The things I see here do not really seem to be related to the original bug.

TB 102.3.2 (32-bit) on Windows 10 64-bit crashed after first closing the TB main window and then sending a composed mail from an open mail-editor window. (Master password active.)

bp-7e21d996-0b93-4f56-8904-742c00221013
Thunderbird 102.3.2 Crash Report [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ]

MOZ_CRASH Reason (Sanitized)

Workers Hanging - 1|A:6|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender

Crashing Thread (40), Name: Shutdown Hang Terminator
Frame 	Module 	Signature 	Source 	Trust
0 	xul.dll 	mozilla::dom::workerinternals::RuntimeService::CrashIfHanging() 	dom/workers/RuntimeService.cpp:1603 	context
1 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog(void*) 	toolkit/components/terminator/nsTerminator.cpp:232 	cfi
2 	nss3.dll 	_PR_NativeRunThread(void*) 	nsprpub/pr/src/threads/combined/pruthr.c:399 	cfi
3 	nss3.dll 	pr_root(void*) 	nsprpub/pr/src/md/windows/w95thred.c:139 	cfi
4 	ucrtbase.dll 	thread_start<unsigned int (__stdcall*)(void*), 1> 		cfi
5 	kernel32.dll 	BaseThreadInitThunk 		cfi
6 	mozglue.dll 	patched_BaseThreadInitThunk(int, void*, void*) 	toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp:572 	cfi
7 	ntdll.dll 	__RtlUserThreadStart 		cfi
8 	ntdll.dll 	_RtlUserThreadStart 		cfi

The remaining intermittent failure instances seem unrelated to workers. See bug 1805147, for example.

See Also: → 1811136
Depends on: 1823391

Is https://crash-stats.mozilla.org/report/index/6ec67d36-0c11-4739-a2e4-6025b0230322 an example** of this bug, where the Thunderbird user was locked out by something password related, perhaps too many failed password attempts?

**Crash reason is listed as
Workers Hanging - 1|A:2|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender

Also https://crash-stats.mozilla.org/report/index/fe8b8258-7db3-4c48-9244-d9e3c0230323

Or should we be filing these as Thunderbird bugs?

Flags: needinfo?(jstutte)

(In reply to Wayne Mery (:wsmwk) from comment #137)

Is https://crash-stats.mozilla.org/report/index/6ec67d36-0c11-4739-a2e4-6025b0230322 an example** of this bug, where the Thunderbird user was locked out by something password related, perhaps too many failed password attempts?

Also https://crash-stats.mozilla.org/report/index/fe8b8258-7db3-4c48-9244-d9e3c0230323

Or should we be filing these as Thunderbird bugs?

Meta for a sec: This bug exists primarily to track the Firefox shutdown hangs attributed to workers by way of the associated crash signature. These hangs are potentially attributable to:

  1. Bugs in the core worker implementation. (Frequently technical debt related.)
  2. Bugs in Web APIs exposed on workers.
  3. Bugs in system JS code using workers, potentially involving a failure to pay attention to shutdown phases.
  4. Bugs in system code related to shutdown phases, such as failing to generate appropriate shutdown phases or tear down content globals, etc.

For the specific crashes you identify above, it's very likely that bug 1800659 will address the (type 1) problem. Unfortunately, that's only going to be landing in v116 and it's unlikely we'll be able to uplift[1], which is very unfortunate for Thunderbird's model of building against ESR (and the Firefox ESR itself).

Because of Thunderbird building against ESR and because I think it's potentially difficult to distinguish type 3 and type 4 problems that are specific to Thunderbird until after performing a potentially detailed investigation, it's likely appropriate to file distinct Thunderbird bugs which can be marked as depending on platform bugs as appropriate.

For TB built against m-c trunk/beta/release the calculus changes a bit because type 1 and type 2 problems are more likely to be timely. However, I think it could still make sense to file distinct TB bugs because type 3 and type 4 factors are still potentially so significant and in the event we get users commenting on the bug, the potential for confusion goes up. Also, this simplifies bug prioritization since the product impacts may vary.

For the filed TB bugs where the TB team would like input from the workers team the best practices would probably be:

  • Try and make sure the bug is shovel ready by including:
    • If available, the Workers Hanging string (as you've done above, thank you!). This should be available in crash reports (as protected data that is okay to report in bugs because we explicitly do not include any origin data, although anyone propagating the information should of course confirm there is nothing potentially identifying before pasting), and in debug builds where MOZ_ReportCrash does a printf.
    • If stdout (or wherever MOZ_LOG output would go if enabled, which may be MOZ_LOG_FILE) is available, any of the worker state information added in https://phabricator.services.mozilla.com/D173430, which is automatically emitted under the MOZ_LOG category "WorkerShutdownDump" (which gets temporarily force-enabled). The output looks like https://bugzilla.mozilla.org/show_bug.cgi?id=1805613#c83
    • Links to any Thunderbird documentation about:
      • Its shutdown phases for content and system/app logic
      • Its use of workers for system/app logic. This should also include mention of any subsystems that might have previously been used in Firefox and m-c but which are no longer used (or maybe even present) in Firefox/m-c but have been forked into TB, etc.
  • Any context about extensions indicated by the crash-stats which might use the TB extension experiments mechanism that allows extensions to do all the legacy add-on stuff that Firefox is able to assume is no longer possible. My concern here would be add-ons that are creating workers and are not aware of shutdown phases as opposed to things the add-on would be doing in the worker since XPConnect is not exposed to workers and "ctypes" usage should show up in crash stacks (if active at the time of the crash).
  • Ask about the bug in the Workers & Storage chat.mozilla.org channel, pasting the bug link there. The rationale is that there really isn't 1 right person to answer questions about the factors that might contribute to shutdown hangs, especially as type 2 issues will be something that the core worker peers won't necessarily be directly aware of.
    • That said, it might make sense for the TB team to designate a dev as the "worker liaison"/similar and so anyone triaging TB crashes/bugs in this space could needinfo the relevant TB dev.

1: It's likely that the bug 1800659 fixes will be a massive improvement for these specific hangs and would be appropriate for uplift on its own, but it also:

  • Represents a major shift in worker behavior that potentially will result in a number of fixes in other components and this would increase uplift risk because those fixes may result in their own cascade of fixes which could intertwine with new functionality, etc.
  • Is expected to be followed-up by a number of other worker technical debt paydown refactorings for which it's also not clear we could uplift. And arguably it would be better to leave ESR in the pre-bug 1800659 state that has been the equilibrium state for a long time rather than having ESR have a temporary intermediate equilibrium that might only exist in Fx116 or maybe not exist in any shipped Firefox if we land more refactorings in 116 that we definitely don't want to risk uplifting (I have a few of these...).
Flags: needinfo?(jstutte)
Blocks: 1843744
Depends on: 1800659

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #138)

...
1: It's likely that the bug 1800659 fixes will be a massive improvement for these specific hangs and would be appropriate for uplift on its own, but it also:

  • Represents a major shift in worker behavior that potentially will result in a number of fixes in other components and this would increase uplift risk because those fixes may result in their own cascade of fixes which could intertwine with new functionality, etc.
  • Is expected to be followed-up by a number of other worker technical debt paydown refactorings for which it's also not clear we could uplift. And arguably it would be better to leave ESR in the pre-bug 1800659 state that has been the equilibrium state for a long time rather than having ESR have a temporary intermediate equilibrium that might only exist in Fx116 or maybe not exist in any shipped Firefox if we land more refactorings in 116 that we definitely don't want to risk uplifting (I have a few of these...).

Thanks for that info. So indeed bug 1800659 is on version 116 and not backported to esr115.

Removing the regression keyword: this bug is 6 years old, and the reasons for hangs are manifold and might have changed over time anyway.

Keywords: regression

(In reply to Wayne Mery (:wsmwk) from comment #141)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #138)

...
1: It's likely that the bug 1800659 fixes will be a massive improvement for these specific hangs and would be appropriate for uplift on its own, ...

Thanks for that info. So indeed bug 1800659 is on version 116 and not backported to esr115.

From the Firefox numbers I cannot really see any improvement here for >=116.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash