The default bug view has changed. See this FAQ.

Intermittent test_645970.js,test_logging.js,test_startup_session_sync.js,test_startup_nosession_sync.js | application crashed [@ mozalloc_abort(char const*)] from ASSERTION: thread pool wasn't shutdown

RESOLVED FIXED in Firefox 27

Status

()

Core
DOM: Workers
--
critical
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: philor, Assigned: Yoric)

Tracking

(Depends on: 1 bug, {assertion, crash, intermittent-failure})

Trunk
mozilla27
x86
All
assertion, crash, intermittent-failure
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox25 wontfix, firefox26 wontfix, firefox27 fixed, firefox-esr24 wontfix)

Details

Attachments

(1 attachment, 16 obsolete attachments)

2.06 KB, patch
Yoric
: review+
Details | Diff | Splinter Review
(Reporter)

Description

4 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=20090551&tree=Mozilla-Inbound
Ubuntu 12.04 mozilla-inbound debug test xpcshell on 2013-02-25 20:36:43 PST for push 70814e7dbcb8
slave: tst-linux32-ec2-056

21:46:56     INFO -  TEST-PASS | /builds/slave/test/build/tests/xpcshell/tests/toolkit/components/search/tests/xpcshell/test_645970.js | [run_test : 23] [xpconnect wrapped nsISearchEngine @ 0x9e367e8 (native @ 0x9e365b8)] != null
21:46:56     INFO -  *** Search: SRCH_SVC_saveSortedEngineList: starting
21:46:56     INFO -  TEST-INFO | (xpcshell/head.js) | test 1 finished
21:46:56     INFO -  TEST-INFO | (xpcshell/head.js) | exiting test
21:46:56     INFO -  TEST-PASS | (xpcshell/head.js) | 1 (+ 1) check(s) passed
21:46:56     INFO -  TEST-INFO | (xpcshell/head.js) | 0 check(s) todo
21:46:56     INFO -  WARNING: nsExceptionService ignoring thread destruction after shutdown: file ../../../xpcom/base/nsExceptionService.cpp, line 167
21:46:56     INFO -  ###!!! ASSERTION: thread pool wasn't shutdown: '!mPool', file ../../../../netwerk/base/src/nsStreamTransportService.cpp, line 435
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x003425C9]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x00342674]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x003F948F]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x015C46E6]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x01608089]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x016050B1]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x015CA3AF]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x01605B47]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x01605BB4]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/libxul.so +0x015CFB67]
21:46:56     INFO -  NS_ShutdownXPCOM+0x0000001A [/builds/slave/test/build/application/firefox/libxpcom.so +0x00001DD6]
21:46:56     INFO -  UNKNOWN [/builds/slave/test/build/application/firefox/xpcshell +0x00009388]
21:46:56     INFO -  ###!!! ASSERTION: thread pool wasn't shutdown: '!mPool', file ../../../../netwerk/base/src/nsStreamTransportService.cpp, line 435
21:46:56     INFO -  <<<<<<<
21:47:16     INFO -  PROCESS-CRASH | /builds/slave/test/build/tests/xpcshell/tests/toolkit/components/search/tests/xpcshell/test_645970.js | application crashed [@ mozalloc_abort(char const*)]
21:47:16     INFO -  Crash dump filename: /builds/slave/test/build/tests/xpcshell/tests/toolkit/components/search/tests/xpcshell/24d47b23-07e3-f49d-4b785bfa-18f81c59.dmp
21:47:16     INFO -  Operating system: Linux
21:47:16     INFO -                    0.0.0 Linux 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686
21:47:16     INFO -  CPU: x86
21:47:16     INFO -       GenuineIntel family 6 model 26 stepping 5
21:47:16     INFO -       1 CPU
21:47:16     INFO -  Crash reason:  SIGSEGV
21:47:16     INFO -  Crash address: 0x0
21:47:16     INFO -  Thread 0 (crashed)
21:47:16     INFO -   0  libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp : 30 + 0x0]
21:47:16     INFO -      eip = 0xb7722f0d   esp = 0xbfa13ec0   ebp = 0xbfa13ed8   ebx = 0xb7724120
21:47:16     INFO -      esi = 0xb4868d9c   edi = 0xbfa13f08   eax = 0x0000000a   ecx = 0xffffffff
21:47:16     INFO -      edx = 0xb48698ac   efl = 0x00210282
21:47:16     INFO -      Found by: given as instruction pointer in context
21:47:16     INFO -   1  libxul.so!NS_DebugBreak_P [nsDebugImpl.cpp:70814e7dbcb8 : 422 + 0x5]
21:47:16     INFO -      eip = 0xb5ffd692   esp = 0xbfa13ee0   ebp = 0xbfa14318
21:47:16     INFO -      Found by: previous frame's frame pointer
21:47:16     INFO -   2  libxul.so!nsStreamTransportService::~nsStreamTransportService [nsStreamTransportService.cpp:70814e7dbcb8 : 435 + 0x23]
21:47:16     INFO -      eip = 0xb4d265c9   esp = 0xbfa14320   ebp = 0xbfa14348
21:47:16     INFO -      Found by: previous frame's frame pointer
21:47:16     INFO -   3  libxul.so!nsStreamTransportService::Release() [nsStreamTransportService.cpp:70814e7dbcb8 : 457 + 0x5]
21:47:16     INFO -      eip = 0xb4d26674   esp = 0xbfa14350   ebp = 0xbfa14378
21:47:16     INFO -      Found by: previous frame's frame pointer
21:47:16     INFO -   4  libxul.so!nsCOMPtr_base::assign_assuming_AddRef(nsISupports*) [nsCOMPtr.h:70814e7dbcb8 : 442 + 0x7]
21:47:16     INFO -      eip = 0xb4ddd48f   esp = 0xbfa14380   ebp = 0xbfa143a8
21:47:16     INFO -      Found by: previous frame's frame pointer
21:47:16     INFO -   5  libxul.so!nsCOMPtr_base::assign_with_AddRef(nsISupports*) [nsCOMPtr.cpp:70814e7dbcb8 : 49 + 0x8]
21:47:16     INFO -      eip = 0xb5fa86e6   esp = 0xbfa143b0   ebp = 0xbfa143c8
21:47:16     INFO -      Found by: previous frame's frame pointer
21:47:16     INFO -   6  libxul.so!FreeFactoryEntries [nsCOMPtr.h:70814e7dbcb8 : 947 + 0x9]
21:47:16     INFO -      eip = 0xb5fec089   esp = 0xbfa143d0   ebp = 0xbfa143e8
21:47:16     INFO -      Found by: previous frame's frame pointer
21:47:16     INFO -   7  libxul.so!nsBaseHashtable<nsIDHashKey, nsFactoryEntry*, nsFactoryEntry*>::s_EnumReadStub(PLDHashTable*, PLDHashEntryHdr*, unsigned int, void*) [nsBaseHashtable.h:70814e7dbcb8 : 400 + 0x2]
21:47:16     INFO -      eip = 0xb5fe90b1   esp = 0xbfa143f0   ebp = 0xbfa14408
21:47:16     INFO -      Found by: previous frame's frame pointer
21:47:16     INFO -   8  libxul.so!PL_DHashTableEnumerate [pldhash.cpp:70814e7dbcb8 : 717 + 0x11]
21:47:16     INFO -      eip = 0xb5fae3af   esp = 0xbfa14410   ebp = 0xbfa14458
21:47:16     INFO -      Found by: previous frame's frame pointer
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
josh, jduell: this is one of our most frequent intermittent orange bugs. We're tripping an assertion in nsStreamTransportService:
###!!! ASSERTION: thread pool wasn't shutdown: '!mPool', file ../../../../netwerk/base/src/nsStreamTransportService.cpp, line 435

in this xpcshell test:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/search/tests/xpcshell/test_645970.js

That test is manually sending out a "quit-application" observer service notification, so I wonder if it's screwing up nsStreamTransportService's expectations around shutdown ordering?

Can you take a look at this (or find someone else who can)?
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Josh: can you get someone to take a look at this? See comment 103 for my brief synopsis.
Flags: needinfo?(joshmoz)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Test disabled for Linux debug for now, until we get get responses to comment 103 / comment 115):
https://hg.mozilla.org/integration/mozilla-inbound/rev/509b300bda68
Whiteboard: [test disabled on Linux debug][leave open]
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)

Comment 176

4 years ago
Looking into an owner now.
Flags: needinfo?(joshmoz)

Updated

4 years ago
Flags: needinfo?(joshmoz)
https://hg.mozilla.org/mozilla-central/rev/509b300bda68
Comment hidden (Treeherder Robot)

Updated

4 years ago
Flags: needinfo?(joshmoz)
Comment hidden (Treeherder Robot)
nsStreamTransportService gets first instantiated after NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID.


When doing check-interactive of the failing test, after test has run and I type quit() I'm getting the following stack (windows):

 	xul.dll!nsStreamTransportService::Init()  Line 441	C++
 	xul.dll!nsStreamTransportServiceConstructor(nsISupports * aOuter=0x00000000, const nsID & aIID={...}, void * * aResult=0x0034f568)  Line 67 + 0x79 bytes	C++
 	xul.dll!mozilla::GenericFactory::CreateInstance(nsISupports * aOuter=0x00000000, const nsID & aIID={...}, void * * aResult=0x0034f568)  Line 16 + 0x14 bytes	C++
 	xul.dll!nsComponentManagerImpl::CreateInstance(const nsID & aClass={...}, nsISupports * aDelegate=0x00000000, const nsID & aIID={...}, void * * aResult=0x0034f568)  Line 1009 + 0x25 bytes	C++
 	xul.dll!nsComponentManagerImpl::GetService(const nsID & aClass={...}, const nsID & aIID={...}, void * * result=0x0034f5c0)  Line 1257 + 0x34 bytes	C++
 	xul.dll!CallGetService(const nsID & aCID={...}, const nsID & aIID={...}, void * * aResult=0x0034f5c0)  Line 52	C++
 	xul.dll!nsGetServiceByCIDWithError::operator()(const nsID & aIID={...}, void * * aInstancePtr=0x0034f5c0)  Line 234 + 0x13 bytes	C++
 	xul.dll!nsCOMPtr<nsIStreamTransportService>::assign_from_gs_cid_with_error(const nsGetServiceByCIDWithError & gs={...}, const nsID & aIID={...})  Line 1193 + 0xf bytes	C++
 	xul.dll!nsCOMPtr<nsIStreamTransportService>::nsCOMPtr<nsIStreamTransportService>(const nsGetServiceByCIDWithError & gs={...})  Line 581	C++
 	xul.dll!nsInputStreamPump::AsyncRead(nsIStreamListener * listener=0x0427aa98, nsISupports * ctxt=0x00000000)  Line 307	C++
 	xul.dll!nsBaseChannel::BeginPumpingData()  Line 223 + 0x3a bytes	C++
 	xul.dll!nsBaseChannel::AsyncOpen(nsIStreamListener * listener=0x04a6c988, nsISupports * ctxt=0x02b61480)  Line 600 + 0xb bytes	C++
>	xul.dll!`anonymous namespace'::ScriptLoaderRunnable::RunInternal()  Line 324 + 0x34 bytes	C++
 	xul.dll!`anonymous namespace'::ScriptLoaderRunnable::Run()  Line 159 + 0xb bytes	C++
 	xul.dll!nsThread::ProcessNextEvent(bool mayWait=false, bool * result=0x0034f7ff)  Line 627 + 0x19 bytes	C++
 	xul.dll!NS_ProcessPendingEvents_P(nsIThread * thread=0x01d10750, unsigned int timeout=4294967295)  Line 188 + 0x14 bytes	C++
 	xul.dll!mozilla::ShutdownXPCOM(nsIServiceManager * servMgr=0x00000000)  Line 564	C++
 	xul.dll!NS_ShutdownXPCOM_P(nsIServiceManager * servMgr=0x00000000)  Line 519 + 0x9 bytes	C++
 	xpcom.dll!NS_ShutdownXPCOM(nsIServiceManager * svcMgr=0x00000000)  Line 134 + 0xa bytes	C++
 	xpcshell.exe!NS_internal_main(int argc=24, char * * argv=0x01c4f6bc, char * * envp=0x01c431b8)  Line 1973 + 0x8 bytes	C++
 	xpcshell.exe!wmain(int argc=31, wchar_t * * argv=0x01cfbc68)  Line 109 + 0x16 bytes	C++


This is just before NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID is fired.  I can also see the following stack (events fired from inside call to NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID notification:



>	xul.dll!`anonymous namespace'::ScriptLoaderRunnable::RunInternal()  Line 243	C++
 	xul.dll!`anonymous namespace'::ScriptLoaderRunnable::Run()  Line 159 + 0xb bytes	C++
 	xul.dll!nsThread::ProcessNextEvent(bool mayWait=true, bool * result=0x0034f53b)  Line 627 + 0x19 bytes	C++
 	xul.dll!NS_ProcessNextEvent_P(nsIThread * thread=0x01d10750, bool mayWait=true)  Line 238 + 0x17 bytes	C++
 	xul.dll!mozilla::dom::workers::RuntimeService::Cleanup()  Line 1192 + 0xe bytes	C++
 	xul.dll!mozilla::dom::workers::RuntimeService::Observe(nsISupports * aSubject=0x00000000, const char * aTopic=0x556e7cdc, const wchar_t * aData=0x00000000)  Line 1417	C++
 	xul.dll!nsObserverList::NotifyObservers(nsISupports * aSubject=0x00000000, const char * aTopic=0x556e7cdc, const wchar_t * someData=0x00000000)  Line 100	C++
 	xul.dll!nsObserverService::NotifyObservers(nsISupports * aSubject=0x00000000, const char * aTopic=0x556e7cdc, const wchar_t * someData=0x00000000)  Line 164	C++
 	xul.dll!mozilla::ShutdownXPCOM(nsIServiceManager * servMgr=0x00000000)  Line 570	C++
 	xul.dll!NS_ShutdownXPCOM_P(nsIServiceManager * servMgr=0x00000000)  Line 519 + 0x9 bytes	C++
 	xpcom.dll!NS_ShutdownXPCOM(nsIServiceManager * svcMgr=0x00000000)  Line 134 + 0xa bytes	C++
 	xpcshell.exe!NS_internal_main(int argc=24, char * * argv=0x01c4f6bc, char * * envp=0x01c431b8)  Line 1973 + 0x8 bytes	C++
 	xpcshell.exe!wmain(int argc=31, wchar_t * * argv=0x01cfbc68)  Line 109 + 0x16 bytes	C++


If there is some specific change in order of observers to NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID on Ubuntu or it's just a timing issue of some kind, I don't know.

I don't see any bug that would be landed around the time this started to be so intensive, maybe except Bug 822399 (cc Olli).  That is a wild theory only.


This is definitely not caused by some network change.
(In reply to Honza Bambas (:mayhemer) from comment #180)
> When doing check-interactive of the failing test, after test has run and I
> type quit() I'm getting the following stack (windows):

I've seen an issue which looked imilar to this one while working on a different bug. That issue was fixed in bug 835730 and boiled down to this: nsBrowserGlue was tearing down some stuff when receiving an xpcom-shutdown event; those components were initialized during profile loading and in some tests this wasn't happening. In the end various shutdown functions were called on components that had never been initialized leading to an error very much like this one when manually calling _quit() interactively in the affected xpcshell test.

Since the problem might be something similar I'll have a look later today and see if can track it down.
Thanks, Gabriele.
I've gathered a few more clues about this failure and they point in a direction similar my previous experience:

- The assertion is triggered by executing a script (osfile_async_worker.js) as part of an xpcom-shutdown observer after ShutdownXPCOM() has been called

- This is causing a ScriptLoaderRunnable class to be instanced three times, it is from here that the nsStreamTransportService code is being called. The first time it runs, the |mPool| field is non-null and everything goes fine, this is during ShutdownXPCOM(). The second and third times it's in the "xpcom-shutdown" observer and that's where the assertion is being triggered

- In every case this is happening in a DOM worker thread, this is created here as part of the test:

https://mxr.mozilla.org/mozilla-central/source/toolkit/components/search/nsSearchService.js#2708

      let promise = OS.File.writeAtomic(path, data, { tmpPath: path + ".tmp"});

So my diagnose is that some code is hooking the "xpcom-shutdown" event and running some code that has already been run once cleaning up after itself so the following runs are causing the issue. The culprit seems to lie somewhere in the OS.File module so that's where I'll be looking next.
https://tbpl.mozilla.org/php/getParsedLog.php?id=21505860&tree=Mozilla-Inbound
Summary: Intermittent test_645970.js | application crashed [@ mozalloc_abort(char const*)] from ASSERTION: thread pool wasn't shutdown → Intermittent test_645970.js,test_logging.js | application crashed [@ mozalloc_abort(char const*)] from ASSERTION: thread pool wasn't shutdown
https://tbpl.mozilla.org/php/getParsedLog.php?id=21476508&tree=Mozilla-Inbound
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)

Comment 192

4 years ago
It looks like enabling the ECMAScript Internationalization API (part of SpiderMonkey, uses the big third-party ICU library) may make this failure permanent instead of intermittent on Windows - I see two complete sets of failures on two try runs:
April 9: https://tbpl.mozilla.org/?tree=Try&rev=4cc524a3f8ae
April 8: https://tbpl.mozilla.org/?tree=Try&rev=1fc6411a138d

I don't see these failures on earlier try runs:
April 4: https://tbpl.mozilla.org/?tree=Try&rev=7aa110e2dc60
so something else must have happened between April 4 and April 8 (the only differences on my side are restructured configure/make files).

To reproduce the failures, apply patches b2470ed032d1 and edfeec4def28 of the first-listed try run to current m-c.
(In reply to Gabriele Svelto [:gsvelto] from comment #183)
> I've gathered a few more clues about this failure and they point in a
> direction similar my previous experience:
> 
> - The assertion is triggered by executing a script (osfile_async_worker.js)
> as part of an xpcom-shutdown observer after ShutdownXPCOM() has been called
> 
> - This is causing a ScriptLoaderRunnable class to be instanced three times,
> it is from here that the nsStreamTransportService code is being called. The
> first time it runs, the |mPool| field is non-null and everything goes fine,
> this is during ShutdownXPCOM(). The second and third times it's in the
> "xpcom-shutdown" observer and that's where the assertion is being triggered
> 
> - In every case this is happening in a DOM worker thread, this is created
> here as part of the test:
> 
> https://mxr.mozilla.org/mozilla-central/source/toolkit/components/search/
> nsSearchService.js#2708
> 
>       let promise = OS.File.writeAtomic(path, data, { tmpPath: path +
> ".tmp"});
> 
> So my diagnose is that some code is hooking the "xpcom-shutdown" event and
> running some code that has already been run once cleaning up after itself so
> the following runs are causing the issue. The culprit seems to lie somewhere
> in the OS.File module so that's where I'll be looking next.

There is some code hooking web-workers-shutdown in OS.File: 
http://dxr.mozilla.org/mozilla-central/toolkit/components/osfile/osfile_async_front.jsm#l152

Could this be the cause?

Updated

4 years ago
Blocks: 853301
(In reply to Norbert Lindenberg from comment #192)
> It looks like enabling the ECMAScript Internationalization API (part of
> SpiderMonkey, uses the big third-party ICU library) may make this failure
> permanent instead of intermittent on Windows [...]

Interesting, I hadn't tried on Windows yet but I could reproduce it fairly consistently on Linux already.

(In reply to David Rajchenbach Teller [:Yoric] from comment #193)
> There is some code hooking web-workers-shutdown in OS.File: 
> http://dxr.mozilla.org/mozilla-central/toolkit/components/osfile/
> osfile_async_front.jsm#l152
> 
> Could this be the cause?

It might, thanks for the heads up.
Comment hidden (Treeherder Robot)

Comment 196

4 years ago
It seems to me that there are two separate bugs - the failure logs above are always either Linux/test_645970.js or Windows/test_logging.js, never Linux/test_logging.js or Windows/test_645970.js. The ones I've seen in my try runs were always of the Windows/test_logging.js variant. Note that test_logging.js is a new test which reached mozilla-central only on April 5.

Comment 197

4 years ago
Created attachment 736684 [details] [diff] [review]
Made test_logging.js correctly ignore messages it wants to ignore

Partial fix.
Attachment #736684 - Flags: review?(jonathan.laver)
Comment on attachment 736684 [details] [diff] [review]
Made test_logging.js correctly ignore messages it wants to ignore

Review of attachment 736684 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.
Comment hidden (Treeherder Robot)
Comment on attachment 736684 [details] [diff] [review]
Made test_logging.js correctly ignore messages it wants to ignore

Chiming in: rabbidrabbit doesn't have review rights on this part of the code, although he authored it.
Attachment #736684 - Flags: review?(jonathan.laver) → feedback?(jonathan.laver)
Comment on attachment 736684 [details] [diff] [review]
Made test_logging.js correctly ignore messages it wants to ignore

Review of attachment 736684 [details] [diff] [review]:
-----------------------------------------------------------------

Patch looks like it will fix something. I do not understand the cause of the crash well enough to know whether this will be sufficient, though.
Attachment #736684 - Flags: review+
(In reply to Gabriele Svelto [:gsvelto] from comment #183)
> - In every case this is happening in a DOM worker thread, this is created
> here as part of the test:
> 
> https://mxr.mozilla.org/mozilla-central/source/toolkit/components/search/
> nsSearchService.js#2708
> 
>       let promise = OS.File.writeAtomic(path, data, { tmpPath: path +
> ".tmp"});

Do I understand correctly that this code is executed upon shutdown?
If so, we have a weird race condition, because this is code that is meant to be executed during initialization.
Flags: needinfo?(jwalden+bmo)
(In reply to David Rajchenbach Teller [:Yoric] from comment #202)
> Do I understand correctly that this code is executed upon shutdown?
> If so, we have a weird race condition, because this is code that is meant to
> be executed during initialization.

No, it's more complicated and re-reading my comment I didn't describe the mechanism very well :-) This code only triggers the creation of a DOM Worker thread; then during shutdown this thread goes on to execute another script which is causing the problem. I was able to analyze the problem better and I found where the whole chain starts, it's here in RuntimeService::Cleanup():

https://mxr.mozilla.org/mozilla-central/source/dom/workers/RuntimeService.cpp#1127

Which is being called by an "xpcom-shutdown" observer (which is what I first saw when testing):

https://mxr.mozilla.org/mozilla-central/source/dom/workers/RuntimeService.cpp#1419

So for some reason the following is happening:

- OS.File starts up, creates a DOM worker and registers the "web-workers-shutdown" event
- During shutdown the "xpcom-shutdown" event is dispatched
- RuntimeService picks it up, calls Cleanup() which sends "web-workers-shutdown"
- osfile_async_front.jsm:webWorkersShutdownObserver() is invoked in response to the event
- This sends a "System_shutdown" message to the worker
- The shutdown code in osfile_async_worker.js is called, this is what's triggering the execution of ScriptLoaderRunnable

Now, the first time this runs it's fine; for some reason however it seems that the last four steps are repeated more than once causing the web worker to be shutdown more than once and that trips the assertion.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Created attachment 737865 [details] [diff] [review]
Unregistering the shutdown observer after the first "web-workers-shutdown"

Gabriele, if your analysis is correct (and it looks correct), this should fix the bug.
Attachment #737865 - Flags: review?(nfroyd)
Attachment #737865 - Flags: feedback?(yura.zenevich)
Flags: needinfo?(jwalden+bmo)
Comment hidden (Treeherder Robot)

Updated

4 years ago
Attachment #737865 - Flags: feedback?(yura.zenevich) → feedback+
Comment on attachment 737865 [details] [diff] [review]
Unregistering the shutdown observer after the first "web-workers-shutdown"

Review of attachment 737865 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/osfile/osfile_async_front.jsm
@@ +183,5 @@
> +  if (aTopic == WEB_WORKERS_SHUTDOWN_TOPIC) {
> +    // Failing to unregister can cause crashes in some tests
> +    // (probably not in production), see bug 845190
> +    Services.obs.removeObserver(webWorkersShutdownObserver, WEB_WORKERS_SHUTDOWN_TOPIC);
> +  }

I don't think you really need the commit; unregistering observers is pretty standard.

But why not unregister TEST_WEB_WORKERS_SHUTDOWN_TOPIC?  Actually, why is *registering* the observer on the test topic even done unconditionally in this file?

@@ +203,5 @@
>    });
>  };
>  
>  // Attaching an observer listening to the "web-workers-shutdown".
> +// The observer is removed once "web-workers-shutdown" is received.

No comment necessary.  The previous one could go too.
Attachment #737865 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #208)
> Comment on attachment 737865 [details] [diff] [review]
> Unregistering the shutdown observer after the first "web-workers-shutdown"
> 
> Review of attachment 737865 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/osfile/osfile_async_front.jsm
> @@ +183,5 @@
> > +  if (aTopic == WEB_WORKERS_SHUTDOWN_TOPIC) {
> > +    // Failing to unregister can cause crashes in some tests
> > +    // (probably not in production), see bug 845190
> > +    Services.obs.removeObserver(webWorkersShutdownObserver, WEB_WORKERS_SHUTDOWN_TOPIC);
> > +  }
> 
> I don't think you really need the commit; unregistering observers is pretty
> standard.

Ok with s/commit/comment/.

> But why not unregister TEST_WEB_WORKERS_SHUTDOWN_TOPIC?

Well, because I do not want to enforce the fact that we are only ever going to run a single test.

> Actually, why is
> *registering* the observer on the test topic even done unconditionally in
> this file?

Because, at this stage, we don't really have a way to determine whether we are being executed in a testing setting. I guess we could wrap this in a preference, but I would rather do this as a followup, if that's ok with you.

> 
> @@ +203,5 @@
> >    });
> >  };
> >  
> >  // Attaching an observer listening to the "web-workers-shutdown".
> > +// The observer is removed once "web-workers-shutdown" is received.
> 
> No comment necessary.  The previous one could go too.

Ok.
Comment on attachment 737865 [details] [diff] [review]
Unregistering the shutdown observer after the first "web-workers-shutdown"

(In reply to David Rajchenbach Teller [:Yoric] from comment #209)
> (In reply to Nathan Froyd (:froydnj) from comment #208)
> > I don't think you really need the commit; unregistering observers is pretty
> > standard.
> 
> Ok with s/commit/comment/.

Right, thank you!

> > But why not unregister TEST_WEB_WORKERS_SHUTDOWN_TOPIC?
> 
> Well, because I do not want to enforce the fact that we are only ever going
> to run a single test.
> 
> > Actually, why is
> > *registering* the observer on the test topic even done unconditionally in
> > this file?
> 
> Because, at this stage, we don't really have a way to determine whether we
> are being executed in a testing setting. I guess we could wrap this in a
> preference, but I would rather do this as a followup, if that's ok with you.

You know how much I like followup bugs. :)  That's fine.

r=me with the changes.
Attachment #737865 - Flags: feedback+ → review+

Updated

4 years ago
Blocks: 862570

Comment 211

4 years ago
The combination of the two patches available so far still does not fix the perma-orange on Windows with the internationalization API enabled:
https://tbpl.mozilla.org/?tree=Try&rev=e9a5f999214f
OS: Linux → All
Created attachment 738449 [details] [diff] [review]
Unregistering the shutdown observer after the first "web-workers-shutdown", v2

Try: https://tbpl.mozilla.org/?tree=Try&rev=9a94ce2801fe
Attachment #737865 - Attachment is obsolete: true
I've tried applying the patch and the test passes correctly however now I see this when running on Linux:

 0:00.40 file:///home/gsvelto/projects/build/firefox-debug/dist/bin/components/nsHandlerService.js:891: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProperties.get]
 0:00.40 JS Component Loader: ERROR /home/gsvelto/projects/mozilla-central/testing/xpcshell/head.js:941
 0:00.40                      TypeError: Components is undefined
 0:00.40 file:///home/gsvelto/projects/build/firefox-debug/dist/bin/components/nsHandlerService.js:891: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProperties.get]
 0:00.40 JS Component Loader: ERROR /home/gsvelto/projects/mozilla-central/testing/xpcshell/head.js:944
 0:00.40                      TypeError: Components is undefined
 0:00.40 LoadPlugin() /home/gsvelto/projects/build/firefox-debug/dom/plugins/test/testplugin/libnptest.so returned 7fb2a3bde8b0
 0:00.40 LoadPlugin() /home/gsvelto/projects/build/firefox-debug/dom/plugins/test/testplugin/secondplugin/libnpsecondtest.so returned 7fb2a7faad30
 0:00.40 WARNING: NS_ENSURE_TRUE(mPool) failed: file /home/gsvelto/projects/mozilla-central/netwerk/base/src/nsStreamTransportService.cpp, line 465
 0:00.40 WARNING: unable to post continuation event: file /home/gsvelto/projects/mozilla-central/xpcom/io/nsStreamUtils.cpp, line 438
 0:00.40 WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0xC1F30001: file /home/gsvelto/projects/mozilla-central/dom/workers/ScriptLoader.cpp, line 323
 0:00.40 WORKER ERROR DURING SETUP Error: Failed to load script: resource://gre/modules/osfile.jsm (nsresult = 0xc1f30001)
 0:00.40 WORKER ERROR DETAIL @resource://gre/modules/osfile/osfile_async_worker.js:16
 0:00.40 @resource://gre/modules/osfile/osfile_async_worker.js:392

Breaking-in with GDB shows that the issue is different, yet a I can still see a worker thread been spun up during shutdown which is then torn down and triggers the warning.
No longer blocks: 862570
Depends on: 862570
The patches are being collapsed into bug 862570.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)

Updated

4 years ago
Attachment #736684 - Flags: checkin?(jwalden+bmo)
Comment hidden (Treeherder Robot)
Comment on attachment 736684 [details] [diff] [review]
Made test_logging.js correctly ignore messages it wants to ignore

https://hg.mozilla.org/integration/mozilla-inbound/rev/8a2536988191
Attachment #736684 - Flags: checkin?(jwalden+bmo)
https://hg.mozilla.org/mozilla-central/rev/8a2536988191
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Cc'ing Taras and Dougt for any insights, also for point of view on disabling this test and opening a bug to diagnose and re-enable. That would let Waldo land the JS Intl code, so we can make fwd progress there.

/be
We expected that bug 862570 would fix the issue. It seems that this is not the case.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(Reporter)

Comment 234

4 years ago
And this explosion of instances is from yet another JS patch, https://hg.mozilla.org/integration/mozilla-inbound/rev/75699d465ba0 from bug 864085, making it permaorange on Windows.

I might be making the wrong call, but I think it's time to disable it.
(Reporter)

Comment 235

4 years ago
Disabled on Windows debug in https://hg.mozilla.org/integration/mozilla-inbound/rev/65463f2f7ba9
Whiteboard: [test disabled on Linux debug][leave open] → [test disabled on Linux debug][test_logging.js disabled on Windows debug][leave open]
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
I'm currently neck deep in bug 838577, but as soon as I have some respite (hopefully this week), I'll resume investigating this bug.
Comment hidden (Treeherder Robot)
https://hg.mozilla.org/mozilla-central/rev/65463f2f7ba9
I'm also stuck fixing hard-to-reproduce B2G crashes but as soon as I'm done I'll be trying to further the investigation I was doing in comment 213. I'm still convinced that the cause is some code that is spawning a worker thread while shutdown is already undergoing. Last time I hit an issue like this it was because of a non-initialized subsystem being shut down but this time it looks different.
Comment hidden (Treeherder Robot)
I've had some time to investigate this further and I can confirm two things:

- The issue is the same on Windows and Linux so this is not platform-specific

- The assertion is triggered by the following chain: ShutdownXPCOM() sends an xpcom-shutdown event to all observers and then spins the event loop with NS_ProcessPendingEvents() to let them run. During this process a new worker thread is spawned; afterwards the xpcom-shutdown-threads event is sent and at this stage RuntimeService::Cleanup() spins the event loop again causing the worker to run and fail with the assertion.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Note: Comment 289 onwards was caused by the landing of bug 860965, which has since been backed out.
Created attachment 748808 [details] [diff] [review]
Better handling of attempts to initialize the OS.File worker thread during shutdown.

Ok, I think that I got it, thanks to comment 274.

I believe that the problem is due to the lazy initialization of worker threads. In a few situations, we end up sending our first message to the OS.File worker thread during shutdown.

In both cases, we import osfile.jsm (which registers a shutdown observer which sends a message to the worker thread to collect information on resource leaks) but never actually perform any I/O (hence never initialize the worker thread before shutdown).

This patch does two things:
- skip this reporting if the worker thread has never been initialized (which is semantically correct - a worker that has never been initialized cannot leak file descriptors);
- throw an error if we attempt to send any other message during shutdown to a yet uninitialized OS.File worker thread.
Attachment #736684 - Attachment is obsolete: true
Attachment #738449 - Attachment is obsolete: true
Attachment #736684 - Flags: feedback?(jonathan.laver)
Attachment #748808 - Flags: review?(nfroyd)
Comment on attachment 748808 [details] [diff] [review]
Better handling of attempts to initialize the OS.File worker thread during shutdown.

Review of attachment 748808 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/osfile/osfile_async_front.jsm
@@ +241,5 @@
>      if (msg) {
>        LOG("WARNING: File descriptors leaks detected.\n" + msg);
>      }
>    });
> +  gShuttingDown = true;

Setting gShuttingDown should be moved further up, before the check of Scheduler.initialized, right?  Otherwise you're never going to hit the error case you added in Scheduler.post.  Or maybe it should even be moved into the conditional for WEB_WORKERS_SHUTDOWN_TOPIC; I don't really understand why this function would have code specially for WEB_WORKERS_SHUTDOWN_TOPIC and then separate code for every other topic.
(In reply to David Rajchenbach Teller [:Yoric] from comment #311)
> Ok, I think that I got it, thanks to comment 274.
> 
> I believe that the problem is due to the lazy initialization of worker
> threads. In a few situations, we end up sending our first message to the
> OS.File worker thread during shutdown.

From what I could see the problem is even deeper. It seems that by the time we hit ShutdownXPCOM() in this test we haven't even yet loaded the worker's JS code. Armed with a lot of patience and stepping carefully through the code I managed to unravel this chain of events:

- nsSearchService.js tries to do an atomic write immediately after the test has executed:

https://mxr.mozilla.org/mozilla-central/source/toolkit/components/search/nsSearchService.js#2703

- This causes the file handling code to spawn a new DOM Worker in _PromiseWorker.jsm, this is done by posting a runnable to via RuntimeService::ScheduleWorker() (this all happens in the main thread)

https://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/_PromiseWorker.jsm#82

- The test finishes and then the WorkerThreadRunnable gets a chance to run and creates the worker. The worker now runs and tries to load the script it needs to execute by invoking LoadWorkerScript() this in turn sends a ScriptLoaderRunnable to execute in the main thread.

- Before the ScriptLoaderRunnable runnable gets a chance to run, the test quits and calls ShutdownXPCOM() which notifies all xpcom-shutdown observers and then spins the event loop here:

https://mxr.mozilla.org/mozilla-central/source/xpcom/build/nsXPComInit.cpp#583

- At this stage the ScriptLoaderRunnable starts to run and tries to load the worker's JS code but it's now already too late. It will fail in multiple places which you can see in the log if you run the test, this is the first related error that crops up:

file:///home/gsvelto/projects/build/firefox-debug/dist/bin/components/nsHandlerService.js:895: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProperties.get]
JS Component Loader: ERROR /home/gsvelto/projects/mozilla-central/testing/xpcshell/head.js:957
                     TypeError: Components is undefined

What's causing it is the system trying to establish the MIME type of the source file which is being loaded for the worker. Things basically go south from here but the actual problem is that the worker's JS code has not been loaded yet before we hit ShutdownXPCOM(), so this is just a side-effect.

Comment 314

4 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #310)
> Note: Comment 289 onwards was caused by the landing of bug 860965, which has
> since been backed out.

I'm not too familiar with the XPCOM side of things -- can someone shed light on why landing a patch that only shuffles some self-hosted JavaScript code around causes this particular xpcshell failure? I suppose it changed event loop timings?
(In reply to Shu-yu Guo [:shu] from comment #314)
> I'm not too familiar with the XPCOM side of things -- can someone shed light
> on why landing a patch that only shuffles some self-hosted JavaScript code
> around causes this particular xpcshell failure? I suppose it changed event
> loop timings?

Yes, it likely changes the order in which posted events get executed. I'm preparing a patch that tries to drain the event queue more thoroughly before shutting down to see if it fixes the problem.
(In reply to Gabriele Svelto [:gsvelto] from comment #313)
> (In reply to David Rajchenbach Teller [:Yoric] from comment #311)
> > Ok, I think that I got it, thanks to comment 274.
> > 
> > I believe that the problem is due to the lazy initialization of worker
> > threads. In a few situations, we end up sending our first message to the
> > OS.File worker thread during shutdown.
> 
> From what I could see the problem is even deeper. It seems that by the time
> we hit ShutdownXPCOM() in this test we haven't even yet loaded the worker's
> JS code. Armed with a lot of patience and stepping carefully through the
> code I managed to unravel this chain of events:

Yes, that's what I meant with "lazy initialization initialization of worker threads." – namely OS.File's IO worker thread. My patch should raise an asynchronous error whenever this takes place in OS.File.
Created attachment 748965 [details] [diff] [review]
Better handling of attempts to initialize the OS.File worker thread during shutdown, v2

Let's simplify that code a little.
Attachment #748808 - Attachment is obsolete: true
Attachment #748808 - Flags: review?(nfroyd)
Attachment #748965 - Flags: review?(nfroyd)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)

Updated

4 years ago
Blocks: 860965
(In reply to David Rajchenbach Teller [:Yoric] from comment #316)
> Yes, that's what I meant with "lazy initialization initialization of worker
> threads." – namely OS.File's IO worker thread. My patch should raise an
> asynchronous error whenever this takes place in OS.File.

Yes but the problem here is that the post() command happens before shutdown commences so it doesn't trigger your additional check. The creation of the worker is what's triggering the problem as it fires the ScriptLoaderRunnable that will then run after shutdown has commenced.

I've been looking through the code under dom/workers and I can't seem to find a simple solution to prevent the problem from happening. I think we need to modify RuntimeService::Cleanup() (which runs in response to xpcom-shutdown) in such a way that it ensures that all loader runnables have finished executing before returning. It might that this would also fail if other xpcom-shutdown listeners have already shut themselves down in which case we would need to cancel all ScriptLoaderRunnables before they get to run.
More debugging and more pinpointing... turns out that I had partially misinterpreted the sequence of events. The osfile_async_workers.js source is loaded properly *before* shutdown has initiated but gets executed *afterwards*. Once it starts being executed it will try to import more JS source code here:

https://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/osfile_async_worker.js#16

This in turn creates a new ScriptLoaderRunnable to load the code which then fails.
Mmmh...
This wouldn't be the first time that ScriptLoaderRunnable causes mysterious errors. In this case, my patch probably won't help.
(In reply to David Rajchenbach Teller [:Yoric] from comment #322)
> Mmmh...
> This wouldn't be the first time that ScriptLoaderRunnable causes mysterious
> errors. In this case, my patch probably won't help.

No, and the bigger issue here is that a worker thread can invoke importScripts() recursively so there's no way to ensure that loading a piece of code won't trigger loading another one unless the worker is completely stopped/canceled/killed. Ultimately one of the runnables might end up beyond the ShutdownXPCOM() call and wreak havoc. Unfortunately I'm not sufficiently familiar with this code to come up with a way to change the shutdown procedure and prevent this; I think we need a DOM worker guru here.
Comment hidden (Treeherder Robot)

Comment 325

4 years ago
Is there a workaround at least, or do I have to wait for this to be fixed until I try to reland my stuff?
Component: Search → DOM: Workers
Product: Firefox → Core
It's not clear to me what the problem is here. From comment 323:

> Ultimately one of the runnables might end up beyond the ShutdownXPCOM() call
> and wreak havoc.

This shouldn't be possible as RuntimeService::Cleanup won't return until all worker code is finished and all worker threads joined (and a full flush of the main thread's event queue as well).
Comment on attachment 748965 [details] [diff] [review]
Better handling of attempts to initialize the OS.File worker thread during shutdown, v2

Clearing review in light of the discussion; please reset if this looks like the right fix after all.
Attachment #748965 - Flags: review?(nfroyd)
(In reply to ben turner [:bent] from comment #326)
> This shouldn't be possible as RuntimeService::Cleanup won't return until all
> worker code is finished and all worker threads joined (and a full flush of
> the main thread's event queue as well).

That's what I also thought but somehow it isn't the case here. I can try to break in RuntimeService::Cleanup() and see what's going on there. Is it possible that only the ScriptLoaderRunnable might have "survived" the shutdown procedure while the actual worker was shut down?
No, not really. The worker can't complete until all pending ScriptLoaderRunnables are finished.
(In reply to ben turner [:bent] from comment #329)
> No, not really. The worker can't complete until all pending
> ScriptLoaderRunnables are finished.

I see... that sounds reasonable. I'll go through the test one more time with GDB and I'll post a full trace of what I'm seeing in the hope that it sheds some light on what's going on here.
(In reply to ben turner [:bent] from comment #329)
> No, not really. The worker can't complete until all pending
> ScriptLoaderRunnables are finished.

Your comment pointed me in the right direction. I've checked and while that ScriptLoaderRunnable is running in the DOM worker thread the main thread is waiting in RuntimeService::Cleanup() for the worker to stop so that part is completely fine.

I then went ahead and checked the order of shutdown of the various components and found out that nsStreamTransportService is shut down before that ScriptLoaderRunnable gets a chance to run in response to an xpcom-shutdown-threads event, see here:

https://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsStreamTransportService.cpp#513

I assume this does not happen in general, but in this specific test the nsStreamTransportService gets the xpcom-shutdown-threads before it reaches RuntimeService and thus the latter will clean up after itself "too late". Considering that I'm seeing more errors besides the assertion it might be that other important subsystems have already been shut down by the time the worker thread gets to run its code.
Hm. That's an interesting theory, but then if the stream transport is cleaning itself up I don't see how we could be hitting the assertion that this bug was filed for...
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
For these specific crashes, I could put a wallpaper bug that initializes OS.File in the test early enough that we don't see the crash. Would this be any useful?
Comment hidden (Treeherder Robot)
(In reply to David Rajchenbach Teller [:Yoric] from comment #352)
> For these specific crashes, I could put a wallpaper bug that initializes
> OS.File in the test early enough that we don't see the crash. Would this be
> any useful?

I think that might fix this specific problem; a lot of other similar tests use OS.File and work just fine. We should then keep track in another bug about the root issue (web workers shutdown sequence).
Comment hidden (Treeherder Robot)
I still don't think we understand the root problem here...

If what you said in comment 331 is true ("nsStreamTransportService is shut down before that ScriptLoaderRunnable gets a chance to run") then we shouldn't see the assertion that we're apparently hitting here ("ASSERTION: thread pool wasn't shutdown").
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to ben turner [:bent] from comment #356)
> I still don't think we understand the root problem here...
> 
> If what you said in comment 331 is true ("nsStreamTransportService is shut
> down before that ScriptLoaderRunnable gets a chance to run") then we
> shouldn't see the assertion that we're apparently hitting here ("ASSERTION:
> thread pool wasn't shutdown").

The problem is that something instantiates (GetService'es) the stream transport service AFTER or INSIDE xpcom-shutdown-threads.  The stream transport service then cannot shutdown, it can never get xpcom-shutdown-threads notification, since it's behind it.
As a side-note, this is the kind of situation for which we first discussed replacing observer events by promises in some cases: being able to watch for an event after it had occurred.
(In reply to Honza Bambas (:mayhemer) from comment #359)
> The problem is that something instantiates (GetService'es) the stream
> transport service AFTER or INSIDE xpcom-shutdown-threads.  The stream
> transport service then cannot shutdown, it can never get
> xpcom-shutdown-threads notification, since it's behind it.

Yes, that's happening in RuntimeService::Cleanup() which is waiting for workers to terminate. Since the OS.File worker hasn't even been loaded yet it spins up and starts requesting services while inside xpcom-shutdown-threads which fails.
I'm not 100% sure but the worker code should be able to gracefully fail if the StreamTransportService is not available. It sounds to me like the real fix here should be to make sure that the StreamTransportService does not try to resurrect itself after it has already been shut down.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)

Comment 365

4 years ago
A gentle reminder that this blocks me from landing self-hosted JS code, and so we should fix this soon. :)
Created attachment 756933 [details] [diff] [review]
Wallpaper patch

Since this prevents other work from landing, here's a wallpaper patch.

Try: https://tbpl.mozilla.org/?tree=Try&rev=e88535999673
Attachment #756933 - Flags: review?(nfroyd)
Sorry, wrong try link.
Try: https://tbpl.mozilla.org/?tree=Try&rev=0161adac5a44
Comment on attachment 756933 [details] [diff] [review]
Wallpaper patch

Review of attachment 756933 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think this is the right fix.  Ben's solution from comment 362 sounds reasonable and should be ~10 lines of code, maybe less.
Attachment #756933 - Flags: review?(nfroyd)
> I don't think this is the right fix.  Ben's solution from comment 362 sounds
> reasonable and should be ~10 lines of code, maybe less.

Ok, this will be a larger wallpaper fix. I'll give it a try.
Created attachment 757107 [details] [diff] [review]
Preventing nsStreamTransportService from re-initializing

Is that what you had in mind, Ben?

Try: https://tbpl.mozilla.org/?tree=Try&rev=4e4faf4b5945
Attachment #756933 - Attachment is obsolete: true
Attachment #757107 - Flags: feedback?(bent.mozilla)
Comment on attachment 757107 [details] [diff] [review]
Preventing nsStreamTransportService from re-initializing

Review of attachment 757107 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/src/nsStreamTransportService.cpp
@@ +514,5 @@
>  
>    if (mPool) {
>      mPool->Shutdown();
>      mPool = nullptr;
> +    mHasBeenShutdown = true;

I think you want to set this variable unconditionally.
Created attachment 757126 [details] [diff] [review]
Preventing nsStreamTransportService from re-initializing, v2

Good point.
Attachment #757107 - Attachment is obsolete: true
Attachment #757107 - Flags: feedback?(bent.mozilla)
Attachment #757126 - Flags: feedback?(bent.mozilla)
Comment on attachment 757126 [details] [diff] [review]
Preventing nsStreamTransportService from re-initializing, v2

Review of attachment 757126 [details] [diff] [review]:
-----------------------------------------------------------------

Wouldn't a resurrected nsStreamTransportService actually be a new instance? You have to keep track of the shutdown flag statically.
Attachment #757126 - Flags: feedback?(bent.mozilla) → feedback-
Created attachment 757152 [details] [diff] [review]
Preventing nsStreamTransportService from re-initializing, v3

Good point, too.
Attachment #757126 - Attachment is obsolete: true
Attachment #757152 - Flags: review?(cbiesinger)
Comment hidden (Treeherder Robot)
Created attachment 757178 [details] [diff] [review]
Preventing nsStreamTransportService from re-initializing, v4

Sorry, fixed typo.
Attachment #757152 - Attachment is obsolete: true
Attachment #757152 - Flags: review?(cbiesinger)
Attachment #757178 - Flags: review?
Attachment #757178 - Flags: review? → review?(cbiesinger)