1264694 - Frequent WinXP debug e10s test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] on Ash

Reporter

Description

•

9 years ago

I think we have some other known media shutdown issues right now, so this may be a dupe. Maybe bug 1245574 is related? https://treeherder.mozilla.org/logviewer.html#?job_id=166879&repo=ash 06:22:25 INFO - 37 INFO TEST-START | Shutdown 06:22:25 INFO - 38 INFO Passed: 96 06:22:25 INFO - 39 INFO Failed: 0 06:22:25 INFO - 40 INFO Todo: 1 06:22:25 INFO - 41 INFO Mode: e10s 06:22:25 INFO - 42 INFO Slowest: 4873ms - /tests/dom/media/mediasource/test/test_SplitAppendDelay.html 06:22:25 INFO - 43 INFO SimpleTest FINISHED 06:22:25 INFO - 44 INFO TEST-INFO | Ran 1 Loops 06:22:25 INFO - 45 INFO SimpleTest FINISHED 06:22:25 WARNING - TEST-UNEXPECTED-FAIL | dom/media/mediasource/test/test_TruncatedDuration.html | application terminated with exit code 1 06:22:25 INFO - runtests.py | Application ran for: 0:01:38.093000 06:22:25 INFO - zombiecheck | Reading PID log: c:\docume~1\cltbld~1.t-x\locals~1\temp\tmprbnmbxpidlog 06:22:25 INFO - ==> process 824 launched child process 2152 ("C:\slave\test\build\application\firefox\plugin-container.exe" --channel="824.0.934698618\1414353303" -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -sandbox -appdir "C:\slave\test\build\application\firefox\browser" 824 "\\.\pipe\gecko-crash-server-pipe.824" tab) 06:22:25 INFO - ==> process 824 launched child process 2176 ("C:\slave\test\build\application\firefox\plugin-container.exe" --channel="824.5.1207940992\1022539989" -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -sandbox -appdir "C:\slave\test\build\application\firefox\browser" 824 "\\.\pipe\gecko-crash-server-pipe.824" tab) 06:22:25 INFO - zombiecheck | Checking for orphan process with PID: 2152 06:22:25 INFO - zombiecheck | Checking for orphan process with PID: 2176 06:22:25 INFO - mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe 
c:\docume~1\cltbld~1.t-x\locals~1\temp\tmp4ssegv.mozrunner\minidumps\77a012f8-ac59-4f94-9a98-0daa12c45fe2.dmp C:\slave\test\build\symbols 06:22:39 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\77a012f8-ac59-4f94-9a98-0daa12c45fe2.dmp 06:22:39 INFO - mozcrash Saved app info as C:\slave\test\build\blobber_upload_dir\77a012f8-ac59-4f94-9a98-0daa12c45fe2.extra 06:22:39 WARNING - PROCESS-CRASH | dom/media/mediasource/test/test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] 06:22:39 INFO - Crash dump filename: c:\docume~1\cltbld~1.t-x\locals~1\temp\tmp4ssegv.mozrunner\minidumps\77a012f8-ac59-4f94-9a98-0daa12c45fe2.dmp 06:22:39 INFO - Operating system: Windows NT 06:22:39 INFO - 5.1.2600 Service Pack 3 06:22:39 INFO - CPU: x86 06:22:39 INFO - GenuineIntel family 6 model 30 stepping 5 06:22:39 INFO - 8 CPUs 06:22:39 INFO - Crash reason: EXCEPTION_BREAKPOINT 06:22:39 INFO - Crash address: 0x58284f7 06:22:39 INFO - Process uptime: 98 seconds 06:22:39 INFO - Thread 41 (crashed) 06:22:39 INFO - 0 xul.dll!mozilla::`anonymous namespace'::RunWatchdog [nsTerminator.cpp:91115264629d : 158 + 0x22] 06:22:39 INFO - eip = 0x058284f7 esp = 0x19bfff44 ebp = 0x19bfff4c ebx = 0x00000000 06:22:39 INFO - esi = 0x0000009e edi = 0x0000003f eax = 0x07170998 ecx = 0x0052705d 06:22:39 INFO - edx = 0x00576140 efl = 0x00000206 06:22:39 INFO - Found by: given as instruction pointer in context 06:22:39 INFO - 1 nss3.dll!_PR_NativeRunThread [pruthr.c:91115264629d : 397 + 0x6] 06:22:39 INFO - eip = 0x00bffd72 esp = 0x19bfff54 ebp = 0x19bfff6c 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 2 nss3.dll!pr_root [w95thred.c:91115264629d : 95 + 0xa] 06:22:39 INFO - eip = 0x00bf3d56 esp = 0x19bfff74 ebp = 0x19bfff78 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 3 ucrtbase.dll!_crt_at_quick_exit + 0x104 06:22:39 INFO - eip = 0x005262a4 esp = 0x19bfff80 ebp = 0x19bfffb4 06:22:39 INFO - Found by: call frame info 
06:22:39 INFO - 4 kernel32.dll!BaseThreadStart + 0x37 06:22:39 INFO - eip = 0x7c80b713 esp = 0x19bfffbc ebp = 0x19bfffec 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - Thread 0 06:22:39 INFO - 0 ntdll.dll!KiFastSystemCallRet + 0x0 06:22:39 INFO - eip = 0x7c90e4f4 esp = 0x0012f92c ebp = 0x0012f990 ebx = 0x0090f9ec 06:22:39 INFO - esi = 0x00000710 edi = 0x00000000 eax = 0x00002500 ecx = 0x0012f738 06:22:39 INFO - edx = 0x00002525 efl = 0x00000246 06:22:39 INFO - Found by: given as instruction pointer in context 06:22:39 INFO - 1 ntdll.dll!ZwWaitForSingleObject + 0xc 06:22:39 INFO - eip = 0x7c90df3c esp = 0x0012f930 ebp = 0x0012f990 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 2 kernel32.dll!WaitForSingleObjectEx + 0x8b 06:22:39 INFO - eip = 0x7c8025db esp = 0x0012f934 ebp = 0x0012f990 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 3 kernel32.dll!WaitForSingleObject + 0x12 06:22:39 INFO - eip = 0x7c802542 esp = 0x0012f998 ebp = 0x0012f9a4 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 4 nss3.dll!_PR_MD_WAIT_CV [w95cv.c:91115264629d : 248 + 0xf] 06:22:39 INFO - eip = 0x00bf1ddf esp = 0x0012f9ac ebp = 0x0012f9c0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 5 nss3.dll!_PR_WaitCondVar [prucv.c:91115264629d : 172 + 0x1a] 06:22:39 INFO - eip = 0x00bfe654 esp = 0x0012f9c8 ebp = 0x0012f9e0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 6 nss3.dll!PR_WaitCondVar [prucv.c:91115264629d : 525 + 0xb] 06:22:39 INFO - eip = 0x00bfe1c9 esp = 0x0012f9e8 ebp = 0x0012fa04 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 7 xul.dll!mozilla::CondVar::Wait(unsigned int) [BlockingResourceBase.cpp:91115264629d : 501 + 0x9] 06:22:39 INFO - eip = 0x0338c9ee esp = 0x0012fa0c ebp = 0x0012fa24 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 8 xul.dll!nsEventQueue::GetEvent(bool,nsIRunnable * *,mozilla::BaseAutoLock<mozilla::Mutex> &) [nsEventQueue.cpp:91115264629d : 55 + 0xa] 06:22:39 INFO - eip = 0x03362945 
esp = 0x0012fa2c ebp = 0x0012fa38 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 9 xul.dll!nsThread::ProcessNextEvent(bool,bool *) [nsThread.cpp:91115264629d : 984 + 0x23] 06:22:39 INFO - eip = 0x03365421 esp = 0x0012fa40 ebp = 0x0012fb30 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 10 xul.dll!NS_ProcessNextEvent(nsIThread *,bool) [nsThreadUtils.cpp:91115264629d : 290 + 0xd] 06:22:39 INFO - eip = 0x03394b46 esp = 0x0012fb38 ebp = 0x0012fb44 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 11 xul.dll!mozilla::layers::CompositorBridgeParent::ShutDown() [CompositorBridgeParent.cpp:91115264629d : 648 + 0x9] 06:22:39 INFO - eip = 0x03e27a0e esp = 0x0012fb4c ebp = 0x0012fb80 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 12 xul.dll!mozilla::ShutdownXPCOM(nsIServiceManager *) [XPCOMInit.cpp:91115264629d : 872 + 0x5] 06:22:39 INFO - eip = 0x0338c0ca esp = 0x0012fb5c ebp = 0x0012fb80 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 13 xul.dll!ScopedXPCOMStartup::~ScopedXPCOMStartup() [nsAppRunner.cpp:91115264629d : 1466 + 0x7] 06:22:39 INFO - eip = 0x058346b0 esp = 0x0012fb88 ebp = 0x0012fba0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 14 xul.dll!mozilla::DefaultDelete<ScopedXPCOMStartup>::operator()(ScopedXPCOMStartup *) [UniquePtr.h:91115264629d : 528 + 0xe] 06:22:39 INFO - eip = 0x05834bcc esp = 0x0012fba8 ebp = 0x0012fba8 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 15 xul.dll!XREMain::XRE_main(int,char * * const,nsXREAppData const *) [nsAppRunner.cpp:91115264629d : 4463 + 0x12] 06:22:39 INFO - eip = 0x0583a26d esp = 0x0012fbb0 ebp = 0x0012fbd0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 16 xul.dll!XRE_main [nsAppRunner.cpp:91115264629d : 4543 + 0x12] 06:22:39 INFO - eip = 0x0583cd95 esp = 0x0012fbd8 ebp = 0x0012fce8 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 17 firefox.exe!do_main [nsBrowserApp.cpp:91115264629d : 220 + 0x1c] 06:22:39 INFO - eip = 0x00402884 esp 
= 0x0012fcf0 ebp = 0x0012fe88 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 18 firefox.exe!NS_internal_main(int,char * *,char * *) [nsBrowserApp.cpp:91115264629d : 360 + 0xf] 06:22:39 INFO - eip = 0x00402180 esp = 0x0012fe90 ebp = 0x0012ff34 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 19 firefox.exe!wmain [nsWindowsWMain.cpp:91115264629d : 135 + 0xe] 06:22:39 INFO - eip = 0x00402cd8 esp = 0x0012ff3c ebp = 0x0012ff74 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 20 firefox.exe!__scrt_common_main_seh [exe_common.inl : 264 + 0x1d] 06:22:39 INFO - eip = 0x004055ba esp = 0x0012ff7c ebp = 0x0012ffc0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 21 kernel32.dll!BaseProcessStart + 0x23 06:22:39 INFO - eip = 0x7c817067 esp = 0x0012ffc8 ebp = 0x0012fff0 06:22:39 INFO - Found by: call frame info

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 1

•

9 years ago

Oh, this is probably a dupe of bug 1261571, which also spiked around the same time. Many of the tests in the mediasource directory are skipped on XP due to lack of MP4 support.

Comment 2

•

9 years ago

Rather, bug 1264082 is the one that has spiked in a similar timeframe.

Comment 3

•

9 years ago

Anthony/Nical - looks like playback and/or gfx (times out while trying to shut down the Compositor):

06:22:39 INFO - 11 xul.dll!mozilla::layers::CompositorBridgeParent::ShutDown() [CompositorBridgeParent.cpp:91115264629d : 648 + 0x9]

Flags: needinfo?(nical.bugzilla)

Flags: needinfo?(ajones)

Comment hidden (Intermittent Failures Robot)

Nicolas Silva [:nical]

Comment 6

•

9 years ago

Fixing bug 1262898 would certainly help a lot here. Basically, there are CompositorThreadHolders that are not destroyed in time, and CompositorBridgeParent::ShutDown spins the event loop until all CompositorThreadHolders are gone. CompositorThreadHolders are typically kept alive by top-level IPDL protocols like CompositorBridgeParent, ImageBridgeParent, etc., which need the compositor thread to be alive to operate. Ensuring that these get shut down in time should fix this. In the case of the two protocols I mentioned, we need their ActorDestroy hook to run, which happens when the channel is closed, either because the other side closed it properly or because the child process was killed.
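The shutdown dependency described in this comment can be sketched as a small single-threaded model. This is only an illustration under stated assumptions: `EventLoop`, `ThreadHolder`, `Protocol` and this `ShutDown` function are hypothetical stand-ins, not the Gecko classes, and the model returns `false` at the point where the real loop would block.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for the main-thread event queue.
struct EventLoop {
  std::deque<std::function<void()>> events;

  void Dispatch(std::function<void()> event) {
    events.push_back(std::move(event));
  }

  // Runs one queued event. Returns false when nothing is queued; a real
  // loop would block here instead.
  bool ProcessNextEvent() {
    if (events.empty()) return false;
    std::function<void()> event = std::move(events.front());
    events.pop_front();
    event();
    return true;
  }
};

// Hypothetical stand-in for the thread holder object.
struct ThreadHolder {};

// Hypothetical top-level protocol. It keeps the holder alive until its
// ActorDestroy hook runs (that is, until its channel closes).
struct Protocol {
  std::shared_ptr<ThreadHolder> holder;

  void ActorDestroy() {
    assert(holder && "ActorDestroy is expected to run once");
    holder.reset();
  }
};

// Spins the loop until every reference to the holder is gone. Returns false
// if the queue runs dry first, which is where a real shutdown would hang.
bool ShutDown(EventLoop& loop, const std::weak_ptr<ThreadHolder>& holder) {
  while (!holder.expired()) {
    if (!loop.ProcessNextEvent()) return false;
  }
  return true;
}
```

In this model, a protocol whose ActorDestroy event never gets queued leaves `ShutDown` stuck, which matches the symptom on the main thread in the crash report.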

Depends on: 1262898

Flags: needinfo?(nical.bugzilla)

Sotaro Ikeda [:sotaro]

Comment 7

•

9 years ago

Bug 1261571 is also waiting in CompositorBridgeParent::ShutDown(), but it is on Linux.

Sotaro Ikeda [:sotaro]

Comment 8

•

9 years ago

Bug 1264082 was on OS X.

Comment hidden (Intermittent Failures Robot)

Jim Mathies [:jimm]

Comment 11

•

9 years ago

looks fixed. https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1264694&startday=2016-04-11&endday=2016-04-17&tree=all

Flags: needinfo?(ajones) → needinfo?(ryanvm)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 12

•

9 years ago

Nope. https://treeherder.mozilla.org/#/jobs?repo=ash&filter-searchStr=xp%20debug%20mda

Flags: needinfo?(ryanvm)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 13

•

9 years ago

It's actually worse now. We've got an additional |PROCESS-CRASH | dom/media/tests/mochitest/test_zmedia_cleanup.html | application crashed [@ RtlpWaitForCriticalSection + 0x5b]| as well these days.

Jim Mathies [:jimm]

Updated

•

9 years ago

tracking-e10s: ? → +

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 14

•

9 years ago

The frequency does appear to be lower now, however. It's failing ~30% instead of closer to 90% of the time.

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

•

9 years ago

Comment 15

•

9 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #13)
> It's actually worse now. We've got an additional |PROCESS-CRASH |
> dom/media/tests/mochitest/test_zmedia_cleanup.html | application crashed [@
> RtlpWaitForCriticalSection + 0x5b]| as well these days.

I've spun this off to bug 1268332.

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

•

9 years ago

Summary: Nearly permafail WinXP debug test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] on Ash → Frequent WinXP debug e10s test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] on Ash

Comment hidden (Intermittent Failures Robot)

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

•

9 years ago

Whiteboard: [e10s-orangeblockers]

Anthony Jones (:ajones, :kentuckyfriedtakahe, :k17e)

Comment 17

•

9 years ago

Blake - does JW have time to look into this?

Flags: needinfo?(bwu)

JW Wang [:jwwang] [:jw_wang]

Comment 18

•

9 years ago

In my queue.

Assignee: nobody → jwwang

Flags: needinfo?(bwu)

Bas Schouten (:bas.schouten)

Comment 19

•

9 years ago

I've done try pushes with a lot of logging, but it's not entirely clear what's going on here. We really need to get the stack of the child process when it dies. See also: https://treeherder.mozilla.org/logviewer.html#?job_id=20537210&repo=try#L6273

Depends on: 1270172

Bas Schouten (:bas.schouten)

Comment 20

•

9 years ago

I've created some code that will crash the child process when it's hung. We can see that this is indeed executing properly. The main process now exits successfully after 30 seconds of the child process being hung, and we're getting stack traces for the child process. I can't pretend to understand yet what's going on inside the child process:

https://treeherder.mozilla.org/logviewer.html#?job_id=20691428&repo=try#L6387
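The patch itself is not shown in this bug, so the following is only a rough sketch of the general idea of a hang detector that fires after a fixed period of silence. `HangWatchdog` and its tick-based interface are hypothetical; a once-per-second tick is assumed so that a timeout of 30 corresponds to the 30 seconds mentioned above.

```cpp
#include <cassert>

// Hypothetical hang detector driven by a once-per-second tick.
class HangWatchdog {
 public:
  explicit HangWatchdog(unsigned timeoutTicks) : mTimeoutTicks(timeoutTicks) {
    assert(timeoutTicks > 0);
  }

  // Called whenever the watched process makes progress.
  void Heartbeat() { mIdleTicks = 0; }

  // Called once per tick. Returns true once the process has been silent for
  // the whole timeout; a real watchdog would deliberately crash the process
  // at that point so that its stacks end up in a minidump.
  bool Tick() { return ++mIdleTicks >= mTimeoutTicks; }

 private:
  unsigned mTimeoutTicks;
  unsigned mIdleTicks = 0;
};
```

Crashing on purpose, rather than exiting quietly, is what makes the stacks of the hung process available for analysis.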

Randell Jesup [:jesup] (needinfo me)

Comment 21

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #20)
> I've created some code that will crash the child process when it's hung. We
> can see that this is indeed executing properly. The main process now exits
> successfully after 30 seconds of the child process being hung, and we're
> getting stack traces for the child process. I can't pretend to understand
> yet what's going on inside the child process:
>
> https://treeherder.mozilla.org/logviewer.html#?job_id=20691428&repo=try#L6387

Thanks Bas!

So the crash seems to indicate that the SharedThreadPool queue is not empty, and shutdown is stalled waiting for it. Almost certainly this means something is deadlocked or hung elsewhere, causing the shared pool to never empty.

Threads 18-29 look suspicious: xul.dll!nsThreadPool::Run(), which calls CondVar::Wait().

Any good ideas on how to figure out what's blocking SharedPool shutdown? Perhaps there are some ways to notice/debug "hung" sharedpool states? Or to dump out which-thread-owns-which-named-mutex at crash-reporter time?

Flags: needinfo?(nfroyd)

Bas Schouten (:bas.schouten)

Comment 22

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #21)
> (In reply to Bas Schouten (:bas.schouten) from comment #20)
> > I've created some code that will crash the child process when it's hung. We
> > can see that this is indeed executing properly. The main process now exits
> > successfully after 30 seconds of the child process being hung, and we're
> > getting stack traces for the child process. I can't pretend to understand
> > yet what's going on inside the child process:
> >
> > https://treeherder.mozilla.org/logviewer.html#?job_id=20691428&repo=try#L6387
>
> Thanks Bas!

I've been looking at this a little more as well. The only things that use SharedThreadPools seem to be in Media code. These 'pools' are supposed to be released explicitly. I'm currently looking at the Media code to figure out what could be going wrong. Maybe someone from media land has ideas.

> So the crash seems to indicate that the SharedThreadPool queue is not empty,
> and shutdown is stalled waiting for it. Almost certainly this means
> something is deadlocked or hung elsewhere, causing the shared pool to never
> empty.
>
> Threads 18-29 look suspicious: xul.dll!nsThreadPool::Run() which calls
> CondVar::Wait()
>
> Any good ideas on how to figure out what's blocking SharedPool shutdown?
> Perhaps there are some ways to notice/debug "hung" sharedpool states? Or to
> dump out which-thread-owns-which-named-mutex at crash-reporter time?

Flags: needinfo?(cpearce)

Randell Jesup [:jesup] (needinfo me)

Comment 23

•

9 years ago

The SharedThreadPools used by webrtc are currently only for the VideoFrameConverter in MediaPipeline, and I doubt that's involved here.

Bas Schouten (:bas.schouten)

Comment 24

•

9 years ago

The more suspicious of the two users would appear to be AsyncCubebTask::EnsureThread. I've moved the shutdown clearing there to an earlier phase of shutdown for one try push, and I've done another try push where I simply do some more logging. Hopefully that will tell us roughly where the culprit is.

Bas Schouten (:bas.schouten)

Comment 25

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #23)
> The SharedThreadPools used by webrtc are currently only for the
> VideoFrameConverter in MediaPipeline, and I doubt that's involved here.

Yeah, I didn't add the WebRTC one to my logging, assuming it wouldn't be relevant. I added logging to the two other users that seem like they'd be relevant:

https://dxr.mozilla.org/mozilla-central/search?q=SharedThreadPool%3A%3AGet&redirect=false&case=true

Nathan Froyd [:froydnj]

Comment 26

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #21)
> Any good ideas on how to figure out what's blocking SharedPool shutdown?
> Perhaps there are some ways to notice/debug "hung" sharedpool states? Or to
> dump out which-thread-owns-which-named-mutex at crash-reporter time?

Somebody's holding a ref to the SharedThreadPool; perhaps refcnt logging on SharedThreadPool would tell you who's hanging onto the pool for longer than they should be. Or perhaps the somebody who's supposed to release the last shared ref at xpcom-shutdown is a later observer than the SharedThreadPool, which means deadlock.
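To illustrate what owner-tagged refcount logging can reveal, here is a minimal sketch. `RefLog` is a hypothetical helper and not Gecko's refcount logging facility; the owner names in the usage note are placeholders taken from classes mentioned in this bug.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical ref-count log in which every AddRef/Release names its owner.
class RefLog {
 public:
  void AddRef(const std::string& owner) { ++mCounts[owner]; }

  void Release(const std::string& owner) {
    auto it = mCounts.find(owner);
    assert(it != mCounts.end() && "Release without a matching AddRef");
    if (it == mCounts.end()) return;
    if (--it->second == 0) mCounts.erase(it);
  }

  // Owners that still hold at least one reference, in sorted order.
  std::vector<std::string> LiveOwners() const {
    std::vector<std::string> owners;
    for (const auto& entry : mCounts) owners.push_back(entry.first);
    return owners;
  }

 private:
  std::map<std::string, int> mCounts;
};
```

If, say, "MediaTimer" and "TaskQueue" both take references and only "MediaTimer" releases all of its own, `LiveOwners()` at shutdown would return just "TaskQueue", pointing at the holder that outlived its welcome.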

Flags: needinfo?(nfroyd)

Bas Schouten (:bas.schouten)

Comment 27

•

9 years ago

So I have two theories here:

1. There's a theoretical race condition where AsyncCubebTask::EnsureThread gets called off the main thread after mozilla::KillClearOnShutdown is called for the ShutdownThreads phase.
2. There are cycle-collected objects holding on to a MediaTimer object that's holding on to the SharedThreadPool.
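The first theory comes down to an ordering problem: a lazily created object gets re-created after the shutdown step that was supposed to clear it. A minimal sketch of that ordering, and of one possible guard, follows. `LazyPool` and `Pool` are hypothetical names; this is not the AsyncCubebTask code, and a real fix would also need proper synchronization between threads.

```cpp
#include <cassert>
#include <memory>

struct Pool {};  // hypothetical stand-in for the shared thread pool

class LazyPool {
 public:
  // Unguarded lazy creation: re-creates the pool even after shutdown.
  Pool* EnsureUnguarded() {
    if (!mPool) mPool.reset(new Pool());
    return mPool.get();
  }

  // Guarded variant: refuses to create anything once shutdown has run.
  Pool* EnsureGuarded() {
    if (mShutDown) return nullptr;
    return EnsureUnguarded();
  }

  // Stand-in for the clear-on-shutdown step.
  void ClearOnShutdown() {
    assert(!mShutDown && "the clear step is expected to run once");
    mPool.reset();
    mShutDown = true;
  }

  bool HasPool() const { return mPool != nullptr; }

 private:
  std::unique_ptr<Pool> mPool;
  bool mShutDown = false;
};
```

With the unguarded variant, a late call after `ClearOnShutdown()` leaves a live pool that nothing will ever clear again, which is the kind of leftover that would keep a spin-until-empty loop waiting.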

Bas Schouten (:bas.schouten)

Comment 28

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #27)
> So I have two theories here:
>
> 1. There's a theoretical race condition where AsyncCubebTask::EnsureThread
> gets called off the main thread after mozilla::KillClearOnShutdown is called
> for the ShutdownThreads phase.
> 2. There are cycle-collected objects holding on to a MediaTimer object that's
> holding on to the SharedThreadPool.

For the record, I'm pushing logging additions to figure out which one it could be.

Bas Schouten (:bas.schouten)

Comment 29

•

9 years ago

[Child 3440] WARNING: NS_ENSURE_TRUE(context) failed: file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/xpcom/threads/nsThread.cpp, line 802

This is also in the log, which seems somewhat suspicious as it could cause events on a thread to go unprocessed. However, I see these a lot in our logs and they seem to be mostly innocent.

Randell Jesup [:jesup] (needinfo me)

Comment 30

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #29)
> [Child 3440] WARNING: NS_ENSURE_TRUE(context) failed: file
> c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/xpcom/threads/nsThread.cpp, line 802
>
> This is also in the log which seems somewhat suspicious as it could cause
> events on a thread to go unprocessed, however I see these a lot in our logs
> and it seems to be mostly innocent.

That's a symptom of ShutdownInternal() early-exiting because (likely) a thread is trying to call Shutdown() on itself, which is a no-no.

JW Wang [:jwwang] [:jw_wang]

Comment 31

•

9 years ago

https://hg.mozilla.org/mozilla-central/file/3461f3cae78495f100a0f7d3d2e0b89292d3ec02/xpcom/threads/SharedThreadPool.cpp#l146

It looks like it is safe to release the last ref count on the task queue thread.

Btw, as Nathan said, whenever you leak a SharedThreadPool, you will have a shutdown hang in SharedThreadPool::SpinUntilEmpty().

Bas Schouten (:bas.schouten)

Comment 32

•

9 years ago

(In reply to JW Wang [:jwwang] from comment #31)
> https://hg.mozilla.org/mozilla-central/file/3461f3cae78495f100a0f7d3d2e0b89292d3ec02/xpcom/threads/SharedThreadPool.cpp#l146
>
> It looks like it is safe to release the last ref count on the task queue
> thread.
>
> Btw, as Nathan said, whenever you leak a SharedThreadPool, you will have
> shutdown hang in SharedThreadPool::SpinUntilEmpty().

Right, the key is to figure out who's leaking it. Most likely it's media code somehow, since they're the only users, but how? :)

http://archive.mozilla.org/pub/firefox/try-builds/mozci-bot@mozilla.com-ea8d4359c8383ceedb2ab28b37522a5c2b13dcba/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm110-tests1-windows-build22.txt.gz

This seems to indicate that there are no MediaTimers alive, at least when the hang occurs.

Bas Schouten (:bas.schouten)

Comment 33

•

9 years ago

There seems to be one bug here at least, although at this point I'm not certain that it is the cause here:

while (!IsEmpty()) {
  sMonitor->AssertNotCurrentThreadIn();
  NS_ProcessNextEvent(NS_GetCurrentThread(), true);
}

This is passing aMayWait as true, which seems fine, but as far as I can tell it's possible for the last SharedThreadPool to be destroyed -off- the main thread without an event ever being processed by the main thread. That means it would just sit here without ever re-checking the pool size. I have tests pushed to see if this is the case.

Bas Schouten (:bas.schouten)

Comment 34

•

9 years ago

So here's a log that gives us two things:

http://archive.mozilla.org/pub/firefox/try-builds/mozci-bot@mozilla.com-70ef09533913577b0e92409e904c54f4b2611f47/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm126-tests1-windows-build22.txt.gz

Look around 14:10:03.

1. There's the problem I illustrated above. I added code to report every second which pools were still alive, but there are obvious breaks of several seconds during which SpinUntilEmpty will just sit there and not do anything. That's probably something we should address, but it is not the underlying problem.
2. More importantly, the problematic thread pool is a MediaThreadType::PLATFORM_DECODER, which seems to be used by TaskQueue or FlushableTaskQueue.

I'll continue digging into this, but it would be really great if someone from the media team chimed in here to help a little bit.

Randell Jesup [:jesup] (needinfo me)

Comment 35

•

9 years ago

As was seen elsewhere, TaskQueue generally can't release its last reference from itself; not sure if that's involved here.

Bas: thanks for tracking it down to a Platform Decoder pool!

Adding jya

Flags: needinfo?(jwwang)

Bas Schouten (:bas.schouten)

Comment 36

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #35)
> As was seen elsewhere, TaskQueue can't release its last reference from
> itself generally; not sure if that's involved here.
>
> Bas: thanks for tracking it down to a Platform Decoder pool!
>
> Adding jya

I stand corrected: I made a copy-paste mistake. It's the PLAYBACK pool!

Flags: needinfo?(rjesup)

Bas Schouten (:bas.schouten)

Comment 37

•

9 years ago

I wonder if something that is Cycle Collected is holding on to a Task Queue, since ShutdownThreads occurs way before final cycle collection does.

JW Wang [:jwwang] [:jw_wang]

Comment 38

•

9 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #33)
> There seems to be one bug here at least, although at this point I'm not
> certain that is the cause here:
>
> while (!IsEmpty()) {
>   sMonitor->AssertNotCurrentThreadIn();
>   NS_ProcessNextEvent(NS_GetCurrentThread(), true);
> }
>
> This is passing aMayWait as true, which seems fine, but it's possible for
> the last SharedThreadPool to be destroyed -off- the main thread without an
> event ever being processed by the main thread as far as I can tell. Which
> means it would just sit here without ever re-checking the pool size. I have
> tests pushed to see if this is the case.

SharedThreadPool::Release(void)
{
  sPools->Remove(mName);
  NS_DispatchToMainThread(NewRunnableMethod(mPool, &nsIThreadPool::Shutdown));
}

It posts an event to the main thread after removing the pool's name from sPools. So the situation you described will not happen.
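The difference between the two scenarios discussed here (a release that only updates the pool table versus a release that also posts an event to the main thread) can be made concrete with a small single-threaded model. `SpinLoop` and its methods are hypothetical and only mimic the shape of the loop quoted above; "sleeping" stands for the main thread being blocked while waiting for an event.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical model of a main thread that blocks waiting for events while
// pools are still registered.
class SpinLoop {
 public:
  void AddPool(const std::string& name) { mPools.insert(name); }

  // A release that only updates the table and posts nothing.
  void RemoveSilently(const std::string& name) { mPools.erase(name); }

  // A release that updates the table and then posts an event to the loop.
  void RemoveAndNotify(const std::string& name) {
    mPools.erase(name);
    mEvents.push_back([] { /* the real event would shut the pool down */ });
  }

  // Advances the loop as far as it can go. Returns true once it has exited.
  bool Poll() {
    if (mSleeping) {
      if (mEvents.empty()) return false;  // still asleep, nothing re-checked
      mSleeping = false;
      RunOne();
    }
    while (!mPools.empty()) {
      if (mEvents.empty()) {
        mSleeping = true;  // a real thread would block here
        return false;
      }
      RunOne();
    }
    return true;
  }

 private:
  void RunOne() {
    assert(!mEvents.empty());
    std::function<void()> event = std::move(mEvents.front());
    mEvents.pop_front();
    event();
  }

  std::set<std::string> mPools;
  std::deque<std::function<void()>> mEvents;
  bool mSleeping = false;
};
```

In the model, `RemoveSilently` leaves a sleeping loop stuck even though the table is empty, while `RemoveAndNotify` wakes it so the emptiness check runs again. That is why the dispatch in `Release` matters.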

JW Wang [:jwwang] [:jw_wang]

Comment 39

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #35)
> As was seen elsewhere, TaskQueue can't release its last reference from
> itself generally; not sure if that's involved here.

Per comment 31:

SharedThreadPool::Release(void)
{
  NS_DispatchToMainThread(NewRunnableMethod(mPool, &nsIThreadPool::Shutdown));
}

The underlying mPool is always shut down on the main thread. So I think it is safe to release the last ref count from the task queue thread, IIUC.

Flags: needinfo?(jwwang)

Bas Schouten (:bas.schouten)

Comment 40

•

9 years ago

(In reply to JW Wang [:jwwang] from comment #39)
> (In reply to Randell Jesup [:jesup] from comment #35)
> > As was seen elsewhere, TaskQueue can't release its last reference from
> > itself generally; not sure if that's involved here.
>
> per comment 31:
>
> SharedThreadPool::Release(void)
> {
>   NS_DispatchToMainThread(NewRunnableMethod(mPool,
>                                             &nsIThreadPool::Shutdown));
> }
>
> The underlying mPool is always shut down in the main thread. So I think it
> is safe to release the last ref count from the task queue thread IIUC.

Ah, yep, you're right. Still, that should probably be documented; it's the sort of thing that's easily changed without going back to change this. In any case, as I said, that's not the problem here anyway. The problem here is just some media object staying alive and holding on to a task queue. I've done an additional push to figure out what type of object it is.

JW Wang [:jwwang] [:jw_wang]

Comment 41

•

9 years ago

I think the problem is that whenever you leak a SharedThreadPool, you get a shutdown hang in SharedThreadPool::SpinUntilEmpty(), which is misleading and makes leaks hard to debug.

I am wondering whether we can just remove SharedThreadPool::SpinUntilEmpty() and depend on the client code of SharedThreadPool to finish all its jobs before xpcom-shutdown. A class such as MediaShutdownManager should definitely ease the pain of management. However, it is still up to the client code to ensure the correct shutdown sequence.

Hi Chris,
What do you think about it?

Bas Schouten (:bas.schouten)

Comment 42

•

9 years ago

(In reply to JW Wang [:jwwang] from comment #41)
> I think the problem is whenever you leak a SharedThreadPool, you will have
> shutdown hang in SharedThreadPool::SpinUntilEmpty() which makes it
> misleading and hard to debug leaks.
>
> I am thinking can we just remove SharedThreadPool::SpinUntilEmpty() and
> depend on the client code of SharedThreadPool to finish all its jobs before
> xpcom-shutdown. A class as MediaShutdownManager should definitely ease the
> pain of management. However, it is still up to the client code to ensure the
> correct shutdown sequence.
>
> Hi Chris,
> What do you think about it?

So well, I still think we should address the actual issues here. Maybe timing out in SpinUntilEmpty and logging what got leaked is a useful thing to do. In any case, we are leaking a MediaDecoderReader element. Could that be due to cycle collection?
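The suggestion of timing out and logging what got leaked could look roughly like the sketch below. It is an assumption-laden illustration, not a proposed patch: `SpinUntilEmptyOrTimeout` is a hypothetical free function, the pool table is modeled as a set of names, and a spin budget replaces a wall-clock timeout to keep the example deterministic.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Spins at most maxSpins times, then gives up and reports what is left.
// `pools` models the table of live pool names; `processNextEvent` models one
// turn of the event loop and may remove entries from that table.
std::vector<std::string> SpinUntilEmptyOrTimeout(
    const std::set<std::string>& pools,
    const std::function<void()>& processNextEvent,
    unsigned maxSpins) {
  assert(processNextEvent != nullptr);
  for (unsigned spin = 0; spin < maxSpins && !pools.empty(); ++spin) {
    processNextEvent();
  }
  // Anything still registered here is a leak candidate worth logging.
  return std::vector<std::string>(pools.begin(), pools.end());
}
```

A caller would log the returned names (for example "PLAYBACK") before continuing shutdown, so a leak shows up as a named pool in the log rather than as an anonymous hang.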

Flags: needinfo?(jyavenard)

part1_forece_e10s_tests.patch (9 years ago, JW Wang [:jwwang] [:jw_wang], 3.01 KB, patch)
part2_debug_leaks.patch (9 years ago, JW Wang [:jwwang] [:jw_wang], 5.71 KB, patch)
MozReview Request: Bug 1264694: [MSE] P2. Clear mTaskQueue early when no longer required. r?jwwang (9 years ago, Jean-Yves Avenard [:jya], 58 bytes, text/x-review-board-request; jwwang: review+)
MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang (9 years ago, Jean-Yves Avenard [:jya], 58 bytes, text/x-review-board-request; jwwang: review+)
MozReview Request: Bug 1264694: [MSE] P1. Ensure we only add source buffer tasks on the task queue. r?jwwang (9 years ago, Jean-Yves Avenard [:jya], 58 bytes, text/x-review-board-request; jwwang: review+, ritu: approval-mozilla-aurora+)