Closed Bug 1264694 Opened 9 years ago Closed 9 years ago

Frequent WinXP debug e10s test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] on Ash

Categories

(Core :: Audio/Video: Playback, defect)

Unspecified
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
e10s + ---
firefox48 --- fixed
firefox49 --- fixed

People

(Reporter: RyanVM, Assigned: jya)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: crash, intermittent-failure, Whiteboard: [e10s-orangeblockers])

Attachments

(5 files)

I think we have some other known media shutdown issues right now, so this may be a dupe. Maybe bug 1245574 is related? https://treeherder.mozilla.org/logviewer.html#?job_id=166879&repo=ash 06:22:25 INFO - 37 INFO TEST-START | Shutdown 06:22:25 INFO - 38 INFO Passed: 96 06:22:25 INFO - 39 INFO Failed: 0 06:22:25 INFO - 40 INFO Todo: 1 06:22:25 INFO - 41 INFO Mode: e10s 06:22:25 INFO - 42 INFO Slowest: 4873ms - /tests/dom/media/mediasource/test/test_SplitAppendDelay.html 06:22:25 INFO - 43 INFO SimpleTest FINISHED 06:22:25 INFO - 44 INFO TEST-INFO | Ran 1 Loops 06:22:25 INFO - 45 INFO SimpleTest FINISHED 06:22:25 WARNING - TEST-UNEXPECTED-FAIL | dom/media/mediasource/test/test_TruncatedDuration.html | application terminated with exit code 1 06:22:25 INFO - runtests.py | Application ran for: 0:01:38.093000 06:22:25 INFO - zombiecheck | Reading PID log: c:\docume~1\cltbld~1.t-x\locals~1\temp\tmprbnmbxpidlog 06:22:25 INFO - ==> process 824 launched child process 2152 ("C:\slave\test\build\application\firefox\plugin-container.exe" --channel="824.0.934698618\1414353303" -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -sandbox -appdir "C:\slave\test\build\application\firefox\browser" 824 "\\.\pipe\gecko-crash-server-pipe.824" tab) 06:22:25 INFO - ==> process 824 launched child process 2176 ("C:\slave\test\build\application\firefox\plugin-container.exe" --channel="824.5.1207940992\1022539989" -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -sandbox -appdir "C:\slave\test\build\application\firefox\browser" 824 "\\.\pipe\gecko-crash-server-pipe.824" tab) 06:22:25 INFO - zombiecheck | Checking for orphan process with PID: 2152 06:22:25 INFO - zombiecheck | Checking for orphan process with PID: 2176 06:22:25 INFO - mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\docume~1\cltbld~1.t-x\locals~1\temp\tmp4ssegv.mozrunner\minidumps\77a012f8-ac59-4f94-9a98-0daa12c45fe2.dmp C:\slave\test\build\symbols 06:22:39 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\77a012f8-ac59-4f94-9a98-0daa12c45fe2.dmp 06:22:39 INFO - mozcrash Saved app info as C:\slave\test\build\blobber_upload_dir\77a012f8-ac59-4f94-9a98-0daa12c45fe2.extra 06:22:39 WARNING - PROCESS-CRASH | dom/media/mediasource/test/test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] 06:22:39 INFO - Crash dump filename: c:\docume~1\cltbld~1.t-x\locals~1\temp\tmp4ssegv.mozrunner\minidumps\77a012f8-ac59-4f94-9a98-0daa12c45fe2.dmp 06:22:39 INFO - Operating system: Windows NT 06:22:39 INFO - 5.1.2600 Service Pack 3 06:22:39 INFO - CPU: x86 06:22:39 INFO - GenuineIntel family 6 model 30 stepping 5 06:22:39 INFO - 8 CPUs 06:22:39 INFO - Crash reason: EXCEPTION_BREAKPOINT 06:22:39 INFO - Crash address: 0x58284f7 06:22:39 INFO - Process uptime: 98 seconds 06:22:39 INFO - Thread 41 (crashed) 06:22:39 INFO - 0 xul.dll!mozilla::`anonymous namespace'::RunWatchdog [nsTerminator.cpp:91115264629d : 158 + 0x22] 06:22:39 INFO - eip = 0x058284f7 esp = 0x19bfff44 ebp = 0x19bfff4c ebx = 0x00000000 06:22:39 INFO - esi = 0x0000009e edi = 0x0000003f eax = 0x07170998 ecx = 0x0052705d 06:22:39 INFO - edx = 0x00576140 efl = 0x00000206 06:22:39 INFO - Found by: given as instruction pointer in context 06:22:39 INFO - 1 nss3.dll!_PR_NativeRunThread [pruthr.c:91115264629d : 397 + 0x6] 06:22:39 INFO - eip = 0x00bffd72 esp = 0x19bfff54 ebp = 0x19bfff6c 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 2 nss3.dll!pr_root [w95thred.c:91115264629d : 95 + 0xa] 06:22:39 INFO - eip = 0x00bf3d56 esp = 0x19bfff74 ebp = 0x19bfff78 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 3 ucrtbase.dll!_crt_at_quick_exit + 0x104 06:22:39 INFO - eip = 0x005262a4 esp = 0x19bfff80 ebp = 0x19bfffb4 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 4 kernel32.dll!BaseThreadStart + 0x37 06:22:39 INFO - eip = 0x7c80b713 esp = 0x19bfffbc ebp = 0x19bfffec 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - Thread 0 06:22:39 INFO - 0 ntdll.dll!KiFastSystemCallRet + 0x0 06:22:39 INFO - eip = 0x7c90e4f4 esp = 0x0012f92c ebp = 0x0012f990 ebx = 0x0090f9ec 06:22:39 INFO - esi = 0x00000710 edi = 0x00000000 eax = 0x00002500 ecx = 0x0012f738 06:22:39 INFO - edx = 0x00002525 efl = 0x00000246 06:22:39 INFO - Found by: given as instruction pointer in context 06:22:39 INFO - 1 ntdll.dll!ZwWaitForSingleObject + 0xc 06:22:39 INFO - eip = 0x7c90df3c esp = 0x0012f930 ebp = 0x0012f990 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 2 kernel32.dll!WaitForSingleObjectEx + 0x8b 06:22:39 INFO - eip = 0x7c8025db esp = 0x0012f934 ebp = 0x0012f990 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 3 kernel32.dll!WaitForSingleObject + 0x12 06:22:39 INFO - eip = 0x7c802542 esp = 0x0012f998 ebp = 0x0012f9a4 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 4 nss3.dll!_PR_MD_WAIT_CV [w95cv.c:91115264629d : 248 + 0xf] 06:22:39 INFO - eip = 0x00bf1ddf esp = 0x0012f9ac ebp = 0x0012f9c0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 5 nss3.dll!_PR_WaitCondVar [prucv.c:91115264629d : 172 + 0x1a] 06:22:39 INFO - eip = 0x00bfe654 esp = 0x0012f9c8 ebp = 0x0012f9e0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 6 nss3.dll!PR_WaitCondVar [prucv.c:91115264629d : 525 + 0xb] 06:22:39 INFO - eip = 0x00bfe1c9 esp = 0x0012f9e8 ebp = 0x0012fa04 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 7 xul.dll!mozilla::CondVar::Wait(unsigned int) [BlockingResourceBase.cpp:91115264629d : 501 + 0x9] 06:22:39 INFO - eip = 0x0338c9ee esp = 0x0012fa0c ebp = 0x0012fa24 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 8 xul.dll!nsEventQueue::GetEvent(bool,nsIRunnable * *,mozilla::BaseAutoLock<mozilla::Mutex> &) [nsEventQueue.cpp:91115264629d : 55 + 0xa] 06:22:39 INFO - eip = 0x03362945 esp = 0x0012fa2c ebp = 0x0012fa38 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 9 xul.dll!nsThread::ProcessNextEvent(bool,bool *) [nsThread.cpp:91115264629d : 984 + 0x23] 06:22:39 INFO - eip = 0x03365421 esp = 0x0012fa40 ebp = 0x0012fb30 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 10 xul.dll!NS_ProcessNextEvent(nsIThread *,bool) [nsThreadUtils.cpp:91115264629d : 290 + 0xd] 06:22:39 INFO - eip = 0x03394b46 esp = 0x0012fb38 ebp = 0x0012fb44 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 11 xul.dll!mozilla::layers::CompositorBridgeParent::ShutDown() [CompositorBridgeParent.cpp:91115264629d : 648 + 0x9] 06:22:39 INFO - eip = 0x03e27a0e esp = 0x0012fb4c ebp = 0x0012fb80 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 12 xul.dll!mozilla::ShutdownXPCOM(nsIServiceManager *) [XPCOMInit.cpp:91115264629d : 872 + 0x5] 06:22:39 INFO - eip = 0x0338c0ca esp = 0x0012fb5c ebp = 0x0012fb80 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 13 xul.dll!ScopedXPCOMStartup::~ScopedXPCOMStartup() [nsAppRunner.cpp:91115264629d : 1466 + 0x7] 06:22:39 INFO - eip = 0x058346b0 esp = 0x0012fb88 ebp = 0x0012fba0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 14 xul.dll!mozilla::DefaultDelete<ScopedXPCOMStartup>::operator()(ScopedXPCOMStartup *) [UniquePtr.h:91115264629d : 528 + 0xe] 06:22:39 INFO - eip = 0x05834bcc esp = 0x0012fba8 ebp = 0x0012fba8 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 15 xul.dll!XREMain::XRE_main(int,char * * const,nsXREAppData const *) [nsAppRunner.cpp:91115264629d : 4463 + 0x12] 06:22:39 INFO - eip = 0x0583a26d esp = 0x0012fbb0 ebp = 0x0012fbd0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 16 xul.dll!XRE_main [nsAppRunner.cpp:91115264629d : 4543 + 0x12] 06:22:39 INFO - eip = 0x0583cd95 esp = 0x0012fbd8 ebp = 0x0012fce8 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 17 firefox.exe!do_main [nsBrowserApp.cpp:91115264629d : 220 + 0x1c] 06:22:39 INFO - eip = 0x00402884 esp = 0x0012fcf0 ebp = 0x0012fe88 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 18 firefox.exe!NS_internal_main(int,char * *,char * *) [nsBrowserApp.cpp:91115264629d : 360 + 0xf] 06:22:39 INFO - eip = 0x00402180 esp = 0x0012fe90 ebp = 0x0012ff34 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 19 firefox.exe!wmain [nsWindowsWMain.cpp:91115264629d : 135 + 0xe] 06:22:39 INFO - eip = 0x00402cd8 esp = 0x0012ff3c ebp = 0x0012ff74 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 20 firefox.exe!__scrt_common_main_seh [exe_common.inl : 264 + 0x1d] 06:22:39 INFO - eip = 0x004055ba esp = 0x0012ff7c ebp = 0x0012ffc0 06:22:39 INFO - Found by: call frame info 06:22:39 INFO - 21 kernel32.dll!BaseProcessStart + 0x23 06:22:39 INFO - eip = 0x7c817067 esp = 0x0012ffc8 ebp = 0x0012fff0 06:22:39 INFO - Found by: call frame info
Oh, this is probably a dupe of bug 1261571, which also spiked around the same time. Many of the tests in the mediasource directory are skipped on XP due to lack of MP4 support.
See Also: → 1261571
Rather, bug 1264082 is the one that has spiked in a similar timeframe.
See Also: → 1264082
Anthony/Nical - looks like playback and/or gfx (times out while trying to shutdown Compositor): 06:22:39 INFO - 11 xul.dll!mozilla::layers::CompositorBridgeParent::ShutDown() [CompositorBridgeParent.cpp:91115264629d : 648 + 0x9]
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(ajones)
Fixing bug 1262898 would certainly help a lot here. Basically there are CompositorThreadHolders that are not destroyed in time, and CompositorParent::ShutDown spins the event loop until all COmpositorThreadHolders are gone. CompositorThreadHolders are typically kept by top-level IPDL protocols like CompositorBridgeParent, ImageBridgeParent, etc. which need the compositor thread to be alive to operate. Ensuring that these get shut down in time should fix this (in the case of the two protocols I mentioned, we need their ActorDestroy hook to run, which happens when the channel is closed, either because the other side closed it properly, or because the child process is killed).
Depends on: 1262898
Flags: needinfo?(nical.bugzilla)
Bug 1261571 also waiting CompositorBridgeParent::ShutDown(). But it is on linux.
It's actually worse now. We've got an additional |PROCESS-CRASH | dom/media/tests/mochitest/test_zmedia_cleanup.html | application crashed [@ RtlpWaitForCriticalSection + 0x5b]| as well these days.
The frequency does appear to be lower now, however. It's failing ~30% instead of closer to 90% of the time.
See Also: → 1268332
(In reply to Ryan VanderMeulen [:RyanVM] from comment #13) > It's actually worse now. We've got an additional |PROCESS-CRASH | > dom/media/tests/mochitest/test_zmedia_cleanup.html | application crashed [@ > RtlpWaitForCriticalSection + 0x5b]| as well these days. I've spun this off to bug 1268332.
Summary: Nearly permafail WinXP debug test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] on Ash → Frequent WinXP debug e10s test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] on Ash
Whiteboard: [e10s-orangeblockers]
Blake - does JW have time to look into this?
Flags: needinfo?(bwu)
In my queue.
Assignee: nobody → jwwang
Flags: needinfo?(bwu)
I've done try pushes with a lot of logging, it's not entirely clear what's going on here. We really need to get the stack of the child process when it dies. See also: https://treeherder.mozilla.org/logviewer.html#?job_id=20537210&repo=try#L6273
Depends on: 1270172
I've created some code that will crash the child process when it's hung. We can see that this is indeed executing properly. The main process no exits successfully after 30 seconds of the child process being hung, and we're getting stack traces for the child process. I can't pretend to understand yet what's going on inside the child process: https://treeherder.mozilla.org/logviewer.html#?job_id=20691428&repo=try#L6387
(In reply to Bas Schouten (:bas.schouten) from comment #20) > I've created some code that will crash the child process when it's hung. We > can see that this is indeed executing properly. The main process no exits > successfully after 30 seconds of the child process being hung, and we're > getting stack traces for the child process. I can't pretend to understand > yet what's going on inside the child process: > > https://treeherder.mozilla.org/logviewer.html#?job_id=20691428&repo=try#L6387 Thanks Bas! So the crash seems to indicate that the SharedThreadPool queue is not empty, and shutdown is stalled waiting for it. Almost certainly this means something is deadlocked or hung elsewhere, causing the shared pool to never empty. Threads 18-29 look suspicious: xul.dll!nsThreadPool::Run() which calls CondVar::Wait() Any good ideas on how to figure out what's blocking SharedPool shutdown? Perhaps there are some ways to notice/debug "hung" sharedpool states? Or to dump out which-thread-owns-which-named-mutex at crash-reporter time?
Flags: needinfo?(nfroyd)
(In reply to Randell Jesup [:jesup] from comment #21) > (In reply to Bas Schouten (:bas.schouten) from comment #20) > > I've created some code that will crash the child process when it's hung. We > > can see that this is indeed executing properly. The main process no exits > > successfully after 30 seconds of the child process being hung, and we're > > getting stack traces for the child process. I can't pretend to understand > > yet what's going on inside the child process: > > > > https://treeherder.mozilla.org/logviewer.html#?job_id=20691428&repo=try#L6387 > > Thanks Bas! I've been looking at this a little more as well, the only things that use SharedThreadPools seem to be in Media code. These 'pools' are supposed to be released explicitly. I'm currently looking at the Media code to figure out what could be going wrong. Maybe someone from media land has ideas. > > So the crash seems to indicate that the SharedThreadPool queue is not empty, > and shutdown is stalled waiting for it. Almost certainly this means > something is deadlocked or hung elsewhere, causing the shared pool to never > empty. > > Threads 18-29 look suspicious: xul.dll!nsThreadPool::Run() which calls > CondVar::Wait() > > Any good ideas on how to figure out what's blocking SharedPool shutdown? > Perhaps there are some ways to notice/debug "hung" sharedpool states? Or to > dump out which-thread-owns-which-named-mutex at crash-reporter time?
Flags: needinfo?(cpearce)
The SharedThreadPools used by webrtc are currently only for the VideoFrameConverter in MediaPipeline, and I doubt that's involved here.
The most suspicious of the two users would appear to be AsyncCubebTask::EnsureThread, I've moved the shutdown clearing there to an earlier phase of shutdown for a try push, and I've done another try push where I simply do some more logging. Hopefully that will give us the information of roughly where the culprit is.
(In reply to Randell Jesup [:jesup] from comment #23) > The SharedThreadPools used by webrtc are currently only for the > VideoFrameConverter in MediaPipeline, and I doubt that's involved here. Yeah I didn't add the WebRTC one in my logging assuming it wouldn't be relevant. I added logging to the two other users that seem like they'd be relevant: https://dxr.mozilla.org/mozilla-central/search?q=SharedThreadPool%3A%3AGet&redirect=false&case=true
(In reply to Randell Jesup [:jesup] from comment #21) > Any good ideas on how to figure out what's blocking SharedPool shutdown? > Perhaps there are some ways to notice/debug "hung" sharedpool states? Or to > dump out which-thread-owns-which-named-mutex at crash-reporter time? Somebody's holding a ref to the SharedThreadPool; perhaps refcnt logging on SharedThreadPool would tell you who's hanging onto the pool for longer than they should be. Or perhaps the somebody who's supposed to release the last shared ref at xpcom-shutdown is a later observer than the SharedThreadPool, which means deadlock.
Flags: needinfo?(nfroyd)
So I have two theories here: 1. There's a theoretical race condition where AsyncCubebTask::EnsureThread gets called off the MainThread after Mozilla::KillClearShutdown is called for the ShutdownThreads phase. 2. There's cycle collected objects holding on to a MediaTimer object that's holding on to the SharedThreadPool.
(In reply to Bas Schouten (:bas.schouten) from comment #27) > So I have two theories here: > > 1. There's a theoretical race condition where AsyncCubebTask::EnsureThread > gets called off the MainThread after Mozilla::KillClearShutdown is called > for the ShutdownThreads phase. > 2. There's cycle collected objects holding on to a MediaTimer object that's > holding on to the SharedThreadPool. For the record, I'm pushing logging additions to figure out which one it could be.
[Child 3440] WARNING: NS_ENSURE_TRUE(context) failed: file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/xpcom/threads/nsThread.cpp, line 802 This is also in the log which seems somewhat suspicious as it could cause events on a thread to go unprocessed, however I see these a lot in our logs and it seems to be mostly innocent.
(In reply to Bas Schouten (:bas.schouten) from comment #29) > [Child 3440] WARNING: NS_ENSURE_TRUE(context) failed: file > c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/xpcom/threads/ > nsThread.cpp, line 802 > > This is also in the log which seems somewhat suspicious as it could cause > events on a thread to go unprocessed, however I see these a lot in our logs > and it seems to be mostly innocent. That's a symptom of ShutdownInternal() early-exiting because (likely) a thread is trying to call Shutdown() on itself, which is a no-no
https://hg.mozilla.org/mozilla-central/file/3461f3cae78495f100a0f7d3d2e0b89292d3ec02/xpcom/threads/SharedThreadPool.cpp#l146 It looks like it is safe to release the last ref count on the task queue thread. Btw, as Nathan said, whenever you leak a SharedThreadPool, you will have shutdown hang in SharedThreadPool::SpinUntilEmpty().
(In reply to JW Wang [:jwwang] from comment #31) > https://hg.mozilla.org/mozilla-central/file/ > 3461f3cae78495f100a0f7d3d2e0b89292d3ec02/xpcom/threads/SharedThreadPool. > cpp#l146 > > It looks like it is safe to release the last ref count on the task queue > thread. > > Btw, as Nathan said, whenever you leak a SharedThreadPool, you will have > shutdown hang in SharedThreadPool::SpinUntilEmpty(). Right.. the key is to figure out who's leaking it.. well, media code somehow most likely since they're the only users.. but how :). http://archive.mozilla.org/pub/firefox/try-builds/mozci-bot@mozilla.com-ea8d4359c8383ceedb2ab28b37522a5c2b13dcba/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm110-tests1-windows-build22.txt.gz Seems to indicate there's no alive MediaTimers at least when the hang occurs.
There seems to be one bug here at least, although at this point I'm not certain that is the cause here: while (!IsEmpty()) { sMonitor->AssertNotCurrentThreadIn(); NS_ProcessNextEvent(NS_GetCurrentThread(), true); } This is passing aMayWait as true, which seems fine, but it's possible for the last SharedThreadPool to be destroyed -off- the main thread without an event ever being processed by the main thread as far as I can tell. Which means it would just sit here without ever re-checking the pool size. I have tests pushed to see if this is the case.
So here's a log that gives us two things: http://archive.mozilla.org/pub/firefox/try-builds/mozci-bot@mozilla.com-70ef09533913577b0e92409e904c54f4b2611f47/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm126-tests1-windows-build22.txt.gz Look around 14:10:03. 1. There's the problem I illustrated above, I added code to report every second what pools were still alive. But there's obvious breaks of several seconds during which SpinUntilEmpty will just sit there and not do anything. That's probably something we should address but not the underlying problem. 2. More importantly the problematic thread pool is a MediaThreadType::PLATFORM_DECODER, which seems to be used by TaskQueue or FlushableTaskQueue.. I'll continue digging into this but it would be really great if someone from the media team actually chimed in here to help a little bit.
As was seen elsewhere, TaskQueue can't release it's last reference from itself generally; not sure if that's involved here. Bas: thanks for tracking it down to a Platform Decoder pool! Adding jya
Flags: needinfo?(jwwang)
(In reply to Randell Jesup [:jesup] from comment #35) > As was seen elsewhere, TaskQueue can't release it's last reference from > itself generally; not sure if that's involved here. > > Bas: thanks for tracking it down to a Platform Decoder pool! > > Adding jya I stand corrected, I copy-paste failed. It's the PLAYBACK pool!
Flags: needinfo?(rjesup)
I wonder if something that is Cycle Collected is holding on to a Task Queue, since ShutdownThreads occurs way before final cycle collection does.
(In reply to Bas Schouten (:bas.schouten) from comment #33) > There seems to be one bug here at least, although at this point I'm not > certain that is the cause here: > > while (!IsEmpty()) { > sMonitor->AssertNotCurrentThreadIn(); > NS_ProcessNextEvent(NS_GetCurrentThread(), true); > } > > This is passing aMayWait as true, which seems fine, but it's possible for > the last SharedThreadPool to be destroyed -off- the main thread without an > event ever being processed by the main thread as far as I can tell. Which > means it would just sit here without ever re-checking the pool size. I have > tests pushed to see if this is the case. SharedThreadPool::Release(void) { sPools->Remove(mName); NS_DispatchToMainThread(NewRunnableMethod(mPool, &nsIThreadPool::Shutdown)); } It posts an event the the main thread after removing the string from the pool. So the situation you described will not happen.
(In reply to Randell Jesup [:jesup] from comment #35) > As was seen elsewhere, TaskQueue can't release it's last reference from > itself generally; not sure if that's involved here. per comment 31: SharedThreadPool::Release(void) { NS_DispatchToMainThread(NewRunnableMethod(mPool, &nsIThreadPool::Shutdown)); } The underlying mPool is always shut down in the main thread. So I think it is safe to release the last ref count from the task queue thread IIUC.
Flags: needinfo?(jwwang)
(In reply to JW Wang [:jwwang] from comment #39) > (In reply to Randell Jesup [:jesup] from comment #35) > > As was seen elsewhere, TaskQueue can't release it's last reference from > > itself generally; not sure if that's involved here. > > per comment 31: > > SharedThreadPool::Release(void) > { > NS_DispatchToMainThread(NewRunnableMethod(mPool, > &nsIThreadPool::Shutdown)); > } > > The underlying mPool is always shut down in the main thread. So I think it > is safe to release the last ref count from the task queue thread IIUC. Ah, yep, you're right, still, that should probably be documented, sort of something that's easily changed without going back to change this. In any case, as I said, not the problem here anyway. The problem here is just some media object staying alive holding on to a taskqueue. I've done an additional push to figure out what type of object it is.
I think the problem is whenever you leak a SharedThreadPool, you will have shutdown hang in SharedThreadPool::SpinUntilEmpty() which makes it misleading and hard to debug leaks. I am thinking can we just remove SharedThreadPool::SpinUntilEmpty() and depend on the client code of SharedThreadPool to finish all its jobs before xpcom-shutdown. A class as MediaShutdownManager should definitely ease the pain of management. However, it is still up to the client code to ensure the correct shutdown sequence. Hi Chris, What do you think about it?
(In reply to JW Wang [:jwwang] from comment #41) > I think the problem is whenever you leak a SharedThreadPool, you will have > shutdown hang in SharedThreadPool::SpinUntilEmpty() which makes it > misleading and hard to debug leaks. > > I am thinking can we just remove SharedThreadPool::SpinUntilEmpty() and > depend on the client code of SharedThreadPool to finish all its jobs before > xpcom-shutdown. A class as MediaShutdownManager should definitely ease the > pain of management. However, it is still up to the client code to ensure the > correct shutdown sequence. > > Hi Chris, > What do you think about it? So well, I still think we should address the actual issues here. Maybe timing out in SpinUntilEmpty and logging what got leaked is a useful thing to do. In any case, we are leaking a MediaDecoderReader element. Could that be due to cycle collection?
Flags: needinfo?(jyavenard)
MediaDecoderReaders aren't cycle collected.
Flags: needinfo?(jyavenard)
(In reply to Jean-Yves Avenard [:jya] from comment #43) > MediaDecoderReaders aren't cycle collected. So any idea why they may still be alive during the thread shutdown phase?
Flags: needinfo?(jyavenard)
(In reply to Bas Schouten (:bas.schouten) from comment #42) > So well, I still think we should address the actual issues here. Maybe > timing out in SpinUntilEmpty and logging what got leaked is a useful thing > to do. We should. What I propose is turn a shutdown hang into leaks which are easier to tackle.
(In reply to Bas Schouten (:bas.schouten) from comment #44) > So any idea why they may still be alive during the thread shutdown phase? Which log shows it is still alive?
(In reply to JW Wang [:jwwang] from comment #46) > (In reply to Bas Schouten (:bas.schouten) from comment #44) > > So any idea why they may still be alive during the thread shutdown phase? > Which log shows it is still alive? http://archive.mozilla.org/pub/firefox/try-builds/mozci-bot@mozilla.com-ad4d795669930d1b3d8b93f86b8dee56cea5a678/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm126-tests1-windows-build24.txt.gz
18:09:56 INFO - SharedThreadPool::SpinUntilEmpty waiting for ThreadPools: 18:09:56 INFO - Alive thread pool: MediaPlayback 18:09:56 INFO - Alive MediaSourceDemuxers: 0 18:09:56 INFO - Alive Benchmarks: 0 18:09:56 INFO - Alive MediaDecoderReaders: 3 18:09:56 INFO - Alive MediaDecoderStateMachines: 0
It is interesting that no MediaDecoderStateMachine alive. The only thing I can think of to keep MediaDecoderReader alive is MediaDecodeTask in webaudio.
Don't know why it could kept alive. Seeing that this is showing up with MSE. If MediaSourceDemuxers count is 0, then we know the MediaFormatReader::Shutdown was called (as it's the only place that kills the demuxers) Is there a way to see if there are any MediaSourceDecoder (or MediaDecoder for that matter) alive. The MediaSourceDecoder is the only object with a strong reference to the MediaFormatReader when running this mochitest (that and the MediaDecoderStateMachine but your logs show that there are none left).
MediaSource also holds a strong reference to the MediaSourceDecoder. MediaSource objects are cycle collected.
Flags: needinfo?(jyavenard)
(In reply to Jean-Yves Avenard [:jya] from comment #51) > MediaSource also holds a strong reference to the MediaSourceDecoder. > > MediaSource objects are cycle collected. I've added logging for MediaSourceDecoder and MediaDecodeTask. Let's see.
can you include MediaSource too ?
(In reply to Jean-Yves Avenard [:jya] from comment #53) > can you include MediaSource too ? Done.
http://archive.mozilla.org/pub/firefox/try-builds/bschouten@mozilla.com-5fbee8b60fb7ab386f27c54a825c12715ba3e323/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm109-tests1-windows-build26.txt.gz There's definitely MediaSource and MediaSourceDecoder still alive at shutdown, we are seeing that TaskQueue::BeginShutdown -is- being called for every MediaDecoderReader though. Apparently that's not enough.
I found a way to repro the hang: 1. apply part1_forece_e10s_tests.patch and part2_debug_leaks.patch 2. push to try with "try: -b d -p win32 -u mochitest-media[Windows XP] -t none[Windows XP]" result: https://treeherder.mozilla.org/logviewer.html#?job_id=20852760&repo=try#L2856 logs: SharedThreadPool::SpinUntilEmpty() -- begin Alive thread pool: MediaPlayback2 Hit MOZ_CRASH() at c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/xpcom/build/XPCOMInit.cpp:835 MediaPlayback2 is from https://hg.mozilla.org/try/diff/2fa5d5771c7f/dom/media/mediasource/MediaSourceDemuxer.cpp#l1.13. It looks like MediaSourceDemuxer::mTaskQueue is not deleted.
Flags: needinfo?(rjesup)
Flags: needinfo?(jyavenard)
Flags: needinfo?(cpearce)
MediaSourceDemuxer::mTaskQueue is shared across MediaSourceDemuxer and TrackBuffersManager can you reproduce the issue locally or only on try? Why would this hang be only on Windows XP?
Flags: needinfo?(jyavenard)
(In reply to Bas Schouten (:bas.schouten) from comment #55) > http://archive.mozilla.org/pub/firefox/try-builds/bschouten@mozilla.com- > 5fbee8b60fb7ab386f27c54a825c12715ba3e323/try-win32-debug/try_xp_ix- > debug_test-mochitest-media-bm109-tests1-windows-build26.txt.gz > > There's definitely MediaSource and MediaSourceDecoder still alive at > shutdown, we are seeing that TaskQueue::BeginShutdown -is- being called for > every MediaDecoderReader though. Apparently that's not enough. MediaSource is cycle collected and so are SourceBuffer which own a strong reference to TrackBuffersManager which keeps a reference to a AutoTaskQueue which will only shutdowns once no one hold a reference to it. Could it be possible that somehow the last cycle collection that would clear the MediaSource and SourceBuffer is somehow only happening after the task queue are being shutdown? This would cause the behaviour described.
(In reply to Jean-Yves Avenard [:jya] from comment #60) > Could it be possible that somehow the last cycle collection that would clear > the MediaSource and SourceBuffer is somehow only happening after the task > queue are being shutdown? The last CC happens after SharedThreadPool::SpinUntilEmpty(). Therefore, no matter how many times it spins, SourceBuffer will never be CCed. I think need something like MediaShutdownManager to explicitly shut down task queues before SharedThreadPool::SpinUntilEmpty() starts to spin.
(In reply to Jean-Yves Avenard [:jya] from comment #59) > can you reproduce the issue locally or only on try? Only on try. > Why would this hang be only on Windows XP? I guess it is about the timing of GC/CC. If MediaSource/SourceBuffer are CCed before SharedThreadPool::SpinUntilEmpty() starts, AutoTaskQueue will be deleted and TaskQueue will be shut down.
JW, I'm not sure what I'm supposed to see on a try run, I believe I have fixed the issue of the MediaSourceDemuxer::mTaskQueue being kept alive for too long, but all the try runs are crashing still with "Assertion failure: IsInitialized(), at c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\src\gfx\config\gfxFeature.h:80 " I never see the PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::KillHungChildThread(void *)] you had earlier https://treeherder.mozilla.org/#/jobs?repo=try&revision=5362caa2f4b1a8af849a495c458126a46af0eec1 Can I consider the problem fixed then? I have another much cleaner solution on the way that I thought about while riding home.
Flags: needinfo?(jwwang)
Those are bug 1271657.
I think so. I can't find |Alive thread pool: MediaPlayback2| in between SharedThreadPool logs. 04:59:06 INFO - SharedThreadPool::SpinUntilEmpty() -- begin 04:59:06 INFO - SharedThreadPool::SpinUntilEmpty() -- end
Flags: needinfo?(jwwang)
We need to ensure that the MSE TaskQueue gets shutdown as soon as possible and not wait for the MediaSource parent to be destroyed by the cycle collector. XPCOM shutdown will deadlock if any SharedThreadPool are still in use, and it possible for the cycle collector to only occur after xpcom has shutdown. So it's important to ensure mTaskQueue is cleared when the MediaSourceDecoder has been shutdown. This is done by queueing a new DetachTask that will clear mTaskQueue when run. Review commit: https://reviewboard.mozilla.org/r/53018/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/53018/
Attachment #8753121 - Flags: review?(jwwang)
Attachment #8753122 - Flags: review?(jwwang)
P1 let all tasks run until completion, as such we don't need to deal with interrupted tasks anymore. Review commit: https://reviewboard.mozilla.org/r/53020/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/53020/
Comment on attachment 8753121 [details] MozReview Request: Bug 1264694: [MSE] P2. Clear mTaskQueue early when no longer required. r?jwwang https://reviewboard.mozilla.org/r/53018/#review49866 ::: dom/media/mediasource/TrackBuffersManager.cpp:213 (Diff revision 1) > // that hasn't been completed. It is possible that a task didn't get processed > // due to the owning SourceBuffer having shutdown. > // We resolve/reject all pending promises and remove all pending tasks from the > // queue. > void > TrackBuffersManager::CancelAllTasks() Do we need to ensure DetachTask is never canceled here?
Attachment #8753121 - Flags: review?(jwwang) → review+
Comment on attachment 8753122 [details] MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang https://reviewboard.mozilla.org/r/53020/#review49868 ::: dom/media/mediasource/TrackBuffersManager.cpp (Diff revision 1) > MOZ_ASSERT(NS_IsMainThread(), "Must be instanciated on the main thread"); > } > > TrackBuffersManager::~TrackBuffersManager() > { > - CancelAllTasks(); Since CancelAllTasks() is removed, can we assert no tasks are left here?
Attachment #8753122 - Flags: review?(jwwang) → review+
Assignee: jwwang → jyavenard
Comment on attachment 8753122 [details] MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53020/diff/1-2/
Backed out for frequent crash in mediasource-append-buffer.html [@ mozilla::TrackBuffersManager::ProcessTasks()] on OSX 10.10 after hitting its new assertion. https://hg.mozilla.org/integration/mozilla-inbound/rev/98eef4a11e96 https://hg.mozilla.org/integration/mozilla-inbound/rev/659a951ea9ee Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=14b8c2f4764f777109e78c289bdf329d1d83321a Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=27993011&repo=mozilla-inbound 21:43:21 INFO - TEST-START | /media-source/mediasource-append-buffer.html 21:43:21 INFO - PROCESS | 1670 | Assertion failure: mQueue.Length() == 0 (Detach task must be the last), at /builds/slave/m-in-m64-000000000000000000000/build/src/dom/media/mediasource/TrackBuffersManager.cpp:197 21:43:21 INFO - PROCESS | 1670 | #01: _ZN7mozilla9TaskQueue6Runner3RunEv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xbe7a9] 21:43:21 INFO - PROCESS | 1670 | #02: _ZN12nsThreadPool3RunEv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xc6174] 21:43:21 INFO - PROCESS | 1670 | #03: _ZThn8_N12nsThreadPool3RunEv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xc62bd] 21:43:21 INFO - PROCESS | 1670 | #04: _ZN8nsThread16ProcessNextEventEbPb[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xc4057] 21:43:21 INFO - PROCESS | 1670 | #05: _Z19NS_ProcessNextEventP9nsIThreadb[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xf1fc3] 21:43:21 INFO - PROCESS | 1670 | #06: _ZN7mozilla3ipc28MessagePumpForNonMainThreads3RunEPN4base11MessagePump8DelegateE[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0x64235b] 21:43:21 INFO - PROCESS | 1670 | #07: _ZN11MessageLoop3RunEv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0x5d9c29] 21:43:21 INFO - PROCESS | 1670 | #08: _ZN8nsThread10ThreadFuncEPv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xc28af] 21:43:21 INFO - PROCESS | 1670 | #09: _pt_root[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/libnss3.dylib +0x23938a] 21:43:21 INFO - PROCESS | 1670 | #10: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x405a] 21:43:21 INFO - PROCESS | 1670 | #11: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x3fd7] 21:43:22 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/RURqyEsSR9GRwKzOOzbgBA/artifacts/public/build/firefox-49.0a1.en-US.mac.crashreporter-symbols.zip 21:43:31 INFO - mozcrash Copy/paste: /builds/slave/test/build/macosx64-minidump_stackwalk /var/folders/qj/gk52w8cx1cd0gbstqffnnfqm00000w/T/tmp3ZY5xs.mozrunner/minidumps/0FC56B0A-5CB7-4744-BAC9-D01065183935.dmp /var/folders/qj/gk52w8cx1cd0gbstqffnnfqm00000w/T/tmpF2Z9UQ 21:43:42 INFO - mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/0FC56B0A-5CB7-4744-BAC9-D01065183935.dmp 21:43:42 INFO - mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/0FC56B0A-5CB7-4744-BAC9-D01065183935.extra 21:43:43 INFO - PROCESS-CRASH | /media-source/mediasource-append-buffer.html | application crashed [@ mozilla::TrackBuffersManager::ProcessTasks()]
Flags: needinfo?(jyavenard)
This ensures that the tasks are processed in the expected order. Review commit: https://reviewboard.mozilla.org/r/53160/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/53160/
Attachment #8753121 - Attachment description: MozReview Request: Bug 1264694: [MSE] P1. Clear mTaskQueue early when no longer required. r?jwwang → MozReview Request: Bug 1264694: [MSE] P2. Clear mTaskQueue early when no longer required. r?jwwang
Attachment #8753122 - Attachment description: MozReview Request: Bug 1264694: [MSE] P2. Remove no longer necessay methods. r?jwwang → MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang
Attachment #8753285 - Flags: review?(jwwang)
Comment on attachment 8753121 [details] MozReview Request: Bug 1264694: [MSE] P2. Clear mTaskQueue early when no longer required. r?jwwang Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53018/diff/1-2/
Comment on attachment 8753122 [details] MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53020/diff/2-3/
Attachment #8753285 - Flags: review?(jwwang) → review+
Comment on attachment 8753285 [details] MozReview Request: Bug 1264694: [MSE] P1. Ensure we only add source buffer tasks on the task queue. r?jwwang https://reviewboard.mozilla.org/r/53160/#review49952
Comment on attachment 8753285 [details] MozReview Request: Bug 1264694: [MSE] P1. Ensure we only add source buffer tasks on the task queue. r?jwwang Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53160/diff/1-2/
Comment on attachment 8753121 [details] MozReview Request: Bug 1264694: [MSE] P2. Clear mTaskQueue early when no longer required. r?jwwang Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53018/diff/2-3/
Comment on attachment 8753122 [details] MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53020/diff/3-4/
Flags: needinfo?(jyavenard)
Blocks: 1268559
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Comment on attachment 8753285 [details] MozReview Request: Bug 1264694: [MSE] P1. Ensure we only add source buffer tasks on the task queue. r?jwwang Approval Request Comment [Feature/regressing bug #]:1264694 [User impact if declined]: shutdown hang / crash [Describe test coverage new/current, TreeHerder]: in central, not reported since landed [Risks and why]: low, objects are properly shutdown when needed. [String/UUID change made/needed]: none
Attachment #8753285 - Flags: approval-mozilla-aurora?
Request is for all patches
Comment on attachment 8753285 [details] MozReview Request: Bug 1264694: [MSE] P1. Ensure we only add source buffer tasks on the task queue. r?jwwang Fixes to intermittent failures and shutdown hangs/crashes, Aurora48+
Attachment #8753285 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Depends on: 1407940
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: