Closed Bug 1264694 Opened 8 years ago Closed 8 years ago

Frequent WinXP debug e10s test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] on Ash

Categories

(Core :: Audio/Video: Playback, defect)

Unspecified
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
e10s + ---
firefox48 --- fixed
firefox49 --- fixed

People

(Reporter: RyanVM, Assigned: jya)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: crash, intermittent-failure, Whiteboard: [e10s-orangeblockers])

Attachments

(5 files)

I think we have some other known media shutdown issues right now, so this may be a dupe. Maybe bug 1245574 is related?

https://treeherder.mozilla.org/logviewer.html#?job_id=166879&repo=ash

06:22:25     INFO -  37 INFO TEST-START | Shutdown
06:22:25     INFO -  38 INFO Passed:  96
06:22:25     INFO -  39 INFO Failed:  0
06:22:25     INFO -  40 INFO Todo:    1
06:22:25     INFO -  41 INFO Mode:    e10s
06:22:25     INFO -  42 INFO Slowest: 4873ms - /tests/dom/media/mediasource/test/test_SplitAppendDelay.html
06:22:25     INFO -  43 INFO SimpleTest FINISHED
06:22:25     INFO -  44 INFO TEST-INFO | Ran 1 Loops
06:22:25     INFO -  45 INFO SimpleTest FINISHED
06:22:25  WARNING -  TEST-UNEXPECTED-FAIL | dom/media/mediasource/test/test_TruncatedDuration.html | application terminated with exit code 1
06:22:25     INFO -  runtests.py | Application ran for: 0:01:38.093000
06:22:25     INFO -  zombiecheck | Reading PID log: c:\docume~1\cltbld~1.t-x\locals~1\temp\tmprbnmbxpidlog
06:22:25     INFO -  ==> process 824 launched child process 2152 ("C:\slave\test\build\application\firefox\plugin-container.exe" --channel="824.0.934698618\1414353303" -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -sandbox -appdir "C:\slave\test\build\application\firefox\browser"  824 "\\.\pipe\gecko-crash-server-pipe.824" tab)
06:22:25     INFO -  ==> process 824 launched child process 2176 ("C:\slave\test\build\application\firefox\plugin-container.exe" --channel="824.5.1207940992\1022539989" -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -sandbox -appdir "C:\slave\test\build\application\firefox\browser"  824 "\\.\pipe\gecko-crash-server-pipe.824" tab)
06:22:25     INFO -  zombiecheck | Checking for orphan process with PID: 2152
06:22:25     INFO -  zombiecheck | Checking for orphan process with PID: 2176
06:22:25     INFO -  mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\docume~1\cltbld~1.t-x\locals~1\temp\tmp4ssegv.mozrunner\minidumps\77a012f8-ac59-4f94-9a98-0daa12c45fe2.dmp C:\slave\test\build\symbols
06:22:39     INFO -  mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\77a012f8-ac59-4f94-9a98-0daa12c45fe2.dmp
06:22:39     INFO -  mozcrash Saved app info as C:\slave\test\build\blobber_upload_dir\77a012f8-ac59-4f94-9a98-0daa12c45fe2.extra
06:22:39  WARNING -  PROCESS-CRASH | dom/media/mediasource/test/test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog]
06:22:39     INFO -  Crash dump filename: c:\docume~1\cltbld~1.t-x\locals~1\temp\tmp4ssegv.mozrunner\minidumps\77a012f8-ac59-4f94-9a98-0daa12c45fe2.dmp
06:22:39     INFO -  Operating system: Windows NT
06:22:39     INFO -                    5.1.2600 Service Pack 3
06:22:39     INFO -  CPU: x86
06:22:39     INFO -       GenuineIntel family 6 model 30 stepping 5
06:22:39     INFO -       8 CPUs
06:22:39     INFO -  Crash reason:  EXCEPTION_BREAKPOINT
06:22:39     INFO -  Crash address: 0x58284f7
06:22:39     INFO -  Process uptime: 98 seconds
06:22:39     INFO -  Thread 41 (crashed)
06:22:39     INFO -   0  xul.dll!mozilla::`anonymous namespace'::RunWatchdog [nsTerminator.cpp:91115264629d : 158 + 0x22]
06:22:39     INFO -      eip = 0x058284f7   esp = 0x19bfff44   ebp = 0x19bfff4c   ebx = 0x00000000
06:22:39     INFO -      esi = 0x0000009e   edi = 0x0000003f   eax = 0x07170998   ecx = 0x0052705d
06:22:39     INFO -      edx = 0x00576140   efl = 0x00000206
06:22:39     INFO -      Found by: given as instruction pointer in context
06:22:39     INFO -   1  nss3.dll!_PR_NativeRunThread [pruthr.c:91115264629d : 397 + 0x6]
06:22:39     INFO -      eip = 0x00bffd72   esp = 0x19bfff54   ebp = 0x19bfff6c
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   2  nss3.dll!pr_root [w95thred.c:91115264629d : 95 + 0xa]
06:22:39     INFO -      eip = 0x00bf3d56   esp = 0x19bfff74   ebp = 0x19bfff78
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   3  ucrtbase.dll!_crt_at_quick_exit + 0x104
06:22:39     INFO -      eip = 0x005262a4   esp = 0x19bfff80   ebp = 0x19bfffb4
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   4  kernel32.dll!BaseThreadStart + 0x37
06:22:39     INFO -      eip = 0x7c80b713   esp = 0x19bfffbc   ebp = 0x19bfffec
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  Thread 0
06:22:39     INFO -   0  ntdll.dll!KiFastSystemCallRet + 0x0
06:22:39     INFO -      eip = 0x7c90e4f4   esp = 0x0012f92c   ebp = 0x0012f990   ebx = 0x0090f9ec
06:22:39     INFO -      esi = 0x00000710   edi = 0x00000000   eax = 0x00002500   ecx = 0x0012f738
06:22:39     INFO -      edx = 0x00002525   efl = 0x00000246
06:22:39     INFO -      Found by: given as instruction pointer in context
06:22:39     INFO -   1  ntdll.dll!ZwWaitForSingleObject + 0xc
06:22:39     INFO -      eip = 0x7c90df3c   esp = 0x0012f930   ebp = 0x0012f990
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   2  kernel32.dll!WaitForSingleObjectEx + 0x8b
06:22:39     INFO -      eip = 0x7c8025db   esp = 0x0012f934   ebp = 0x0012f990
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   3  kernel32.dll!WaitForSingleObject + 0x12
06:22:39     INFO -      eip = 0x7c802542   esp = 0x0012f998   ebp = 0x0012f9a4
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   4  nss3.dll!_PR_MD_WAIT_CV [w95cv.c:91115264629d : 248 + 0xf]
06:22:39     INFO -      eip = 0x00bf1ddf   esp = 0x0012f9ac   ebp = 0x0012f9c0
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   5  nss3.dll!_PR_WaitCondVar [prucv.c:91115264629d : 172 + 0x1a]
06:22:39     INFO -      eip = 0x00bfe654   esp = 0x0012f9c8   ebp = 0x0012f9e0
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   6  nss3.dll!PR_WaitCondVar [prucv.c:91115264629d : 525 + 0xb]
06:22:39     INFO -      eip = 0x00bfe1c9   esp = 0x0012f9e8   ebp = 0x0012fa04
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   7  xul.dll!mozilla::CondVar::Wait(unsigned int) [BlockingResourceBase.cpp:91115264629d : 501 + 0x9]
06:22:39     INFO -      eip = 0x0338c9ee   esp = 0x0012fa0c   ebp = 0x0012fa24
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   8  xul.dll!nsEventQueue::GetEvent(bool,nsIRunnable * *,mozilla::BaseAutoLock<mozilla::Mutex> &) [nsEventQueue.cpp:91115264629d : 55 + 0xa]
06:22:39     INFO -      eip = 0x03362945   esp = 0x0012fa2c   ebp = 0x0012fa38
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -   9  xul.dll!nsThread::ProcessNextEvent(bool,bool *) [nsThread.cpp:91115264629d : 984 + 0x23]
06:22:39     INFO -      eip = 0x03365421   esp = 0x0012fa40   ebp = 0x0012fb30
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  10  xul.dll!NS_ProcessNextEvent(nsIThread *,bool) [nsThreadUtils.cpp:91115264629d : 290 + 0xd]
06:22:39     INFO -      eip = 0x03394b46   esp = 0x0012fb38   ebp = 0x0012fb44
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  11  xul.dll!mozilla::layers::CompositorBridgeParent::ShutDown() [CompositorBridgeParent.cpp:91115264629d : 648 + 0x9]
06:22:39     INFO -      eip = 0x03e27a0e   esp = 0x0012fb4c   ebp = 0x0012fb80
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  12  xul.dll!mozilla::ShutdownXPCOM(nsIServiceManager *) [XPCOMInit.cpp:91115264629d : 872 + 0x5]
06:22:39     INFO -      eip = 0x0338c0ca   esp = 0x0012fb5c   ebp = 0x0012fb80
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  13  xul.dll!ScopedXPCOMStartup::~ScopedXPCOMStartup() [nsAppRunner.cpp:91115264629d : 1466 + 0x7]
06:22:39     INFO -      eip = 0x058346b0   esp = 0x0012fb88   ebp = 0x0012fba0
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  14  xul.dll!mozilla::DefaultDelete<ScopedXPCOMStartup>::operator()(ScopedXPCOMStartup *) [UniquePtr.h:91115264629d : 528 + 0xe]
06:22:39     INFO -      eip = 0x05834bcc   esp = 0x0012fba8   ebp = 0x0012fba8
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  15  xul.dll!XREMain::XRE_main(int,char * * const,nsXREAppData const *) [nsAppRunner.cpp:91115264629d : 4463 + 0x12]
06:22:39     INFO -      eip = 0x0583a26d   esp = 0x0012fbb0   ebp = 0x0012fbd0
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  16  xul.dll!XRE_main [nsAppRunner.cpp:91115264629d : 4543 + 0x12]
06:22:39     INFO -      eip = 0x0583cd95   esp = 0x0012fbd8   ebp = 0x0012fce8
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  17  firefox.exe!do_main [nsBrowserApp.cpp:91115264629d : 220 + 0x1c]
06:22:39     INFO -      eip = 0x00402884   esp = 0x0012fcf0   ebp = 0x0012fe88
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  18  firefox.exe!NS_internal_main(int,char * *,char * *) [nsBrowserApp.cpp:91115264629d : 360 + 0xf]
06:22:39     INFO -      eip = 0x00402180   esp = 0x0012fe90   ebp = 0x0012ff34
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  19  firefox.exe!wmain [nsWindowsWMain.cpp:91115264629d : 135 + 0xe]
06:22:39     INFO -      eip = 0x00402cd8   esp = 0x0012ff3c   ebp = 0x0012ff74
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  20  firefox.exe!__scrt_common_main_seh [exe_common.inl : 264 + 0x1d]
06:22:39     INFO -      eip = 0x004055ba   esp = 0x0012ff7c   ebp = 0x0012ffc0
06:22:39     INFO -      Found by: call frame info
06:22:39     INFO -  21  kernel32.dll!BaseProcessStart + 0x23
06:22:39     INFO -      eip = 0x7c817067   esp = 0x0012ffc8   ebp = 0x0012fff0
06:22:39     INFO -      Found by: call frame info
Oh, this is probably a dupe of bug 1261571, which also spiked around the same time. Many of the tests in the mediasource directory are skipped on XP due to lack of MP4 support.
See Also: → 1261571
Rather, bug 1264082 is the one that has spiked in a similar timeframe.
See Also: → 1264082
Anthony/Nical - looks like playback and/or gfx (times out while trying to shutdown Compositor):

06:22:39     INFO -  11  xul.dll!mozilla::layers::CompositorBridgeParent::ShutDown() [CompositorBridgeParent.cpp:91115264629d : 648 + 0x9]
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(ajones)
Fixing bug 1262898 would certainly help a lot here.

Basically there are CompositorThreadHolders that are not destroyed in time, and CompositorParent::ShutDown spins the event loop  until all COmpositorThreadHolders are gone. CompositorThreadHolders are typically kept by top-level IPDL protocols like CompositorBridgeParent, ImageBridgeParent, etc. which need the compositor thread to be alive to operate. Ensuring that these get shut down in time should fix this (in the case of the two protocols I mentioned, we need their ActorDestroy hook to run, which happens when the channel is closed, either because the other side closed it properly, or because the child process is killed).
Depends on: 1262898
Flags: needinfo?(nical.bugzilla)
Bug 1261571 also waiting CompositorBridgeParent::ShutDown(). But it is on linux.
It's actually worse now. We've got an additional |PROCESS-CRASH | dom/media/tests/mochitest/test_zmedia_cleanup.html | application crashed [@ RtlpWaitForCriticalSection + 0x5b]| as well these days.
The frequency does appear to be lower now, however. It's failing ~30% instead of closer to 90% of the time.
See Also: → 1268332
(In reply to Ryan VanderMeulen [:RyanVM] from comment #13)
> It's actually worse now. We've got an additional |PROCESS-CRASH |
> dom/media/tests/mochitest/test_zmedia_cleanup.html | application crashed [@
> RtlpWaitForCriticalSection + 0x5b]| as well these days.

I've spun this off to bug 1268332.
Summary: Nearly permafail WinXP debug test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] on Ash → Frequent WinXP debug e10s test_TruncatedDuration.html | application crashed [@ mozilla::`anonymous namespace'::RunWatchdog] on Ash
Whiteboard: [e10s-orangeblockers]
Blake - does JW have time to look into this?
Flags: needinfo?(bwu)
In my queue.
Assignee: nobody → jwwang
Flags: needinfo?(bwu)
I've done try pushes with a lot of logging, it's not entirely clear what's going on here. We really need to get the stack of the child process when it dies. See also: https://treeherder.mozilla.org/logviewer.html#?job_id=20537210&repo=try#L6273
Depends on: 1270172
I've created some code that will crash the child process when it's hung. We can see that this is indeed executing properly. The main process no exits successfully after 30 seconds of the child process being hung, and we're getting stack traces for the child process. I can't pretend to understand yet what's going on inside the child process:

https://treeherder.mozilla.org/logviewer.html#?job_id=20691428&repo=try#L6387
(In reply to Bas Schouten (:bas.schouten) from comment #20)
> I've created some code that will crash the child process when it's hung. We
> can see that this is indeed executing properly. The main process no exits
> successfully after 30 seconds of the child process being hung, and we're
> getting stack traces for the child process. I can't pretend to understand
> yet what's going on inside the child process:
> 
> https://treeherder.mozilla.org/logviewer.html#?job_id=20691428&repo=try#L6387

Thanks Bas!

So the crash seems to indicate that the SharedThreadPool queue is not empty, and shutdown is stalled waiting for it.  Almost certainly this means something is deadlocked or hung elsewhere, causing the shared pool to never empty.  

Threads 18-29 look suspicious: xul.dll!nsThreadPool::Run() which calls CondVar::Wait()

Any good ideas on how to figure out what's blocking SharedPool shutdown?  Perhaps there are some ways to notice/debug "hung" sharedpool states?  Or to dump out which-thread-owns-which-named-mutex at crash-reporter time?
Flags: needinfo?(nfroyd)
(In reply to Randell Jesup [:jesup] from comment #21)
> (In reply to Bas Schouten (:bas.schouten) from comment #20)
> > I've created some code that will crash the child process when it's hung. We
> > can see that this is indeed executing properly. The main process no exits
> > successfully after 30 seconds of the child process being hung, and we're
> > getting stack traces for the child process. I can't pretend to understand
> > yet what's going on inside the child process:
> > 
> > https://treeherder.mozilla.org/logviewer.html#?job_id=20691428&repo=try#L6387
> 
> Thanks Bas!

I've been looking at this a little more as well, the only things that use SharedThreadPools seem to be in Media code. These 'pools' are supposed to be released explicitly. I'm currently looking at the Media code to figure out what could be going wrong. Maybe someone from media land has ideas.
> 
> So the crash seems to indicate that the SharedThreadPool queue is not empty,
> and shutdown is stalled waiting for it.  Almost certainly this means
> something is deadlocked or hung elsewhere, causing the shared pool to never
> empty.  
> 
> Threads 18-29 look suspicious: xul.dll!nsThreadPool::Run() which calls
> CondVar::Wait()
> 
> Any good ideas on how to figure out what's blocking SharedPool shutdown? 
> Perhaps there are some ways to notice/debug "hung" sharedpool states?  Or to
> dump out which-thread-owns-which-named-mutex at crash-reporter time?
Flags: needinfo?(cpearce)
The SharedThreadPools used by webrtc are currently only for the VideoFrameConverter in MediaPipeline, and I doubt that's involved here.
The most suspicious of the two users would appear to be AsyncCubebTask::EnsureThread, I've moved the shutdown clearing there to an earlier phase of shutdown for a try push, and I've done another try push where I simply do some more logging. Hopefully that will give us the information of roughly where the culprit is.
(In reply to Randell Jesup [:jesup] from comment #23)
> The SharedThreadPools used by webrtc are currently only for the
> VideoFrameConverter in MediaPipeline, and I doubt that's involved here.

Yeah I didn't add the WebRTC one in my logging assuming it wouldn't be relevant. I added logging to the two other users that seem like they'd be relevant:

https://dxr.mozilla.org/mozilla-central/search?q=SharedThreadPool%3A%3AGet&redirect=false&case=true
(In reply to Randell Jesup [:jesup] from comment #21)
> Any good ideas on how to figure out what's blocking SharedPool shutdown? 
> Perhaps there are some ways to notice/debug "hung" sharedpool states?  Or to
> dump out which-thread-owns-which-named-mutex at crash-reporter time?

Somebody's holding a ref to the SharedThreadPool; perhaps refcnt logging on SharedThreadPool would tell you who's hanging onto the pool for longer than they should be.  Or perhaps the somebody who's supposed to release the last shared ref at xpcom-shutdown is a later observer than the SharedThreadPool, which means deadlock.
Flags: needinfo?(nfroyd)
So I have two theories here:

1. There's a theoretical race condition where AsyncCubebTask::EnsureThread gets called off the MainThread after Mozilla::KillClearShutdown is called for the ShutdownThreads phase.
2. There's cycle collected objects holding on to a MediaTimer object that's holding on to the SharedThreadPool.
(In reply to Bas Schouten (:bas.schouten) from comment #27)
> So I have two theories here:
> 
> 1. There's a theoretical race condition where AsyncCubebTask::EnsureThread
> gets called off the MainThread after Mozilla::KillClearShutdown is called
> for the ShutdownThreads phase.
> 2. There's cycle collected objects holding on to a MediaTimer object that's
> holding on to the SharedThreadPool.

For the record, I'm pushing logging additions to figure out which one it could be.
[Child 3440] WARNING: NS_ENSURE_TRUE(context) failed: file c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/xpcom/threads/nsThread.cpp, line 802

This is also in the log which seems somewhat suspicious as it could cause events on a thread to go unprocessed, however I see these a lot in our logs and it seems to be mostly innocent.
(In reply to Bas Schouten (:bas.schouten) from comment #29)
> [Child 3440] WARNING: NS_ENSURE_TRUE(context) failed: file
> c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/xpcom/threads/
> nsThread.cpp, line 802
> 
> This is also in the log which seems somewhat suspicious as it could cause
> events on a thread to go unprocessed, however I see these a lot in our logs
> and it seems to be mostly innocent.

That's a symptom of ShutdownInternal() early-exiting because (likely) a thread is trying to call Shutdown() on itself, which is a no-no
https://hg.mozilla.org/mozilla-central/file/3461f3cae78495f100a0f7d3d2e0b89292d3ec02/xpcom/threads/SharedThreadPool.cpp#l146

It looks like it is safe to release the last ref count on the task queue thread.

Btw, as  Nathan said, whenever you leak a SharedThreadPool, you will have shutdown hang in SharedThreadPool::SpinUntilEmpty().
(In reply to JW Wang [:jwwang] from comment #31)
> https://hg.mozilla.org/mozilla-central/file/
> 3461f3cae78495f100a0f7d3d2e0b89292d3ec02/xpcom/threads/SharedThreadPool.
> cpp#l146
> 
> It looks like it is safe to release the last ref count on the task queue
> thread.
> 
> Btw, as  Nathan said, whenever you leak a SharedThreadPool, you will have
> shutdown hang in SharedThreadPool::SpinUntilEmpty().

Right.. the key is to figure out who's leaking it.. well, media code somehow most likely since they're the only users.. but how :).

http://archive.mozilla.org/pub/firefox/try-builds/mozci-bot@mozilla.com-ea8d4359c8383ceedb2ab28b37522a5c2b13dcba/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm110-tests1-windows-build22.txt.gz

Seems to indicate there's no alive MediaTimers at least when the hang occurs.
There seems to be one bug here at least, although at this point I'm not certain that is the cause here:

  while (!IsEmpty()) {
    sMonitor->AssertNotCurrentThreadIn();
    NS_ProcessNextEvent(NS_GetCurrentThread(), true);
  }

This is passing aMayWait as true, which seems fine, but it's possible for the last SharedThreadPool to be destroyed -off- the main thread without an event ever being processed by the main thread as far as I can tell. Which means it would just sit here without ever re-checking the pool size. I have tests pushed to see if this is the case.
So here's a log that gives us two things:

http://archive.mozilla.org/pub/firefox/try-builds/mozci-bot@mozilla.com-70ef09533913577b0e92409e904c54f4b2611f47/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm126-tests1-windows-build22.txt.gz

Look around 14:10:03.

1. There's the problem I illustrated above, I added code to report every second what pools were still alive. But there's obvious breaks of several seconds during which SpinUntilEmpty will just sit there and not do anything. That's probably something we should address but not the underlying problem.
2. More importantly the problematic thread pool is a MediaThreadType::PLATFORM_DECODER, which seems to be used by TaskQueue or FlushableTaskQueue.. I'll continue digging into this but it would be really great if someone from the media team actually chimed in here to help a little bit.
As was seen elsewhere, TaskQueue can't release it's last reference from itself generally; not sure if that's involved here.

Bas: thanks for tracking it down to a Platform Decoder pool!

Adding jya
Flags: needinfo?(jwwang)
(In reply to Randell Jesup [:jesup] from comment #35)
> As was seen elsewhere, TaskQueue can't release it's last reference from
> itself generally; not sure if that's involved here.
> 
> Bas: thanks for tracking it down to a Platform Decoder pool!
> 
> Adding jya

I stand corrected, I copy-paste failed. It's the PLAYBACK pool!
Flags: needinfo?(rjesup)
I wonder if something that is Cycle Collected is holding on to a Task Queue, since ShutdownThreads occurs way before final cycle collection does.
(In reply to Bas Schouten (:bas.schouten) from comment #33)
> There seems to be one bug here at least, although at this point I'm not
> certain that is the cause here:
> 
>   while (!IsEmpty()) {
>     sMonitor->AssertNotCurrentThreadIn();
>     NS_ProcessNextEvent(NS_GetCurrentThread(), true);
>   }
> 
> This is passing aMayWait as true, which seems fine, but it's possible for
> the last SharedThreadPool to be destroyed -off- the main thread without an
> event ever being processed by the main thread as far as I can tell. Which
> means it would just sit here without ever re-checking the pool size. I have
> tests pushed to see if this is the case.

SharedThreadPool::Release(void)
{
  sPools->Remove(mName);
  NS_DispatchToMainThread(NewRunnableMethod(mPool, &nsIThreadPool::Shutdown));
}

It posts an event the the main thread after removing the string from the pool. So the situation you described will not happen.
(In reply to Randell Jesup [:jesup] from comment #35)
> As was seen elsewhere, TaskQueue can't release it's last reference from
> itself generally; not sure if that's involved here.

per comment 31:

SharedThreadPool::Release(void)
{
  NS_DispatchToMainThread(NewRunnableMethod(mPool, &nsIThreadPool::Shutdown));
}

The underlying mPool is always shut down in the main thread. So I think it is safe to release the last ref count from the task queue thread IIUC.
Flags: needinfo?(jwwang)
(In reply to JW Wang [:jwwang] from comment #39)
> (In reply to Randell Jesup [:jesup] from comment #35)
> > As was seen elsewhere, TaskQueue can't release it's last reference from
> > itself generally; not sure if that's involved here.
> 
> per comment 31:
> 
> SharedThreadPool::Release(void)
> {
>   NS_DispatchToMainThread(NewRunnableMethod(mPool,
> &nsIThreadPool::Shutdown));
> }
> 
> The underlying mPool is always shut down in the main thread. So I think it
> is safe to release the last ref count from the task queue thread IIUC.

Ah, yep, you're right, still, that should probably be documented, sort of something that's easily changed without going back to change this. In any case, as I said, not the problem here anyway. The problem here is just some media object staying alive holding on to a taskqueue. I've done an additional push to figure out what type of object it is.
I think the problem is whenever you leak a SharedThreadPool, you will have shutdown hang in SharedThreadPool::SpinUntilEmpty() which makes it misleading and hard to debug leaks.

I am thinking can we just remove SharedThreadPool::SpinUntilEmpty() and depend on the client code of SharedThreadPool to finish all its jobs before xpcom-shutdown. A class as MediaShutdownManager should definitely ease the pain of management. However, it is still up to the client code to ensure the correct shutdown sequence.

Hi Chris,
What do you think about it?
(In reply to JW Wang [:jwwang] from comment #41)
> I think the problem is whenever you leak a SharedThreadPool, you will have
> shutdown hang in SharedThreadPool::SpinUntilEmpty() which makes it
> misleading and hard to debug leaks.
> 
> I am thinking can we just remove SharedThreadPool::SpinUntilEmpty() and
> depend on the client code of SharedThreadPool to finish all its jobs before
> xpcom-shutdown. A class as MediaShutdownManager should definitely ease the
> pain of management. However, it is still up to the client code to ensure the
> correct shutdown sequence.
> 
> Hi Chris,
> What do you think about it?

So well, I still think we should address the actual issues here. Maybe timing out in SpinUntilEmpty and logging what got leaked is a useful thing to do.

In any case, we are leaking a MediaDecoderReader element. Could that be due to cycle collection?
Flags: needinfo?(jyavenard)
MediaDecoderReaders aren't cycle collected.
Flags: needinfo?(jyavenard)
(In reply to Jean-Yves Avenard [:jya] from comment #43)
> MediaDecoderReaders aren't cycle collected.

So any idea why they may still be alive during the thread shutdown phase?
Flags: needinfo?(jyavenard)
(In reply to Bas Schouten (:bas.schouten) from comment #42)
> So well, I still think we should address the actual issues here. Maybe
> timing out in SpinUntilEmpty and logging what got leaked is a useful thing
> to do.
We should. What I propose is turn a shutdown hang into leaks which are easier to tackle.
(In reply to Bas Schouten (:bas.schouten) from comment #44)
> So any idea why they may still be alive during the thread shutdown phase?
Which log shows it is still alive?
(In reply to JW Wang [:jwwang] from comment #46)
> (In reply to Bas Schouten (:bas.schouten) from comment #44)
> > So any idea why they may still be alive during the thread shutdown phase?
> Which log shows it is still alive?

http://archive.mozilla.org/pub/firefox/try-builds/mozci-bot@mozilla.com-ad4d795669930d1b3d8b93f86b8dee56cea5a678/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm126-tests1-windows-build24.txt.gz
18:09:56     INFO -  SharedThreadPool::SpinUntilEmpty waiting for ThreadPools:
18:09:56     INFO -  Alive thread pool: MediaPlayback
18:09:56     INFO -  Alive MediaSourceDemuxers: 0
18:09:56     INFO -  Alive Benchmarks: 0
18:09:56     INFO -  Alive MediaDecoderReaders: 3
18:09:56     INFO -  Alive MediaDecoderStateMachines: 0
It is interesting that no MediaDecoderStateMachine alive. The only thing I can think of to keep MediaDecoderReader alive is MediaDecodeTask in webaudio.
Don't know why it could kept alive.

Seeing that this is showing up with MSE.

If MediaSourceDemuxers count is 0, then we know the MediaFormatReader::Shutdown was called (as it's the only place that kills the demuxers)

Is there a way to see if there are any MediaSourceDecoder (or MediaDecoder for that matter) alive. The MediaSourceDecoder is the only object with a strong reference to the MediaFormatReader when running this mochitest (that and the MediaDecoderStateMachine but your logs show that there are none left).
MediaSource also holds a strong reference to the MediaSourceDecoder.

MediaSource objects are cycle collected.
Flags: needinfo?(jyavenard)
(In reply to Jean-Yves Avenard [:jya] from comment #51)
> MediaSource also holds a strong reference to the MediaSourceDecoder.
> 
> MediaSource objects are cycle collected.

I've added logging for MediaSourceDecoder and MediaDecodeTask. Let's see.
can you include MediaSource too ?
(In reply to Jean-Yves Avenard [:jya] from comment #53)
> can you include MediaSource too ?

Done.
http://archive.mozilla.org/pub/firefox/try-builds/bschouten@mozilla.com-5fbee8b60fb7ab386f27c54a825c12715ba3e323/try-win32-debug/try_xp_ix-debug_test-mochitest-media-bm109-tests1-windows-build26.txt.gz

There's definitely MediaSource and MediaSourceDecoder still alive at shutdown, we are seeing that TaskQueue::BeginShutdown -is- being called for every MediaDecoderReader though. Apparently that's not enough.
I found a way to repro the hang:
1. apply part1_forece_e10s_tests.patch and part2_debug_leaks.patch
2. push to try with "try: -b d -p win32 -u mochitest-media[Windows XP] -t none[Windows XP]"

result:
https://treeherder.mozilla.org/logviewer.html#?job_id=20852760&repo=try#L2856

logs:
SharedThreadPool::SpinUntilEmpty() -- begin
Alive thread pool: MediaPlayback2
Hit MOZ_CRASH() at c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/xpcom/build/XPCOMInit.cpp:835

MediaPlayback2 is from https://hg.mozilla.org/try/diff/2fa5d5771c7f/dom/media/mediasource/MediaSourceDemuxer.cpp#l1.13.

It looks like MediaSourceDemuxer::mTaskQueue is not deleted.
Flags: needinfo?(rjesup)
Flags: needinfo?(jyavenard)
Flags: needinfo?(cpearce)
MediaSourceDemuxer::mTaskQueue is shared across MediaSourceDemuxer and TrackBuffersManager

can you reproduce the issue locally or only on try?

Why would this hang be only on Windows XP?
Flags: needinfo?(jyavenard)
(In reply to Bas Schouten (:bas.schouten) from comment #55)
> http://archive.mozilla.org/pub/firefox/try-builds/bschouten@mozilla.com-
> 5fbee8b60fb7ab386f27c54a825c12715ba3e323/try-win32-debug/try_xp_ix-
> debug_test-mochitest-media-bm109-tests1-windows-build26.txt.gz
> 
> There's definitely MediaSource and MediaSourceDecoder still alive at
> shutdown, we are seeing that TaskQueue::BeginShutdown -is- being called for
> every MediaDecoderReader though. Apparently that's not enough.

MediaSource is cycle collected and so are SourceBuffer which own a strong reference to TrackBuffersManager which keeps a reference to a AutoTaskQueue which will only shutdowns once no one hold a reference to it.

Could it be possible that somehow the last cycle collection that would clear the MediaSource and SourceBuffer is somehow only happening after the task queue are being shutdown?

This would cause the behaviour described.
(In reply to Jean-Yves Avenard [:jya] from comment #60)
> Could it be possible that somehow the last cycle collection that would clear
> the MediaSource and SourceBuffer is somehow only happening after the task
> queue are being shutdown?

The last CC happens after SharedThreadPool::SpinUntilEmpty(). Therefore, no matter how many times it spins, SourceBuffer will never be CCed. I think need something like MediaShutdownManager to explicitly shut down task queues before SharedThreadPool::SpinUntilEmpty() starts to spin.
(In reply to Jean-Yves Avenard [:jya] from comment #59)
> can you reproduce the issue locally or only on try?
Only on try.

> Why would this hang be only on Windows XP?
I guess it is about the timing of GC/CC. If MediaSource/SourceBuffer are CCed before SharedThreadPool::SpinUntilEmpty() starts, AutoTaskQueue will be deleted and TaskQueue will be shut down.
JW, I'm not sure what I'm supposed to see on a try run, I believe I have fixed the issue of the MediaSourceDemuxer::mTaskQueue being kept alive for too long, but all the try runs are crashing still with "Assertion failure: IsInitialized(), at c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\src\gfx\config\gfxFeature.h:80 "

I never see the PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::KillHungChildThread(void *)]  you had earlier


https://treeherder.mozilla.org/#/jobs?repo=try&revision=5362caa2f4b1a8af849a495c458126a46af0eec1

Can I consider the problem fixed then?

I have another much cleaner solution on the way that I thought about while riding home.
Flags: needinfo?(jwwang)
Those are bug 1271657.
I think so.

I can't find |Alive thread pool: MediaPlayback2| in between SharedThreadPool logs.
04:59:06     INFO -  SharedThreadPool::SpinUntilEmpty() -- begin
04:59:06     INFO -  SharedThreadPool::SpinUntilEmpty() -- end
Flags: needinfo?(jwwang)
We need to ensure that the MSE TaskQueue gets shutdown as soon as possible and not wait for the MediaSource parent to be destroyed by the cycle collector.
XPCOM shutdown will deadlock if any SharedThreadPool are still in use, and it possible for the cycle collector to only occur after xpcom has shutdown.
So it's important to ensure mTaskQueue is cleared when the MediaSourceDecoder has been shutdown.

This is done by queueing a new DetachTask that will clear mTaskQueue when run.

Review commit: https://reviewboard.mozilla.org/r/53018/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/53018/
Attachment #8753121 - Flags: review?(jwwang)
Attachment #8753122 - Flags: review?(jwwang)
P1 let all tasks run until completion, as such we don't need to deal with interrupted tasks anymore.

Review commit: https://reviewboard.mozilla.org/r/53020/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/53020/
Comment on attachment 8753121 [details]
MozReview Request: Bug 1264694: [MSE] P2. Clear mTaskQueue early when no longer required. r?jwwang

https://reviewboard.mozilla.org/r/53018/#review49866

::: dom/media/mediasource/TrackBuffersManager.cpp:213
(Diff revision 1)
>  // that hasn't been completed. It is possible that a task didn't get processed
>  // due to the owning SourceBuffer having shutdown.
>  // We resolve/reject all pending promises and remove all pending tasks from the
>  // queue.
>  void
>  TrackBuffersManager::CancelAllTasks()

Do we need to ensure DetachTask is never canceled here?
Attachment #8753121 - Flags: review?(jwwang) → review+
Comment on attachment 8753122 [details]
MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang

https://reviewboard.mozilla.org/r/53020/#review49868

::: dom/media/mediasource/TrackBuffersManager.cpp
(Diff revision 1)
>    MOZ_ASSERT(NS_IsMainThread(), "Must be instanciated on the main thread");
>  }
>  
>  TrackBuffersManager::~TrackBuffersManager()
>  {
> -  CancelAllTasks();

Since CancelAllTasks() is removed, can we assert no tasks are left here?
Attachment #8753122 - Flags: review?(jwwang) → review+
Assignee: jwwang → jyavenard
Comment on attachment 8753122 [details]
MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53020/diff/1-2/
Backed out for frequent crash in mediasource-append-buffer.html [@ mozilla::TrackBuffersManager::ProcessTasks()] on OSX 10.10 after hitting its new assertion.

https://hg.mozilla.org/integration/mozilla-inbound/rev/98eef4a11e96
https://hg.mozilla.org/integration/mozilla-inbound/rev/659a951ea9ee

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=14b8c2f4764f777109e78c289bdf329d1d83321a
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=27993011&repo=mozilla-inbound
21:43:21     INFO - TEST-START | /media-source/mediasource-append-buffer.html
21:43:21     INFO - PROCESS | 1670 | Assertion failure: mQueue.Length() == 0 (Detach task must be the last), at /builds/slave/m-in-m64-000000000000000000000/build/src/dom/media/mediasource/TrackBuffersManager.cpp:197
21:43:21     INFO - PROCESS | 1670 | #01: _ZN7mozilla9TaskQueue6Runner3RunEv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xbe7a9]
21:43:21     INFO - PROCESS | 1670 | #02: _ZN12nsThreadPool3RunEv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xc6174]
21:43:21     INFO - PROCESS | 1670 | #03: _ZThn8_N12nsThreadPool3RunEv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xc62bd]
21:43:21     INFO - PROCESS | 1670 | #04: _ZN8nsThread16ProcessNextEventEbPb[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xc4057]
21:43:21     INFO - PROCESS | 1670 | #05: _Z19NS_ProcessNextEventP9nsIThreadb[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xf1fc3]
21:43:21     INFO - PROCESS | 1670 | #06: _ZN7mozilla3ipc28MessagePumpForNonMainThreads3RunEPN4base11MessagePump8DelegateE[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0x64235b]
21:43:21     INFO - PROCESS | 1670 | #07: _ZN11MessageLoop3RunEv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0x5d9c29]
21:43:21     INFO - PROCESS | 1670 | #08: _ZN8nsThread10ThreadFuncEPv[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/XUL +0xc28af]
21:43:21     INFO - PROCESS | 1670 | #09: _pt_root[/builds/slave/test/build/application/Nightly.app/Contents/MacOS/libnss3.dylib +0x23938a]
21:43:21     INFO - PROCESS | 1670 | #10: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x405a]
21:43:21     INFO - PROCESS | 1670 | #11: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x3fd7]
21:43:22     INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/RURqyEsSR9GRwKzOOzbgBA/artifacts/public/build/firefox-49.0a1.en-US.mac.crashreporter-symbols.zip
21:43:31     INFO - mozcrash Copy/paste: /builds/slave/test/build/macosx64-minidump_stackwalk /var/folders/qj/gk52w8cx1cd0gbstqffnnfqm00000w/T/tmp3ZY5xs.mozrunner/minidumps/0FC56B0A-5CB7-4744-BAC9-D01065183935.dmp /var/folders/qj/gk52w8cx1cd0gbstqffnnfqm00000w/T/tmpF2Z9UQ
21:43:42     INFO - mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/0FC56B0A-5CB7-4744-BAC9-D01065183935.dmp
21:43:42     INFO - mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/0FC56B0A-5CB7-4744-BAC9-D01065183935.extra
21:43:43     INFO - PROCESS-CRASH | /media-source/mediasource-append-buffer.html | application crashed [@ mozilla::TrackBuffersManager::ProcessTasks()]
Flags: needinfo?(jyavenard)
This ensures that the tasks are processed in the expected order.

Review commit: https://reviewboard.mozilla.org/r/53160/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/53160/
Attachment #8753121 - Attachment description: MozReview Request: Bug 1264694: [MSE] P1. Clear mTaskQueue early when no longer required. r?jwwang → MozReview Request: Bug 1264694: [MSE] P2. Clear mTaskQueue early when no longer required. r?jwwang
Attachment #8753122 - Attachment description: MozReview Request: Bug 1264694: [MSE] P2. Remove no longer necessay methods. r?jwwang → MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang
Attachment #8753285 - Flags: review?(jwwang)
Comment on attachment 8753121 [details]
MozReview Request: Bug 1264694: [MSE] P2. Clear mTaskQueue early when no longer required. r?jwwang

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53018/diff/1-2/
Comment on attachment 8753122 [details]
MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53020/diff/2-3/
Attachment #8753285 - Flags: review?(jwwang) → review+
Comment on attachment 8753285 [details]
MozReview Request: Bug 1264694: [MSE] P1. Ensure we only add source buffer tasks on the task queue. r?jwwang

https://reviewboard.mozilla.org/r/53160/#review49952
Comment on attachment 8753285 [details]
MozReview Request: Bug 1264694: [MSE] P1. Ensure we only add source buffer tasks on the task queue. r?jwwang

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53160/diff/1-2/
Comment on attachment 8753121 [details]
MozReview Request: Bug 1264694: [MSE] P2. Clear mTaskQueue early when no longer required. r?jwwang

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53018/diff/2-3/
Comment on attachment 8753122 [details]
MozReview Request: Bug 1264694: [MSE] P3. Remove no longer necessay methods. r?jwwang

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53020/diff/3-4/
Flags: needinfo?(jyavenard)
Blocks: 1268559
https://hg.mozilla.org/mozilla-central/rev/1631eb57a6e8
https://hg.mozilla.org/mozilla-central/rev/efd1df69ffc9
https://hg.mozilla.org/mozilla-central/rev/0fe3b306cae5
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Comment on attachment 8753285 [details]
MozReview Request: Bug 1264694: [MSE] P1. Ensure we only add source buffer tasks on the task queue. r?jwwang

Approval Request Comment
[Feature/regressing bug #]:1264694
[User impact if declined]: shutdown hang / crash
[Describe test coverage new/current, TreeHerder]: in central, not reported since landed
[Risks and why]: low, objects are properly shutdown when needed. 
[String/UUID change made/needed]: none
Attachment #8753285 - Flags: approval-mozilla-aurora?
Request is for all patches
Comment on attachment 8753285 [details]
MozReview Request: Bug 1264694: [MSE] P1. Ensure we only add source buffer tasks on the task queue. r?jwwang

Fixes to intermittent failures and shutdown hangs/crashes, Aurora48+
Attachment #8753285 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Depends on: 1407940
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: