Closed
Bug 1428390
Opened 3 years ago
Closed 3 years ago
Risk for shutdown deadlock in CamerasChild
Categories
(Core :: WebRTC: Audio/Video, defect, P2)
Tracking
()
RESOLVED
FIXED
mozilla59
Tracking | Status | |
---|---|---|
firefox59 | --- | fixed |
People
(Reporter: pehrsons, Assigned: pehrsons)
References
Details
Attachments
(2 files)
Background: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9707133eea652e900ce4a1a3078351f93ae4026&selectedJob=154339202 Which gives us these facts: > GECKO(2310) | FATAL ERROR: AsyncShutdown timeout in xpcom-will-shutdown Conditions: [{"name":"Media shutdown: blocking on media thread","state":"(none)","filename":"/builds/worker/workspace/build/src/dom/media/MediaManager.cpp","lineNumber":2020,"stack":"Media shutdown"}] And a stack of the media thread (shortened for readability): > [task 2018-01-05T11:17:05.253Z] 11:17:05 INFO - Thread 23 > [task 2018-01-05T11:17:05.253Z] 11:17:05 INFO - 0 libpthread-2.23.so + 0xd360 > [task 2018-01-05T11:17:05.262Z] 11:17:05 INFO - 1 libxul.so!mozilla::CondVar::Wait [BlockingResourceBase.cpp:a9707133eea6 : 604 + 0x9] > [task 2018-01-05T11:17:05.266Z] 11:17:05 INFO - 2 libxul.so!mozilla::camera::CamerasChild::DispatchToParent [Monitor.h:a9707133eea6 : 40 + 0x14] > [task 2018-01-05T11:17:05.271Z] 11:17:05 INFO - 3 libxul.so!mozilla::camera::LockAndDispatch<int>::LockAndDispatch [CamerasChild.cpp:a9707133eea6 : 244 + 0x5] > [task 2018-01-05T11:17:05.276Z] 11:17:05 INFO - 4 libxul.so!mozilla::camera::CamerasChild::NumberOfCaptureDevices [CamerasChild.cpp:a9707133eea6 : 315 + 0x28] > [task 2018-01-05T11:17:05.280Z] 11:17:05 INFO - 5 libxul.so!mozilla::MediaEngineWebRTC::EnumerateDevices [CamerasChild.h:a9707133eea6 : 136 + 0xb] > [task 2018-01-05T11:17:05.282Z] 11:17:05 INFO - 6 libxul.so!mozilla::GetSources [MediaManager.cpp:a9707133eea6 : 1437 + 0x10] > [task 2018-01-05T11:17:05.284Z] 11:17:05 INFO - 7 libxul.so!mozilla::media::LambdaTask<mozilla::MediaManager::EnumerateRawDevices(uint64_t, mozilla::dom::MediaSourceEnum, mozilla::dom::MediaSourceEnum, bool)::<lambda()> >::Run [MediaManager.cpp:a9707133eea6 : 1835 + 0xb] > [task 2018-01-05T11:17:05.286Z] 11:17:05 INFO - 8 libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:a9707133eea6 : 1040 + 0x15] So the media thread is blocked at [1], while waiting for a reply from CamerasParent. My assumption here is that CamerasParent has already shut down. It probably hit [2]. Reading on in CamerasChild, there is code to recover from this scenario at [3]. This looks ok (modulo the comment "Hopefully prevent us from getting stuck on replies that'll never come."), so I'm going to assume that this was not called. Could that happen? Yes it could most certainly! We know that MediaManager's shutdown blocker is active. It's hooked up to "xpcom-will-shutdown", [4]. IPC could be shut down and release the MediaManager blocker, but IPC is hooked up to "xpcom-shutdown-threads", [5]. It's just that "xpcom-will-shutdown" happens before "xpcom-shutdown-threads", so basically the MediaManager shutdown blocker is stuck blocking indefinitely. See [6]. [1] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasChild.cpp#278 [2] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.cpp#476 [3] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasChild.cpp#714-717 [4] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/MediaManager.cpp#108 [5] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/ipc/glue/BackgroundImpl.cpp#1392 [6] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/xpcom/build/XPCOMInit.cpp#854,861,877
Comment 1•3 years ago
|
||
(In reply to Andreas Pehrson [:pehrsons] from comment #0) > My assumption here is that CamerasParent has already shut down. It probably hit [2]. That's probably where it hit, but I'd expect the CamerasParent to be alive here until IPC is torn down later in "xpcom-shutdown-threads", after the MediaManager blocker. I think the culprit is IsShuttingDown() [1] being too aggressive: bool IsShuttingDown() { return !mChildIsAlive || mDestroyed || !mWebRTCAlive; }; Specifically, I suspect CamerasParent is actually still alive, but mWebRTCAlive is false, causing it to be unresponsive. Further digging reveals mWebRTCAlive is being set to false as early as during WILL_SHUTDOWN, which would explain it: 1. CamerasParent::Observe is a WILL_SHUTDOWN observer which calls StopVideoCapture(), https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.cpp#175,179 2. StopVideoCapture() dispatches a runnable that calls CloseEngines(), https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.cpp#213,216,220 3. CloseEngines() sets mWebRTCAlive = false; which is an atomic, https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.cpp#434 4. which causes IsShuttingDown() to begin to return true [1] 5. which trips up CamerasParent::RecvNumberOfCaptureDevices() and tricks it to return NS_ERROR_FAILURE, 6. which causes it to never respond to the Child, making CamerasParent act as dead effectively before it's officially shut down. 7. ActorDestroy() won't save us, because that won't happen until later in SHUTDOWN_THREADS, so we hang in comment 0. If any of this is caused by new code, we should find a regression range. [1] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.h#109
Updated•3 years ago
|
Rank: 15
Priority: -- → P2
Assignee | ||
Comment 2•3 years ago
|
||
We send an event from the child to the parent when the child shuts down, so the parent can clean up (AllDone, [1], handled at [2]). We also send a message from the parent to the child when the parent is destroyed (__delete__, [3]). However, this is not handled by the child, [4]. Could it be as simple as handling this? [1] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/dom/media/systemservices/PCameras.ipdl#89 [2] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/dom/media/systemservices/CamerasParent.cpp#1032 [3] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/dom/media/systemservices/PCameras.ipdl#70 [4] https://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PCamerasChild.cpp#21
Assignee | ||
Comment 3•3 years ago
|
||
(In reply to Jan-Ivar Bruaroey [:jib] (needinfo? me) from comment #1) > (In reply to Andreas Pehrson [:pehrsons] from comment #0) > > My assumption here is that CamerasParent has already shut down. It probably hit [2]. > > That's probably where it hit, but I'd expect the CamerasParent to be alive > here until IPC is torn down later in "xpcom-shutdown-threads", after the > MediaManager blocker. That would help. But what's to say that CamerasParent is alive until "xpcom-shutdown-threads" in the __child__? Are parent and children performing synchronized shutdown?
Assignee | ||
Comment 4•3 years ago
|
||
(In reply to Andreas Pehrson [:pehrsons] from comment #2) > We send an event from the child to the parent when the child shuts down, so > the parent can clean up (AllDone, [1], handled at [2]). > > We also send a message from the parent to the child when the parent is > destroyed (__delete__, [3]). However, this is not handled by the child, [4]. > Could it be as simple as handling this? ok, __delete__ is special. We send it on RecvAllDone and it triggers ActorDestroy() of the parent. I'm contemplating adding another message, similar to AllDone but from parent to child, sent when the capture backend has been shut down by xpcom-will-shutdown.
Assignee | ||
Comment 5•3 years ago
|
||
After discussing with jib on slack I agree with him that comment 1 looks correct and might be all that's needed to prevent this deadlock. What's not entire clear to us is how well-synchronized shutdown (ipc in particular) is across parent and child processes, so perhaps there's more to it, but we'll try this first.
Assignee | ||
Updated•3 years ago
|
Assignee: nobody → apehrson
Status: NEW → ASSIGNED
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 8•3 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=37fcbe4822029011a8b81b4e2c05acbd80e5948c Still leaks and timeouts, but that's likely orthogonal. I'll try to fix those as a followup unless they start showing up in tier-1 tests.
Comment 9•3 years ago
|
||
mozreview-review |
Comment on attachment 8940781 [details] Bug 1428390 - Use mChildIsAlive to decide whether to send replies. https://reviewboard.mozilla.org/r/211050/#review216990
Attachment #8940781 -
Flags: review?(jib) → review+
Comment 10•3 years ago
|
||
mozreview-review |
Comment on attachment 8940782 [details] Bug 1428390 - Remove virtual keyword for overrides in PCameras impl. https://reviewboard.mozilla.org/r/211052/#review216992
Attachment #8940782 -
Flags: review?(jib) → review+
Comment 11•3 years ago
|
||
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again. hg error in cmd: hg rebase -s 3dc33a2886f7 -d b7790af77bfd: rebasing 441672:3dc33a2886f7 "Bug 1428390 - Use mChildIsAlive to decide whether to send replies. r=jib" merging dom/media/systemservices/CamerasParent.cpp rebasing 441673:1834923d98cc "Bug 1428390 - Remove virtual keyword for overrides in PCameras impl. r=jib" (tip) merging dom/media/systemservices/CamerasChild.h warning: conflicts while merging dom/media/systemservices/CamerasChild.h! (edit, then use 'hg resolve --mark') unresolved conflicts (see hg resolve, then hg rebase --continue)
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 14•3 years ago
|
||
Pushed by pehrsons@gmail.com: https://hg.mozilla.org/integration/autoland/rev/640496b6e7c0 Use mChildIsAlive to decide whether to send replies. r=jib https://hg.mozilla.org/integration/autoland/rev/c75357e07696 Remove virtual keyword for overrides in PCameras impl. r=jib
Comment 15•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/640496b6e7c0 https://hg.mozilla.org/mozilla-central/rev/c75357e07696
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
status-firefox59:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
You need to log in
before you can comment on or make changes to this bug.
Description
•