Risk for shutdown deadlock in CamerasChild

RESOLVED FIXED in Firefox 59

Status

()

defect
P2
normal
Rank:
15
RESOLVED FIXED
2 years ago
5 months ago

People

(Reporter: pehrsons, Assigned: pehrsons)

Tracking

58 Branch
mozilla59
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox59 fixed)

Details

Attachments

(2 attachments)

Background: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9707133eea652e900ce4a1a3078351f93ae4026&selectedJob=154339202

Which gives us these facts:
> GECKO(2310) | FATAL ERROR: AsyncShutdown timeout in xpcom-will-shutdown Conditions: [{"name":"Media shutdown: blocking on media thread","state":"(none)","filename":"/builds/worker/workspace/build/src/dom/media/MediaManager.cpp","lineNumber":2020,"stack":"Media shutdown"}]

And a stack of the media thread (shortened for readability):
> [task 2018-01-05T11:17:05.253Z] 11:17:05     INFO - Thread 23
> [task 2018-01-05T11:17:05.253Z] 11:17:05     INFO -  0  libpthread-2.23.so + 0xd360
> [task 2018-01-05T11:17:05.262Z] 11:17:05     INFO -  1  libxul.so!mozilla::CondVar::Wait [BlockingResourceBase.cpp:a9707133eea6 : 604 + 0x9]
> [task 2018-01-05T11:17:05.266Z] 11:17:05     INFO -  2  libxul.so!mozilla::camera::CamerasChild::DispatchToParent [Monitor.h:a9707133eea6 : 40 + 0x14]
> [task 2018-01-05T11:17:05.271Z] 11:17:05     INFO -  3  libxul.so!mozilla::camera::LockAndDispatch<int>::LockAndDispatch [CamerasChild.cpp:a9707133eea6 : 244 + 0x5]
> [task 2018-01-05T11:17:05.276Z] 11:17:05     INFO -  4  libxul.so!mozilla::camera::CamerasChild::NumberOfCaptureDevices [CamerasChild.cpp:a9707133eea6 : 315 + 0x28]
> [task 2018-01-05T11:17:05.280Z] 11:17:05     INFO -  5  libxul.so!mozilla::MediaEngineWebRTC::EnumerateDevices [CamerasChild.h:a9707133eea6 : 136 + 0xb]
> [task 2018-01-05T11:17:05.282Z] 11:17:05     INFO -  6  libxul.so!mozilla::GetSources [MediaManager.cpp:a9707133eea6 : 1437 + 0x10]
> [task 2018-01-05T11:17:05.284Z] 11:17:05     INFO -  7  libxul.so!mozilla::media::LambdaTask<mozilla::MediaManager::EnumerateRawDevices(uint64_t, mozilla::dom::MediaSourceEnum, mozilla::dom::MediaSourceEnum, bool)::<lambda()> >::Run [MediaManager.cpp:a9707133eea6 : 1835 + 0xb]
> [task 2018-01-05T11:17:05.286Z] 11:17:05     INFO -  8  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:a9707133eea6 : 1040 + 0x15]


So the media thread is blocked at [1], while waiting for a reply from CamerasParent. My assumption here is that CamerasParent has already shut down. It probably hit [2].

Reading on in CamerasChild, there is code to recover from this scenario at [3]. This looks ok (modulo the comment "Hopefully prevent us from getting stuck on replies that'll never come."), so I'm going to assume that this was not called. Could that happen?

Yes it could most certainly!
We know that MediaManager's shutdown blocker is active. It's hooked up to "xpcom-will-shutdown", [4].
IPC could be shut down and release the MediaManager blocker, but IPC is hooked up to "xpcom-shutdown-threads", [5].

It's just that "xpcom-will-shutdown" happens before "xpcom-shutdown-threads", so basically the MediaManager shutdown blocker is stuck blocking indefinitely. See [6].


[1] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasChild.cpp#278
[2] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.cpp#476
[3] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasChild.cpp#714-717
[4] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/MediaManager.cpp#108
[5] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/ipc/glue/BackgroundImpl.cpp#1392
[6] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/xpcom/build/XPCOMInit.cpp#854,861,877
(In reply to Andreas Pehrson [:pehrsons] from comment #0)
> My assumption here is that CamerasParent has already shut down. It probably hit [2].

That's probably where it hit, but I'd expect the CamerasParent to be alive here until IPC is torn down later in "xpcom-shutdown-threads", after the MediaManager blocker.

I think the culprit is IsShuttingDown() [1] being too aggressive:

  bool IsShuttingDown()
  {
    return !mChildIsAlive || mDestroyed || !mWebRTCAlive;
  };

Specifically, I suspect CamerasParent is actually still alive, but mWebRTCAlive is false, causing it to be unresponsive.

Further digging reveals mWebRTCAlive is being set to false as early as during WILL_SHUTDOWN, which would explain it:

 1. CamerasParent::Observe is a WILL_SHUTDOWN observer which calls StopVideoCapture(),
    https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.cpp#175,179

 2. StopVideoCapture() dispatches a runnable that calls CloseEngines(),
    https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.cpp#213,216,220

 3. CloseEngines() sets mWebRTCAlive = false; which is an atomic,
    https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.cpp#434

 4. which causes IsShuttingDown() to begin to return true [1]
 5. which trips up CamerasParent::RecvNumberOfCaptureDevices() and tricks it to return NS_ERROR_FAILURE,
 6. which causes it to never respond to the Child, making CamerasParent act as dead effectively before it's officially
    shut down.
 7. ActorDestroy() won't save us, because that won't happen until later in SHUTDOWN_THREADS, so we hang in comment 0.

If any of this is caused by new code, we should find a regression range.

[1] https://searchfox.org/mozilla-central/rev/652fbd6270de0d3ec424d2b88f8375ff546f949f/dom/media/systemservices/CamerasParent.h#109
Rank: 15
Priority: -- → P2
We send an event from the child to the parent when the child shuts down, so the parent can clean up (AllDone, [1], handled at [2]).

We also send a message from the parent to the child when the parent is destroyed (__delete__, [3]). However, this is not handled by the child, [4]. Could it be as simple as handling this?


[1] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/dom/media/systemservices/PCameras.ipdl#89
[2] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/dom/media/systemservices/CamerasParent.cpp#1032
[3] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/dom/media/systemservices/PCameras.ipdl#70
[4] https://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PCamerasChild.cpp#21
(In reply to Jan-Ivar Bruaroey [:jib] (needinfo? me) from comment #1)
> (In reply to Andreas Pehrson [:pehrsons] from comment #0)
> > My assumption here is that CamerasParent has already shut down. It probably hit [2].
> 
> That's probably where it hit, but I'd expect the CamerasParent to be alive
> here until IPC is torn down later in "xpcom-shutdown-threads", after the
> MediaManager blocker.

That would help. But what's to say that CamerasParent is alive until "xpcom-shutdown-threads" in the __child__? Are parent and children performing synchronized shutdown?
(In reply to Andreas Pehrson [:pehrsons] from comment #2)
> We send an event from the child to the parent when the child shuts down, so
> the parent can clean up (AllDone, [1], handled at [2]).
> 
> We also send a message from the parent to the child when the parent is
> destroyed (__delete__, [3]). However, this is not handled by the child, [4].
> Could it be as simple as handling this?

ok, __delete__ is special. We send it on RecvAllDone and it triggers ActorDestroy() of the parent.

I'm contemplating adding another message, similar to AllDone but from parent to child, sent when the capture backend has been shut down by xpcom-will-shutdown.
After discussing with jib on slack I agree with him that comment 1 looks correct and might be all that's needed to prevent this deadlock. What's not entire clear to us is how well-synchronized shutdown (ipc in particular) is across parent and child processes, so perhaps there's more to it, but we'll try this first.
Assignee: nobody → apehrson
Status: NEW → ASSIGNED
https://treeherder.mozilla.org/#/jobs?repo=try&revision=37fcbe4822029011a8b81b4e2c05acbd80e5948c

Still leaks and timeouts, but that's likely orthogonal. I'll try to fix those as a followup unless they start showing up in tier-1 tests.
Comment on attachment 8940781 [details]
Bug 1428390 - Use mChildIsAlive to decide whether to send replies.

https://reviewboard.mozilla.org/r/211050/#review216990
Attachment #8940781 - Flags: review?(jib) → review+
Comment on attachment 8940782 [details]
Bug 1428390 - Remove virtual keyword for overrides in PCameras impl.

https://reviewboard.mozilla.org/r/211052/#review216992
Attachment #8940782 - Flags: review?(jib) → review+
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again.

hg error in cmd: hg rebase -s 3dc33a2886f7 -d b7790af77bfd: rebasing 441672:3dc33a2886f7 "Bug 1428390 - Use mChildIsAlive to decide whether to send replies. r=jib"
merging dom/media/systemservices/CamerasParent.cpp
rebasing 441673:1834923d98cc "Bug 1428390 - Remove virtual keyword for overrides in PCameras impl. r=jib" (tip)
merging dom/media/systemservices/CamerasChild.h
warning: conflicts while merging dom/media/systemservices/CamerasChild.h! (edit, then use 'hg resolve --mark')
unresolved conflicts (see hg resolve, then hg rebase --continue)
Pushed by pehrsons@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/640496b6e7c0
Use mChildIsAlive to decide whether to send replies. r=jib
https://hg.mozilla.org/integration/autoland/rev/c75357e07696
Remove virtual keyword for overrides in PCameras impl. r=jib
https://hg.mozilla.org/mozilla-central/rev/640496b6e7c0
https://hg.mozilla.org/mozilla-central/rev/c75357e07696
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
You need to log in before you can comment on or make changes to this bug.