Closed Bug 1209987 Opened 4 years ago Closed 4 years ago

webrtc.org Engine creation and destruction should happen on the WebRTC threads

Categories

(Core :: Audio/Video, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla44
Tracking Status
firefox44 --- fixed

People

(Reporter: gcp, Assigned: gcp)

Details

Attachments

(2 files, 3 obsolete files)

<jesup> gcp: The CamerasParent code blows up new thread-safety checks added in 43... :-(  It verifies that you're calling AllocateCaptureDevice() from the ViECaptureThread... and we're calling it from a CamerasParent thread (VideoCapture)
<jesup> gcp: I think module_process_thread_ is created in ViESharedData::ViESharedData() (via ProcessThread::Create())
<jesup> yes, pbackground: CamerasParent.cpp:275   helper->mEngine = webrtc::VideoEngine::Create(helper->mConfig);

What this means is that the creation and destruction of the webrtc.org engines should be done on the same thread as the one the actual calls are dispatched to.

We currently don't do this, consequently they're on PBackground:
https://dxr.mozilla.org/mozilla-central/source/dom/media/systemservices/CamerasParent.cpp#236
Assignee: nobody → gpascutto
This also slightly simplifies the shutdown sequence to make it clear
what happens where. Quite a bit of thread and ref juggling but no way
around that.
Attachment #8670739 - Flags: review?(rjesup)
So it looks like we can get rid of all the locking if we can run everything on the WebRTC thread. 

Unfortunately there's this:
https://bugzilla.mozilla.org/show_bug.cgi?id=1070216#c32

I'm going to see if that is still needed.
Comment on attachment 8670739 [details] [diff] [review]
webrtc.org Engine creation and destruction should happen on the WebRTC threads

Review of attachment 8670739 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/systemservices/CamerasParent.cpp
@@ +513,5 @@
>  CamerasParent::RecvGetCaptureDevice(const int& aCapEngine,
>                                      const int& aListNumber)
>  {
>    LOG((__PRETTY_FUNCTION__));
> +  LOG(("RecvGetCaptureDevice"));

redundant

@@ +575,3 @@
>        int error = -1;
> +      if (!self->EnsureInitialized(aCapEngine)) {
> +        LOG(("Fails to initialize"));

How about (for all of these) LOG(("Failed to initialize in " __FUNCTION__)) or "Failed to initialize in CamerasParent"

@@ +744,5 @@
> +              self->mEngines[aCapEngine].mEngineIsRunning = false;
> +            }
> +          }
> +          MutexAutoLock lock(self->mCallbackMutex);
> +          for (unsigned int i = 0; i < self->mCallbacks.Length(); i++) {

size_t perhaps

@@ +851,5 @@
> +
> +  // We need to shut down the video capture thread, but in
> +  // all likelyhood we're running on it (when the runnable
> +  // in ActorDestroy derefs self). Dispatch the stop to the main
> +  // thread (PBackground which started it is likely gone by now).

Thread shutdown has to occur from the starting thread I think, at least on some OSes (check with froyd)
Attachment #8670739 - Flags: review?(rjesup) → review+
Removed locks. Addressed review comments.
Attachment #8670813 - Flags: review?(rjesup)
Attachment #8670739 - Attachment is obsolete: true
(In reply to Randell Jesup [:jesup] from comment #3)
> Thread shutdown has to occur from the starting thread I think, at least on
> some OSes (check with froyd)

The VideoCaptureThread (base::thread) is created when PBackground creates the CamerasParent class. I presume that (the creation) is on the PBackground thread. When ActorDestroy gets called, we need to shut down the WebRTC stuff on the VideoCaptureThread. To do this the relevant shutdown runnable needs to take a reference to CamerasParent, and when the shutdown sequence chair-dance ends, it's going to be the one holding the last ref to it because PBackground is gone by then. So ~CamerasParent is now invoked on the VideoCaptureThread and somehow has to get rid of itself. Oops.

This is tricky. I can think of 3 ways (that I dare post here):

1) Synchronously dispatch the WebRTC shutdown tasks to the VideoCaptureThread, while sitting on the PBackground thread in ActorDestroy, and then do the destruction of the thread on PBackground. This is a little evil, and I haven't tried if it works, but I suspect it might. This causes PBackground to block on WebRTC shutdown.
2) Do the thread creation on PBackground, and in the destructor Dispatch the thread stop/deletion to the mainthread. This is the cleanest. But as stated above jesup isn't sure it's allowed, I'm not either, and froydnj responded "dare I ask why you even want to know this?"
3) Somehow do the thread creation on mainthread. You really want to have the thread up by the time the CamerasParent ctr finishes though, because otherwise all the incoming PBackground traffic must check if that thread is up (and hold if it's not done yet? that brings us back to 1).
I ended up implementing (3). I think it's what I'll least regret later.
Attachment #8670839 - Flags: review?(rjesup)
Attachment #8670813 - Attachment is obsolete: true
Attachment #8670813 - Flags: review?(rjesup)
Need to test how much (hopefully nothing) breaks after bug Bug 1070216 lands.
Attachment #8670865 - Flags: review?(rjesup)
Comment on attachment 8670839 [details] [diff] [review]
webrtc.org Engine creation and destruction should happen on the WebRTC threads

Review of attachment 8670839 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the mVideoCaptureThread fully within the monitor lock

::: dom/media/systemservices/CamerasParent.cpp
@@ +629,2 @@
>          cbh = self->mCallbacks.AppendElement(
> +            new CallbackHelper(static_cast<CaptureEngine>(aCapEngine), capnum, self));

Is this just extra indent?

@@ +812,5 @@
> +  if (mVideoCaptureThread) {
> +    base::Thread *thread = mVideoCaptureThread;
> +    {
> +      MonitorAutoLock lock(mThreadMonitor);
> +      mVideoCaptureThread = nullptr;

*thread = mVideoCaptureThread; 
That should be within the lock, no?  And the test.
If mVideoCaptureThread is referenced from multiple threads (and the lock says it is), then all access must be under lock or otherwise guaranteed not to have TSAN issues -- or make it Atomic, which might be a pain perhaps.
Attachment #8670839 - Flags: review?(rjesup) → review+
(In reply to Randell Jesup [:jesup] from comment #10)
> @@ +812,5 @@
> > +  if (mVideoCaptureThread) {
> > +    base::Thread *thread = mVideoCaptureThread;
> > +    {
> > +      MonitorAutoLock lock(mThreadMonitor);
> > +      mVideoCaptureThread = nullptr;
> 
> *thread = mVideoCaptureThread; 
> That should be within the lock, no?  And the test.
> If mVideoCaptureThread is referenced from multiple threads (and the lock
> says it is), then all access must be under lock or otherwise guaranteed not
> to have TSAN issues -- or make it Atomic, which might be a pain perhaps.

Well, the thread that this is on is the only one writing to it, so I think the *reads* in the same thread can safely go outside the lock.
Failing on try:

 10:21:34     INFO -  [1882] ###!!! ASSERTION: Failed NS_DispatchToMainThread() in shutdown; leaking: 'false', file /builds/slave/try-lx-d-000000000000000000000/build/src/xpcom/glue/nsThreadUtils.cpp, line 192
 10:21:34     INFO -  #01: NS_DispatchToMainThread(already_AddRefed<nsIRunnable>&&, unsigned int) [xpcom/glue/nsThreadUtils.cpp:192]
 10:21:34     INFO -  #02: NS_DispatchToMainThread(nsIRunnable*, unsigned int) [xpcom/glue/nsThreadUtils.cpp:209]
 10:21:34     INFO -  #03: mozilla::camera::CamerasParent::~CamerasParent() [dom/media/systemservices/CamerasParent.cpp:826]
 10:21:34     INFO -  #04: mozilla::camera::CamerasParent::~CamerasParent() [memory/mozalloc/mozalloc.h:210]
 10:21:34     INFO -  #05: mozilla::camera::CamerasParent::Release() [dom/media/systemservices/CamerasParent.h:78]
 10:21:34     INFO -  #06: mozilla::media::LambdaRunnable<mozilla::camera::CamerasParent::ActorDestroy(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason)::<lambda()> >::~LambdaRunnable [xpcom/glue/nsThreadUtils.h:230]

The comments in nsThreadUtils.cpp seem to suggest that this Is A Bad Idea.
Attachment #8670865 - Flags: review?(rjesup) → review+
Sorry for yet another review, but as you can see above all the 
past approaches didn't actually work properly.

We now observe for xpcom-shutdown, dispatch the WebRTC engine
closing to the WebRTC thread, and sit on a monitor (i.e. it's a blocking,
sync dispatch) for that to finish, after which we dispatch the WebRTC
thread shutdown to the main thread (which might not be a real dispatch
if the shutdown came from xpcom-shutdown since we're in the main thread then,
but we must also handle the case where PBackground shuts down our actor). 
We must block hard because once in xpcom-shutdown, the runnable in the WebRTC
thread might  not get a chance to dispatch the thread close to the main thread
before we're in xpcom-shutdown-threads, at which point the main thread is no 
longer  accessible, causing an inevitable thread leak.

The proper machinery has been added to distinguish the two additional
states, i.e. "WebRTC thread still being spun up" and "WebRTC is being
shut down, thread may go away soon".
Attachment #8671955 - Flags: review?(rjesup)
Attachment #8670839 - Attachment is obsolete: true
Comment on attachment 8671955 [details] [diff] [review]
webrtc.org Engine creation and destruction should happen on the WebRTC threads

Review of attachment 8671955 [details] [diff] [review]:
-----------------------------------------------------------------

r+ assuming there's no issue with the booleans and RemoveObserver() is added (or point out that it's not needed for some reason), but we may be introducing races depending on the answer about the booleans.  Likely that requires at least Atomic<> or a lock.

::: dom/media/systemservices/CamerasParent.cpp
@@ +151,5 @@
> +                       const char16_t *aData)
> +{
> +  MOZ_ASSERT(!strcmp(aTopic, NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID));
> +  StopVideoCapture();
> +  return NS_OK;

Don't you need to remove the Observer here?

@@ +399,5 @@
> +      LOG(("Being closed down while engine %d is running!", i));
> +    }
> +    if (mEngines[i].mPtrViERender) {
> +      mEngines[i].mPtrViERender->Release();
> +      mEngines[i].mPtrViERender = nullptr;

Perhaps not critical for this patch landing (and an existing issue), but wouldn't it be easier/safer to use ScopedCustomReleasePtr<webrtc::ViEBase>/etc like VideoConduit/AudioConduit do?  Then these become ...mPtrViERender = nullptr;/etc

::: dom/media/systemservices/CamerasParent.h
@@ +142,5 @@
>  
>    // Shutdown handling
>    bool mChildIsAlive;
>    bool mDestroyed;
> +  bool mWebRTCAlive;

What's the thread access story for these three?  What threads can access? Is there locking? (I think no) Perhaps these should be some form of Atomic<bool> (not sure the level of consistency across the set needed)
Attachment #8671955 - Flags: review?(rjesup) → review+
(In reply to Randell Jesup [:jesup] from comment #21)

> >    // Shutdown handling
> >    bool mChildIsAlive;
> >    bool mDestroyed;
> > +  bool mWebRTCAlive;
> 
> What's the thread access story for these three?  What threads can access? Is
> there locking? (I think no) Perhaps these should be some form of
> Atomic<bool> (not sure the level of consistency across the set needed)

mChildIsAlive and mDestroyed should be PBackground-only vars.

It looks like mWebRTCAlive needs a memory barrier in EnsureInitialized (the other access are under the monitor), so making it Atomic is probably easiest.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #11)
> (In reply to Randell Jesup [:jesup] from comment #10)
> > @@ +812,5 @@
> > > +  if (mVideoCaptureThread) {
> > > +    base::Thread *thread = mVideoCaptureThread;
> > > +    {
> > > +      MonitorAutoLock lock(mThreadMonitor);
> > > +      mVideoCaptureThread = nullptr;
> > 
> > *thread = mVideoCaptureThread; 
> > That should be within the lock, no?  And the test.
> > If mVideoCaptureThread is referenced from multiple threads (and the lock
> > says it is), then all access must be under lock or otherwise guaranteed not
> > to have TSAN issues -- or make it Atomic, which might be a pain perhaps.
> 
> Well, the thread that this is on is the only one writing to it, so I think
> the *reads* in the same thread can safely go outside the lock.

Actually, one of the horrors of TSAN is that this is wrong.  Reads from another thread can corrupt the value.  Really. https://blog.mozilla.org/nnethercote/2015/02/24/fix-your-damned-data-races/ and the lastr example from https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
(In reply to Randell Jesup [:jesup] from comment #23)
> Actually, one of the horrors of TSAN is that this is wrong.  Reads from
> another thread can corrupt the value.  Really.
> https://blog.mozilla.org/nnethercote/2015/02/24/fix-your-damned-data-races/
> and the lastr example from
> https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-
> could-possibly-go-wrong

The relevant ones only apply to stack vars, not in-memory ones, which is guaranteed here. Of course there's the catch-all undefined behavior issue.

Note that the last patch already got rid of that code.
https://hg.mozilla.org/integration/mozilla-inbound/rev/fc7d1acf804f05db4c6cea3f480cbc6bccbce153
Bug 1209987 - webrtc.org Engine creation and destruction should happen on the WebRTC threads. r=jesup
Hrm, this happened twice in a row, and backfilling the gap behind the first instance showed this was in fact the first instance. But then it didn't fail on a new push several pushes later (which didn't finish until I had already backed this out, assuming it was permafailing). So maybe this isn't at fault.

doing a bunch of retriggers to see if I can get it to happen earlier.
This code has a large risk of causing leaks (see all try pushes above), but the current implementation didn't leak on try, and the leak in the logs seems to originate from unrelated audio code.
Flags: needinfo?(gpascutto)
https://hg.mozilla.org/integration/mozilla-inbound/rev/9c88cbf86ea362ddcfa0ae692caa538aaa120640
Bug 1209987 - webrtc.org Engine creation and destruction should happen on the WebRTC threads. r=jesup
Relanded in agreement with the sheriffs.
<nigelb|sheriffduty> Out of 8 test runs, only 2 were orange.

It's possible this modifies shutdown timing so it's exposing bugs in other code.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #31)
> Relanded in agreement with the sheriffs.
> <nigelb|sheriffduty> Out of 8 test runs, only 2 were orange.
> 
> It's possible this modifies shutdown timing so it's exposing bugs in other
> code.

I did a bunch of retriggers on pushes near where you landed, and it leaked just like that linked one well before you landed, so I think you're in the clear in this particular case.
https://hg.mozilla.org/mozilla-central/rev/9c88cbf86ea3
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
You need to log in before you can comment on or make changes to this bug.