Closed Bug 864255 Opened 11 years ago Closed 11 years ago

cubeb use-after-free of mMonitor during shutdown [@audiounit_output_callback]

Categories

(Core :: Audio/Video, defect, P1)

x86_64
macOS
defect

Tracking


RESOLVED FIXED
mozilla24
Tracking Status
firefox21 --- disabled
firefox22 + fixed
firefox23 + fixed
firefox24 --- fixed
firefox-esr17 --- unaffected
b2g18 --- unaffected

People

(Reporter: posidron, Assigned: jesup)

References

Details

(Keywords: crash, csectype-uaf, Whiteboard: [webrtc][blocking-webrtc+][qa-][adv-main22-])

Crash Data

Attachments

(10 files, 2 obsolete files)

2.11 KB, text/plain
Details
31.90 KB, text/plain
Details
3.59 MB, text/plain
Details
2.40 KB, patch
jesup
: review+
Details | Diff | Splinter Review
97.93 KB, text/plain
Details
2.04 KB, patch
jesup
: review+
Details | Diff | Splinter Review
17.28 KB, patch
ekr
: review+
Details | Diff | Splinter Review
6.57 KB, patch
jesup
: review+
Details | Diff | Splinter Review
1.24 KB, patch
ekr
: review+
Details | Diff | Splinter Review
78.22 KB, text/plain
Details
Attached file callstack
Not sure exactly what is causing this crash, but it has now happened three times that ASan tries to report a UAF during shutdown when I randomly close the Firefox instance while fuzzing JSEP - each time Firefox finished before ASan was able to write out the trace.

I have attached the full log of the fuzzing session and the partial callstack from ASan (the re-use stack); the stacks for the allocation and the free are missing.

What I am seeing very often is a hang instead of this crash. I will attach the stacks of each thread during the hang as well - perhaps relevant to this crash.


re-use: media/libcubeb/src/cubeb_audiounit.c:65

    static OSStatus
    audiounit_output_callback(...)
    {
    [..]
*     got = stm->data_callback(stm, stm->user_ptr, buf, nframes);


    long
    BufferedAudioStream::DataCallback(void* aBuffer, long aFrames)
    {
*     MonitorAutoLock mon(mMonitor);
    [..]

    void
    Mutex::Lock()
    {
        CallStack callContext = CallStack();

*       CheckAcquire(callContext);
        PR_Lock(mLock);
    [..]

    void
    BlockingResourceBase::CheckAcquire(const CallStack& aCallContext)
    {
*       if (eCondVar == mDDEntry->mType) {
    [..]


Tested with m-i changeset: 129433:a1a2af3cebff
Attached file callstack-hang
Attached file fuzzing-session
Thread 38 in attachment 740197 [details] indicates we're stuck waiting for QTKit ([QTCaptureSession addInput:error:]).  Internally, I believe that sends native events to the main thread (we've run into this before) and can be blocked because the main thread stops handling native events.  (bsmedberg and I dealt with one of these when closing a session; this may be the inverse of that when opening one.)  For added fun, GDB got confused trying to dump all the thread stacks.

I'm not sure that and the ASan problem are the same bug; I advise investigating as if they aren't.  (Please clone the bug; the hang problem very likely isn't a security issue, so please strip that one of any discussion of ASan issues.)

The ASan bug looks like it may be a failure-to-shut-down issue, or maybe a "we've told cubeb to play some data, and then MSG and/or the application that gave it the data went away".  (I think we transfer ownership to MSG when telling it to play, so more likely MSG.)

It appears WebRTC shutdown was delayed (not sure why) until XPCOM was so far into shutdown that we couldn't get handles to MainThread, which blocks a lot of our pathways to shut down subsidiary parts of WebRTC, and that may have led to threads staying alive, pumping data into MSG/cubeb past the normal shutdown points.

WARNING: NS_ENSURE_TRUE(mThread != PR_GetCurrentThread()) failed: file /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/xpcom/threads/nsThread.cpp, line 446
WARNING: nsExceptionService ignoring thread destruction after shutdown: file /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/xpcom/base/nsExceptionService.cpp, line 167

(and then a bunch of errors about being unable to get MainThread)
Blocks: 864371
So from the stack, it appears the use-after-free is of the mMonitor itself.  This implies some type of internal error in cubeb (on Mac at least), where you're getting callbacks for data after (I assume) cubeb has started tearing itself down.  That suggests a missing wait for something to be removed, or a missing removal handshake.

Kinetik, can you take a look?
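
To illustrate, here is a minimal sketch of that kind of removal handshake, assuming a pthread-style lock shared by the render callback and the teardown path; the struct and function names are illustrative, not the real cubeb_audiounit.c code:

    #include <pthread.h>

    /* Illustrative stream type, not the real cubeb_stream. */
    typedef struct stream {
      pthread_mutex_t lock;
      int shutting_down;
      void * user_ptr;
      long (* data_callback)(struct stream * stm, void * user, void * buf, long nframes);
    } stream;

    static long
    output_callback_sketch(stream * stm, void * buf, long nframes)
    {
      long got = 0;
      pthread_mutex_lock(&stm->lock);
      if (!stm->shutting_down)   /* teardown has not started: safe to call out */
        got = stm->data_callback(stm, stm->user_ptr, buf, nframes);
      pthread_mutex_unlock(&stm->lock);
      return got;
    }

    static void
    stream_teardown_sketch(stream * stm)
    {
      pthread_mutex_lock(&stm->lock);
      stm->shutting_down = 1;    /* no further calls into user code */
      pthread_mutex_unlock(&stm->lock);
      /* Any callback already inside data_callback held the lock above, so once
         we get here it has finished; only now may the caller free user_ptr
         (and the monitor it points at). */
    }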
Flags: needinfo?(kinetik)
Whiteboard: [webrtc][blocking-webrtc+][webrtc-uplift]
Assigning to Adam for investigation (to at least figure out where it failed to shut down, or why the shutdown was delayed - waiting for a TURN timeout?).  That seems to be the root cause; the UAF in cubeb seems to be an exposed latent bug.
Assignee: nobody → adam
This message is a red herring:

>WARNING: NS_ENSURE_TRUE(mThread != PR_GetCurrentThread()) failed: file /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/xpcom/threads/nsThread.cpp, line 446

Even without any WebRTC interaction, it will be emitted when the "Cache I/O" thread is shut down; you can verify by changing the NS_ENSURE_TRUE to a MOZ_ASSERT and checking: "print (const char*)PR_GetThreadName(mThread)"
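
(For anyone reproducing this, a hypothetical local diagnostic - not a landed change - along the lines described above:)

    // In nsThread.cpp, turn the warning into a fatal assert so the debugger
    // stops at the offending caller (illustrative; the real line also returns
    // an error value via NS_ENSURE_TRUE):
    MOZ_ASSERT(mThread != PR_GetCurrentThread(),
               "nsThread::Shutdown called from the thread being shut down");
    // then, at the assert, in gdb:  print (const char*)PR_GetThreadName(mThread)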

Still digging...
For that matter, the "nsExceptionService ignoring thread destruction after shutdown" warning is always emitted during normal shutdown, regardless of whether any WebRTC code has been invoked.

I do see the log messages that indicate that the PeerConnectionImpl and its friends are being torn down after most of the other shutdown is finished, but that's perfectly normal. Keep in mind that the only way PCs ever go away is via CC/GC, so if you shut down the browser before the GC/CC has a chance to run, they are still alive at that point. We've done quite a bit of bulletproofing to make sure this is a safe thing to do (cleaning up PCs in the midst of Ragnarök), although it's quite possible that we've missed something. However, I don't see evidence in this log that anything is amiss in the WebRTC handling in general.

Here's what I see in the log:

(1) Normal shutdown of the SIPCC threads at process shutdown (see the various "Destroying Thread #" log messages)
(2) Two vaguely worrying warning messages that arise both with and without WebRTC
(3) "Exit() called redundantly"
(4) Several PCs are harvested by GC/CC and correctly ignore the missing infrastructure (cf. "PeerConnectionCtx is already gone. Ignoring...").
(5) Some attempts by someone (not the PC destructors AFAICT, since the calls happen after the destructors have done their job; not anything on the SIPCC threads, since they've already pthread_exit()ed) to access the main thread. I have not been able to replicate this behavior, but I don't yet see anything that implicates WebRTC.
(6) Several DOMWINDOWs are collected
(7) Several DataChannels being closed as part of GC/CC. From what I can tell, these are the "zombies" described by the comment on line 2253 of DataChannel.cpp
(8) A final batch of DOMWINDOW teardowns interspersed with PeerConnectionImpl cleanup; presumably these are the PCs that are still rooted in DOMWINDOWs.

All of which points to normal and expected behavior for a test that creates PCs and then quickly closes the browser. Have I overlooked something here?
Christoph: Can you re-run your fuzzing with the attached patch and see whether it asserts? (And, if it does, post a backtrace here).

After poking at the problem, I think it's possible that (5), above, might be caused by some of our code. I've added asserts to the two locations where we call into NS_DispatchToMainThread() so I can verify whether the messages are due to PC teardown or some other cause.
Flags: needinfo?(cdiehl)
(In reply to Adam Roach [:abr] from comment #9)
> Christoph: Can you re-run your fuzzing with the attached patch and see
> whether it asserts? (And, if it does, post a backtrace here).

I have tried many times now to reproduce the issue but failed. I am also not seeing any of the ASSERTs from your patch fire.
Flags: needinfo?(cdiehl)
(In reply to Christoph Diehl [:cdiehl] from comment #10)

> I have tried many times now to reproduce the issue but failed. I am also not
> seeing any of the ASSERTs from your patch fire.


Thanks for checking. Can you double-check to see whether you are getting the "WARNING: NS_ENSURE_TRUE(mMainThread) failed" messages still? If you're getting those with my patch *and* you're getting no asserts, then I'm pretty sure the problem is in non-WebRTC code.
Comment on attachment 741545 [details] [diff] [review]
Ensure main thread dispatch for Pipeline destruction succeeds

I'd like to land this patch so we catch the bad behavior "in the act" if it comes up again (and is, in fact, in our code).
Attachment #741545 - Flags: review?(rjesup)
Attachment #741545 - Flags: review?(rjesup) → review+
Whiteboard: [webrtc][blocking-webrtc+][webrtc-uplift] → [webrtc][blocking-webrtc+][webrtc-uplift][leave-open]
Patch landed for diagnostics; bug tagged with [leave-open] for root problem:

https://hg.mozilla.org/integration/mozilla-inbound/rev/30cb835b6282
https://hg.mozilla.org/mozilla-central/rev/30cb835b6282
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
[leave-open]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Randell Jesup [:jesup] from comment #4)
> Kinetik, can you take a look?

It looks awfully similar to bug 865413.
Flags: needinfo?(kinetik)
This UAF appears to be the other side of the same problem (input instead of output) as bug 865413.  There are two issues here I assume: one is failing to shut down audio input until "too late", the other is a use-after-free when that happens (that should never happen, regardless of whether we failed to shut down cubeb normally).  I'm going to retarget this bug as being about the UAF of the mMonitor by cubeb.

The other item flagged here is truly a separate bug: a hang trying to shut down while webrtc/qtkit is trying to open a capture channel.  As mentioned above in comment 3, this is likely a "main thread has stopped processing native events in shutdown" bug as we've seen a few times before (see thread 38).  This should be split off as a separate non-sec bug and we should ask smichaud to look at it.  Steven: does this make sense given the callstack?
Assignee: adam → kinetik
Component: WebRTC: Audio/Video → Video/Audio
Depends on: 865413
Flags: needinfo?(smichaud)
QA Contact: jsmith
Summary: WebRTC use-after-free crash during shutdown [@audiounit_output_callback] → cubeb use-after-free of mMonitor during shutdown [@audiounit_output_callback]
Whiteboard: [webrtc][blocking-webrtc+][webrtc-uplift][leave-open] → [webrtc][blocking-webrtc+][webrtc-uplift]
> The other item flagged here is truly a separate bug: a hang trying
> to shut down while webrtc/qtkit is trying to open a capture channel.
> As mentioned above in comment 3, this is likely a "main thread has
> stopped processing native events in shutdown" bug as we've seen a
> few times before (see thread 38).  This should be split off as a
> separate non-sec bug and we should ask smichaud to look at it.
>
> Steven: does this make sense given the callstack?

The "main thread has stopped processing native events in shutdown"
scenario is possible, but it's only a guess.

Neither this bug's hang stack nor the one at bug 864371 has any
information on the main thread, so we don't really know what's
going on there.
Flags: needinfo?(smichaud)
(Following up comment #18)

In other words, the hang is just as likely to have been caused by thread contention of some sort.

Any way we can get a main thread backtrace during one of these hangs?
> Any way we can get a main thread backtrace during one of these hangs?

Christoph?  On a hang, try "thread 1; where".  Also, if you can capture/dump normal stdout/stderr, that might help.  Thanks
Flags: needinfo?(cdiehl)
(In reply to Randell Jesup [:jesup] from comment #17)
> This UAF appears to be the other side of the same problem (input instead of
> output) as bug 865413.  There are two issues here I assume: one is failing
> to shut down audio input until "too late", the other is a use-after-free
> when that happens (that should never happen, regardless of whether we failed
> to shut down cubeb normally).  I'm going to retarget this bug as being about
> the UAF of the mMonitor by cubeb.

It's mDDEntry, which is owned by a global DeadlockDetector that is freed when BlockingResourceBase::Shutdown is called.  I'm not sure how cubeb is supposed to defend itself against that in a robust way.  All I can think of is to try to arrange for AudioStream::ShutdownLibrary to be called before BlockingResourceBase::Shutdown is, so that cubeb can fire "use after shutdown" asserts.

But as far as I can see, the core bug lies elsewhere: don't leak running cubeb streams until shutdown.
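
A minimal sketch of the "use after shutdown" assert idea, assuming AudioStream::ShutdownLibrary can be ordered before BlockingResourceBase::Shutdown; the flag and function names below are illustrative, not existing code:

    #include "mozilla/Assertions.h"

    // Illustrative only: a library-wide flag flipped before the deadlock
    // detector is torn down, so a late cubeb callback asserts instead of
    // touching a freed monitor.  Real code would need proper synchronization.
    static bool sAudioLibraryShutDown = false;

    void ShutdownLibrarySketch()   // hypothetical; runs before BlockingResourceBase::Shutdown
    {
      sAudioLibraryShutDown = true;
    }

    long DataCallbackSketch(void* aBuffer, long aFrames)
    {
      MOZ_ASSERT(!sAudioLibraryShutDown,
                 "cubeb data callback after AudioStream::ShutdownLibrary");
      // ... the normal MonitorAutoLock path would follow here ...
      return aFrames;
    }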
So kinetik is basically saying the actual bug is the other of the two parts I mentioned (not the UAF, but the failure to shut cubeb down earlier) -- which is reasonable.

Bouncing back to abr - but without more info, we're short on traction here.  How strongly do you believe that the asserts will fire if we're going to hit this case? If you think there's a real chance of those asserts firing when we hit this, let's add an opt-build MOZ_CRASH to the function if those dispatches fail - crashing under control is better than crashing with a UAF.
Assignee: kinetik → adam
Flags: needinfo?(adam)
Flags: needinfo?(cdiehl)
> #4  0x0000000106bfc6c6 in mozilla::ReentrantMonitor::Wait
>  (this=0x611000001228, interval=4294967295) at BlockingResourceBase.cpp:312

This seems to be the cause of the hang.  Anyone know what it's waiting on?
And by the way, Christoph, does the main thread look the same over multiple hangs?
Flags: needinfo?(cdiehl)
(In reply to Steven Michaud from comment #25)
> And by the way, Christoph, does the main thread look the same over multiple
> hangs?

Yes, seems like it.
Flags: needinfo?(cdiehl)
> Anyone know what it's waiting on?

Seems to me that the main thread is blocked waiting on thread 8 in this bug's hang stack (attachment 740197 [details]).
Comment on attachment 746629 [details] [diff] [review]
Crash in production builds if we can't shut down properly

This patch adds crashes to production builds under the circumstances that the previous crash asserted for. This is being done for two reasons: (1) we feel it is better to crash the process than expose a potential security hole; and (2) doing so will allow us to collect crash reports from the field when this issue arises, providing more data for us to analyze.

Keep in mind that this bug only arises in shutdown, where browser crashes will have virtually no impact on user experience.
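
For reference, a sketch of the pattern this patch adds (the function and runnable below are stand-ins, not the actual Pipeline-destruction code):

    #include "nsThreadUtils.h"
    #include "mozilla/Assertions.h"

    // If the main-thread dispatch that should finish tearing the pipeline down
    // fails, crash deliberately rather than risk a use-after-free later.
    void DispatchPipelineDestructionSketch(nsIRunnable* aDestructionRunnable)
    {
      nsresult rv = NS_DispatchToMainThread(aDestructionRunnable);
      if (NS_FAILED(rv)) {
        MOZ_CRASH();  // controlled crash in opt builds beats a UAF
      }
    }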
Attachment #746629 - Flags: review?(rjesup)
Steven - 
Actually, MainThread seems to be sitting in the normal event queue, waiting for *some* thread to respond to Shutdown() with a shutdownAck.  And there's nothing that I see that implicates thread 8; none of the stacks in attachment 740197 [details] have Shutdown() anywhere, but we're missing a few threads.

If we're trying to Shutdown() a thread that has sent a native event to the MainThread, and MainThread isn't processing native events anymore, then the Shutdown() will never complete: the target thread is waiting for its native event to be processed before going back to its Mozilla event queue - and in particular, thread 38 seems to be waiting on a native event.

Benjamin and I ran into this before with deleting a QTKit channel.  IIRC the stopping of native event processing is somewhere in ProcessNextEvent (which looks at mIsMainThread).
Attachment #746629 - Flags: review?(rjesup) → review+
Whiteboard: [webrtc][blocking-webrtc+][webrtc-uplift] → [webrtc][blocking-webrtc+][webrtc-uplift][leave-open]
(In reply to comment #30)

If the main thread is blocked, then yes, native events will stop being processed.  And they can be processed either by [NSApp run] or (if Gecko code is running above that) nsAppShell::ProcessNextNativeEvent().

But I still think the hang is entirely due to Gecko code.  After all you've got (at least) Gecko code in the main thread waiting on Gecko code in thread 8 (or possibly vice versa).

There may be other things going on, too, that we don't know about, since we're still missing information on some of the threads.  But for a native hang, we need evidence of at least two different threads waiting on some kind of native object -- which we don't have.

Of course we can't know for sure until we have a reliable way to reproduce the hangs (and crashes).
The testcase in bug 864371 is very reliable on my MacBookPro.
It doesn't work at all on mine :-)
(In reply to Steven Michaud from comment #31)
> (In reply to comment #30)
> 
> If the main thread is blocked, then yes, native events will stop being
> processed.  And they can be processed either by [NSApp run] or (if Gecko
> code is running above that) nsAppShell::ProcessNextNativeEvent().
> 
> But I still think the hang is entirely due to Gecko code.  After all you've
> got (at least) Gecko code in the main thread waiting on Gecko code in thread
> 8 (or possibly vice versa).

I'm confused why you believe it's waiting on thread 8; MainThread is idle, sitting processing normal events, waiting for a shutdownAck.

> 
> There may be other things going on, too, that we don't know about, since
> we're still missing information on some of the threads.  But for a native
> hang, we need evidence of at least two different threads waiting on some
> kind of native object -- which we don't have.

But thread 38 is waiting, and MainThread (from past experience) is no longer processing native events at this point, only Mozilla events, while waiting for some thread (likely thread 38!) to process a ShutdownAck message.  Look at the nsThread.cpp and nsThreadManager.cpp lines.  Deadlock.

Thread 38 (process 82272):
#0  0x00007fff878f80fa in __psynch_cvwait ()
No symbol table info available.
#1  0x00007fff8b16cfe9 in _pthread_cond_wait ()
No symbol table info available.
#2  0x00007fff8a83f386 in -[NSCondition wait] ()

and

#0  0x00007fff8ec200fa in __psynch_cvwait ()
#1  0x00007fff98e72fe9 in _pthread_cond_wait ()
#2  0x0000000103273a6c in PR_WaitCondVar ()
#3  0x0000000103274dab in PR_Wait ()
#4  0x0000000106bfc6c6 in mozilla::ReentrantMonitor::Wait (this=0x611000001228, interval=4294967295) at BlockingResourceBase.cpp:312
#5  0x0000000106cb7147 in nsEventQueue::GetEvent (this=0x611000001228, mayWait=<value temporarily unavailable, due to optimizations>, result=0x7fff5fbfdd80) at /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/xpcom/threads/nsEventQueue.cpp:58
#6  0x0000000106cbae6c in nsThread::ProcessNextEvent (this=0x6110000011c0, mayWait=true, result=0x7fff5fbfdf00) at /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/xpcom/threads/nsThread.cpp:619
#7  0x0000000106bf5285 in NS_ProcessNextEvent (thread=<value temporarily unavailable, due to optimizations>, mayWait=true) at nsThreadUtils.cpp:238
#8  0x0000000106cba44e in nsThread::Shutdown (this=0x6110006ffe40) at /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/xpcom/threads/nsThread.cpp:474
#9  0x0000000106cbe0ba in nsThreadManager::Shutdown (this=0x10c658140) at /Users/cdiehl/dev/repos/mozilla/mozilla-inbound/xpcom/threads/nsThreadManager.cpp:128

> Of course we can't know for sure until we have a reliable way to reproduce
> the hangs (and crashes).

Since cdiehl can, please feel free to put up some debugs he can run to get any additional info needed.

All of this said: it would be far better if we didn't continue to try to open channels while in shutdown, but it's tricky to get shutdown to happen reliably early enough to get everything torn down, since much of shutdown doesn't wait after firing notifications.  For the other case like this, Benjamin and I ended up using DISPATCH_SYNC to the main thread to do the shutdown of the channel.
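
A rough sketch of that synchronous-dispatch workaround (the runnable name and body are hypothetical, not code from this bug; what matters is the NS_DISPATCH_SYNC flag, which makes the call return only after the main thread has run the teardown):

    #include "nsThreadUtils.h"

    // Hypothetical runnable that releases the capture channel on the main thread.
    class ShutdownCaptureChannelSketch : public nsRunnable {
    public:
      NS_IMETHOD Run() {
        // ... release the QTKit capture session here, on the main thread ...
        return NS_OK;
      }
    };

    void ShutdownChannelSyncSketch()
    {
      nsCOMPtr<nsIRunnable> task = new ShutdownCaptureChannelSketch();
      // Synchronous: blocks the calling thread until Run() has completed.
      NS_DispatchToMainThread(task, NS_DISPATCH_SYNC);
    }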
I finally get this.

By the time the hang takes place (and probably also the crashes), nsAppShell has been killed (nsAppShell::Exit() has been called), and it's no longer *possible* to process native events.

Somehow QTKit is still waiting for a native event to be processed on the main thread (as a result of a call to -[NSObject performSelectorOnMainThread:...]).  But I wonder if QTKit made this call *after* nsAppShell was killed.  If so, this would have been because Gecko code made a call into QTKit after nsAppShell was killed.
> (3) "Exit() called redundantly"

This is proof of what I said in comment #35.  nsAppShell::Exit() has actually been called more than once.

(Calling nsAppShell::Exit() more than once is actually not a serious problem.  Its code knows how to handle this.)
> But I wonder if QTKit made this call *after* nsAppShell was killed.
> If so, this would have been because Gecko code made a call into
> QTKit after nsAppShell was killed.

Cdiehl, you could confirm (or deny) this hunch by adding logging to
nsAppShell::Exit() and
webrtc::videocapturemodule::VideoCaptureMacQTKit::Init().

I wonder if it would make sense to keep nsAppShell alive (to leak it
on exit), so that we could query whether nsAppShell::Exit() has
already been called?
> For the other case like this, Benjamin and I ended up using
> DISPATCH_SYNC to the main thread to do the shutdown of the channel.

This may work again.  But you'd need to do it *before*
nsAppShell::Exit() is called for the first time.
Keeping nsAppShell around to process native events until we finish shutting down threads would seem to be a good thing, as we've run into this multiple times.  However...

We shut down all PCs on profile-change-net-teardown and network:offline-about-to-go-offline.  PCs may be held (partially) alive longer due to cycles and GC, but we call close() on all of them on those notifications.  However, since we removed the DISPATCH_SYNCs in PeerConnection, this is no longer a synchronous close (though PeerConnection.js asks for one) - media conduits are closed asynchronously - and in fact we Dispatch back to the main thread to complete the Close() after shutting down transports on the STS thread.

So profile-change-net-teardown will queue all the ShutdownMedia's to the STS thread (they may or may not have been processed yet). It does guarantee that they'll be processed before any subsequent event sent to the STS thread, which likely satisfies the requirements of bug 812886 under the assumption that shutdown of STS and NSS is triggered by an event to STS.  However, none of the main-thread cleanup (DetachMedia_m on all the MediaStreams and clearing the lists of MediaStreams) is guaranteed to have happened or even started.

This means that after profile-change-net-teardown, it may be a while before all the PCs are closed and MediaStreams released, and in the meantime Shutdown may have progressed to the point where operations stop working: native events stop being processed, and even DispatchToMainThread may stop working.

Perhaps we need a way to synchronously kill (Destroy(), but perhaps not yet delete) all the objects using MediaStreams, while firing off asynchronous Shutdowns of the transports, which don't require anything more complex than object deletion.  The Destroy()'d objects would release their MediaStreams immediately but continue to exist until the STS-thread shutdowns have finished.

Also: if the real fix is risky (to be seen), is there a safe way to wallpaper this, perhaps at the risk of leaking, or even a safe (but rare) shutdown crash instead of hanging?
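
A very rough sketch of the Destroy()-but-don't-delete idea, with hypothetical names (none of these are real PeerConnection classes):

    // The object drops its MediaStream/MSG references synchronously on the
    // main thread during net-teardown, queues the transport shutdown to the
    // STS thread, and stays alive (is not deleted) until that completes.
    struct MediaUserSketch {
      void* mStreamRef;                    // stands in for a MediaStream reference

      void Destroy()                       // main thread, at profile-change-net-teardown
      {
        mStreamRef = nullptr;              // release the MediaStream reference *now*
        QueueTransportShutdownToSTS();     // transports go down asynchronously
      }

      void QueueTransportShutdownToSTS() { /* dispatch to the STS thread */ }
    };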
I wonder if it'd be possible to allow nsAppShell to exit itself?  To have it somehow detect when everything else is gone and turn itself off?
https://hg.mozilla.org/mozilla-central/rev/42e458069da3
(In reply to Adam Roach [:abr] from comment #41)
> Second patch (not a fix) landed:

Does that mean the bug should stay open?
This isn't fixed. Flipping back to affected.
In case anyone else tries to figure this out, I wanted to share a quick bit of my bug notes. The key (relevant) function calls that are going on when we close() a PeerConnection are the following (each level of indentation represents another level down on the call stack):

> On main thread:
>   PeerConnection.close()
>     PeerConnectionImpl::Close
>       PeerConnectionImpl::CloseInt
>         PeerConnectionImpl::ShutdownMedia
>           PeerConnectionMedia::SelfDestruct
>             dispatch to STS...
> 
> On STS thread:
>   PeerConnectionMedia::ShutdownMediaTransport_s
>     LocalSourceStreamInfo::DetachTransport_s
>       MediaPipeline::ShutdownTransport_s
>         PipelineTransport::Detach
>     RemoteSourceStreamInfo::DetachTransport_s
>       MediaPipeline::ShutdownTransport_s
>         PipelineTransport::Detach
>     Dispatch to Main...
> 
> On Main thread:
>   PeerConnectionMedia::SelfDestruct_m
>     LocalSourceStreamInfo::DetachMedia_m
>       MediaPipeline::ShutdownMedia_m
>         MediaPipeline::DetachMediaStream
>           MediaStream::RemoveListener
>             MediaStream::RemoveListenerImpl
>               PipelineListener::~PipelineListener
>                 Dispatch to main: MediaSessionConduit::~MediaSessionConduit
>                 (This is where we think things are maybe going wrong)
>     RemoteSourceStreamInfo::DetachMedia_m
>       MediaPipeline::ShutdownMedia_m
>         MediaPipeline::DetachMediaStream
>           MediaStream::RemoveListener
>             MediaStream::RemoveListenerImpl
>               PipelineListener::~PipelineListener
>                 Dispatch to main: MediaSessionConduit::~MediaSessionConduit
>                 (This is where we think things are maybe going wrong)
>     PeerConnectionMedia::Release
I'm looking at moving the DetachMedia_m() calls to the end of the first mainthread sequence (after dispatching to STS), except for ~PipelineListener (since STS may try to access listener_).

This will release the MediaStreams and MSG graph in the main profile-change-net-teardown notification, avoiding them staying alive until "some random time" during late shutdown.

I have a patch I'd like cdiehl to try, since he can reproduce this.  It still needs a little cleanup (removing some commented-out code), and probably some checking that there aren't any other races between STS and the pipeline detaching from the stream, but it's clean in local mochitest runs.
Flags: needinfo?(cdiehl)
Comment on attachment 747535 [details] [diff] [review]
Move more of PeerConnectionMedia shutdown to occur synchronously

abr: basically the patch you looked over, plus cleaning up the aIsSynchronous stuff, and releasing mMediaStream in DetachMedia
Attachment #747535 - Flags: review?(ekr)
Attachment #747535 - Flags: review?(adam)
Assignee: adam → rjesup
Priority: -- → P1
Whiteboard: [webrtc][blocking-webrtc+][webrtc-uplift][leave-open] → [webrtc][blocking-webrtc+][webrtc-uplift]
Comment on attachment 747535 [details] [diff] [review]
Move more of PeerConnectionMedia shutdown to occur synchronously

Review of attachment 747535 [details] [diff] [review]:
-----------------------------------------------------------------

With the caveat that I don't understand how the MediaPipeline hangs together with its various listeners, this sequence re-ordering would seem to ensure that the media streams are shut down during the XPCOM shutdown events. If I understand the genesis of this particular problem, I agree that doing so should eliminate the issue.

I'll leave it to ekr's review to verify that reordering the transport detach versus the media stream detach is a safe thing to do, as I really don't understand this part of the system well enough to evaluate that.
Attachment #747535 - Flags: review?(adam) → review+
Comment on attachment 747535 [details] [diff] [review]
Move more of PeerConnectionMedia shutdown to occur synchronously

Review of attachment 747535 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/mediapipeline/MediaPipeline.h
@@ +315,5 @@
>    // Called on the main thread.
>    virtual void DetachMediaStream() {
>      ASSERT_ON_THREAD(main_thread_);
>      stream_->RemoveListener(listener_);
> +    // The MediaStreamGraph still has a reference for a while.

"a reference to the listener for a while"
Detach MediaStreams before queuing the transport shutdown; stop video rendering when detaching the streams to avoid callbacks through the Renderer and Listener; and don't drop refs to any of the listeners until the pipelines themselves go away.  Passes ASAN (on Mac) mochitest runs (20+ iterations, in about half of which I hit Cmd-Q while a peerconnection test was in the middle of running).  Also passes ASAN signaling, mediaconduit and mediapipeline unit tests.
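
A stripped-down sketch of the new ordering (hypothetical types; the method names follow the call flow in the earlier comment, but the bodies are illustrative, not the patch itself):

    #include <vector>

    struct PipelineSketch {
      void ShutdownMedia_m()     { /* RemoveListener(), DetachRenderer() */ }
      void ShutdownTransport_s() { /* close rtp/rtcp transports, on STS  */ }
    };

    struct MediaSketch {
      std::vector<PipelineSketch*> pipelines;

      void SelfDestruct_m() {                // runs on the main thread
        for (size_t i = 0; i < pipelines.size(); ++i)
          pipelines[i]->ShutdownMedia_m();   // detach streams/renderers *now*
        DispatchToSTS();                     // transports die asynchronously
      }

      void DispatchToSTS() { /* queue ShutdownTransport_s() for each pipeline */ }
    };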
Attachment #747535 - Attachment is obsolete: true
Attachment #747535 - Flags: review?(ekr)
Attachment #748365 - Attachment is obsolete: true
Comment on attachment 748368 [details] [diff] [review]
Move more of PeerConnectionMedia shutdown to occur synchronously

Been running a continuous ASAN mochitest loop for a few hours now.  Probably a few hundred runs; no failures.

Replacing the conduit_ = nullptr with a new DetachRenderer() call seems to resolve all the issues with synchronizing the two.
Attachment #748368 - Flags: review?(ekr)
Comment on attachment 748368 [details] [diff] [review]
Move more of PeerConnectionMedia shutdown to occur synchronously

Review of attachment 748368 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm with the changes below.

If you think these changes are wrong, pls. circle back with me
and we can discuss.

::: media/webrtc/signaling/src/media-conduit/VideoConduit.cpp
@@ +315,5 @@
> +  {
> +    mPtrViERender->StopRender(mChannel);
> +    mPtrViERender->RemoveRenderer(mChannel);
> +    mEngineRendererStarted = false;
> +    mRenderer = nullptr;

The structure here is that the conduit is the renderer for ViE
and then just thunks to the renderer. The AddRenderer happens
in Init(). So, the result of this code is:

1. If you DetachRenderer(), you can't re-attach b/c no video
will ever be rendered to the conduit.
2. DetachRenderer causes multiple calls to mPtrViERender->RemoveRenderer()

This code should:
1. Be conditional on mRenderer.
2. Not call RemoveRenderer.

If you want to call StopRender() and set mEngineRendererStarted to false as an optimization, that's fine...
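
A sketch of the shape being requested here (the member names follow the quoted hunk; the surrounding VideoConduit class is assumed, so this is illustrative rather than the final patch):

    void DetachRendererSketch()
    {
      if (mRenderer) {                        // harmless to call more than once
        mPtrViERender->StopRender(mChannel);  // optional optimization, per above
        mRenderer = nullptr;
      }
      // No RemoveRenderer() here: the ViE-level renderer registered in Init()
      // stays in place, so a later AttachRenderer() still works.
    }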

::: media/webrtc/signaling/src/mediapipeline/MediaPipeline.h
@@ +107,5 @@
>    // Must be called on the main thread.
>    void ShutdownMedia_m() {
>      ASSERT_ON_THREAD(main_thread_);
>  
> +    // rtp/rtcp_transport_'s are being destroyed asynchronously to this

This isn't the calling sequence in the caller.

Prefer that:
1. The sequence MUST be ShutdownMedia_m()
2. This is enforced by asserts in ShutdownTransport_s()

@@ +489,5 @@
>    virtual void DetachMediaStream() {
>      ASSERT_ON_THREAD(main_thread_);
>  
>      listener_->EndTrack();
> +    // stop generating video; stop referencing PipelineRenderer and listener

This comment is pretty confusing.
Attachment #748368 - Flags: review?(ekr) → review+
(In reply to Eric Rescorla (:ekr) from comment #54)
> Comment on attachment 748368 [details] [diff] [review]
> Move more of PeerConnectionMedia shutdown to occur synchronously
> 
> Review of attachment 748368 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> lgtm with the changes below.

Thanks

> ::: media/webrtc/signaling/src/media-conduit/VideoConduit.cpp
> @@ +315,5 @@
> > +  {
> > +    mPtrViERender->StopRender(mChannel);
> > +    mPtrViERender->RemoveRenderer(mChannel);
> > +    mEngineRendererStarted = false;
> > +    mRenderer = nullptr;
> 
> The structure here is that the conduit is the renderer for ViE
> and then just thunks to the renderer. The AddRenderer happens
> in Init(). So, the result of this code is:
> 
> 1. If you DetachRenderer(), you can't re-attach b/c no video
> will ever be rendered to the conduit.
> 2. DetachRenderer causes multiple calls to mPtrViERender->RemoveRenderer()
> 
> This code should:
> 1. Be conditional on mRenderer.
> 2. Not call RemoveRenderer.
> 
> If you want to call StopRender() and set mEngineRendererStarted to false as an
> optimization, that's fine...

I agree that if we have an AttachRenderer() and a DetachRenderer(), it makes sense to make them callable multiple times (you could call AttachRenderer() multiple times before, but never stop rendering - null wasn't allowed as a renderer). I optimized/cleaned it up further, and simply removed mEngineRendererStarted as it's redundant with mRenderer.

> 
> ::: media/webrtc/signaling/src/mediapipeline/MediaPipeline.h
> @@ +107,5 @@
> >    // Must be called on the main thread.
> >    void ShutdownMedia_m() {
> >      ASSERT_ON_THREAD(main_thread_);
> >  
> > +    // rtp/rtcp_transport_'s are being destroyed asynchronously to this
> 
> This isn't the calling sequence in the caller.
> 
> Prefer that:
> 1. The sequence MUST be ShutdownMedia_m()
> 2. This is enforced by asserts in ShutdownTransport_s()

Done; asserts on stream_ being null in ShutdownTransport_s() to verify media has been shutdown.
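
Roughly what that assert looks like (illustrative; sts_thread_ is an assumed member name, the rest follows the quoted MediaPipeline code):

    void ShutdownTransport_s() {
      ASSERT_ON_THREAD(sts_thread_);
      MOZ_ASSERT(!stream_,
                 "ShutdownMedia_m() must have run (and nulled stream_) first");
      // ... detach rtp_transport_/rtcp_transport_ ...
    }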
 
> @@ +489,5 @@
> >    virtual void DetachMediaStream() {
> >      ASSERT_ON_THREAD(main_thread_);
> >  
> >      listener_->EndTrack();
> > +    // stop generating video; stop referencing PipelineRenderer and listener
> 
> This comment is pretty confusing.

I rewrote it (longer, reworded); it's still a bit confusing, I'm afraid.
Flags: needinfo?(cdiehl)
Comment on attachment 748615 [details] [diff] [review]
review nit fixes

Review of attachment 748615 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/media-conduit/VideoConduit.cpp
@@ +59,5 @@
>    if(mPtrViERender)
>    {
> +    if(mRenderer) {
> +      mPtrViERender->StopRender(mChannel);
> +    }

I could set mRenderer = nullptr here.

Also, I'll replace the NULLs with nullptr before landing.

@@ -289,5 @@
>      MOZ_ASSERT(PR_FALSE);
>      return kMediaConduitInvalidRenderer;
>    }
> -  //Assign the new renderer - overwrites if there is already one
> -  mRenderer = aVideoRenderer;

I suspect this might be a TSan data race, just never seen because we never actually overwrite them.  I think we should make it a requirement that you Detach() before Attach(); or, if mRenderer is already set, we need to StopRender, change it, and StartRender.  Dealing with this extant (but latent) issue can be a follow-on.
https://hg.mozilla.org/integration/mozilla-inbound/rev/738b8462acf4

Christoph, can you verify this fixes the original problems?
Flags: needinfo?(cdiehl)
:jesup, will do first thing tomorrow morning.
Flags: needinfo?(cdiehl)
Comment on attachment 748649 [details] [diff] [review]
fix mediapipeline_unittests to match new assertion

The last-second addition of an assertion showed that the mediapipeline unit tests (which try to mimic peerconnection) needed to be revised to match.  The code is actually safe without the assertion, so it was just causing an orange on debug builds.
Attachment #748649 - Flags: review?(ekr)
Comment on attachment 748649 [details] [diff] [review]
fix mediapipeline_unittests to match new assertion

Review of attachment 748649 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #748649 - Flags: review?(ekr) → review+
FYI, since I don't think I put it in this bug:  here's a Try of a close-to-final patch with many retriggers:
https://tbpl.mozilla.org/?tree=Try&rev=4331654245f8
Comment on attachment 748368 [details] [diff] [review]
Move more of PeerConnectionMedia shutdown to occur synchronously

[Approval Request Comment]
Bug caused by (feature/regressing bug #): likely bug 844493

User impact if declined: (easily triggered) hangs on some Mac systems if you quit at the "wrong" time.  cdiehl has also provoked a UAF in cubeb, apparently due to the MSG and audio outputs remaining active too far into shutdown.  This is a blocker for the WebRTC/PeerConnection feature.

Testing completed (on m-c, etc.): On m-i, expect to merge to m-c in the morning and thus go to Aurora 23.

Risk to taking this patch (and alternatives if risky): fairly low risk, cleans up shutdown paths to be more predictable with regards to timing in the XPCOM shutdown sequence.  Heavy testing locally, both standard-build mochitests, and opt-build ASAN tests.

String or IDL/UUID changes made by this patch: the IPeerConnection UUID changed; this is an internal interface for PeerConnection.js that nothing else should be using (and it's known that nothing does).

NOTE: this request would be for this patch, the nit-fix follow-on and the fix-unittests patch also on this bug
Attachment #748368 - Flags: approval-mozilla-aurora?
Comment on attachment 748615 [details] [diff] [review]
review nit fixes

[Approval Request Comment]
See first patch up for approval in this bug
Attachment #748615 - Flags: review+
Attachment #748615 - Flags: approval-mozilla-aurora?
Comment on attachment 748649 [details] [diff] [review]
fix mediapipeline_unittests to match new assertion

[Approval Request Comment]
See first patch up for approval in this bug
Attachment #748649 - Flags: approval-mozilla-aurora?
I have tested this again with:

m-i changeset: 131779:caed788840dd 

and I cannot get an ASan crash anymore, though the hang of bug 864371 remains.

As noted, the ASan crash was only rarely reproducible during my tests, so I cannot say for sure that this is fixed just because I haven't seen it during my latest runs.
cdiehl: can you get updated thread lists, and backtraces on mainthread, MSG thread, MediaManager thread, and then if possible threads 2, 3 and 4?

Also: what exact Mac OS version, hardware, and camera/driver do you have where you see this?

Thanks!
Flags: needinfo?(cdiehl)
> (3) "Exit() called redundantly"

As was noticed way back in comment #7, nsAppShell::Exit() is called at least twice.

I wonder if, as a hack, it'd help to get rid of the first call and only leave the second?

No, I don't (yet) know how to do this, but (if need be) I may be able to spend some time towards the end of this week finding out.
:jesup, I have attached the thread information. The system information is inside that file as well.

I don't know which exact camera driver I am using (I don't know where to obtain that information) but it is a MacBook Pro 2012 edition with Retina display.
Flags: needinfo?(cdiehl)
Comment on attachment 748368 [details] [diff] [review]
Move more of PeerConnectionMedia shutdown to occur synchronously

[Approval Request Comment]
See the Aurora nomination, since we've moved Aurora to Beta today.  Since there was no merge done Sunday or Monday morning (as there usually is), we'll need uplift to both Aurora and Beta.
Attachment #748368 - Flags: approval-mozilla-beta?
Comment on attachment 748615 [details] [diff] [review]
review nit fixes

[Approval Request Comment]
see previous
Attachment #748615 - Flags: approval-mozilla-beta?
Comment on attachment 748649 [details] [diff] [review]
fix mediapipeline_unittests to match new assertion

[Approval Request Comment]
See previous
Attachment #748649 - Flags: approval-mozilla-beta?
Cdiehl's results make me doubt that these patches will make any difference to this bug.  Do we have any record of these crashes in Socorro -- something we can use to see if landing these patches makes any difference?
Hangs generally don't show up in Socorro.  The cubeb crash was hard to hit even for cdiehl and requires actions most developers/users aren't likely to take (shutting down the browser in the middle of opening a device).  The hang (bug 864371) is something I've reluctantly marked as blocking-, but if we find a safe fix I'd ask for uplift at least to Aurora, and perhaps even to Beta.
Also, the cubeb thing was an ASan UAF trigger; that might easily not show up in Socorro, and even if it did it might have a bizarre signature.
At some point I'd like to follow up some of the ideas I mentioned here about postponing the call to nsAppShell::Exit(), or avoiding it altogether (having nsAppShell turn out the lights when everything else is done).

I'm not sure when that will be.  But when I do I'll ask cdiehl to see if that gets rid of the hangs.  If it does I'll take that as a sign I'm on the right track.

I suspect that the hangs and crashes are closely related, and may even have the same cause.  I suspect the cause is that our (very fragile) shutdown mechanism can do things that require the processing of native events after that's no longer possible -- especially if we're shutting something down (like WebRTC) that makes (or triggers) native calls on multiple threads.
https://hg.mozilla.org/mozilla-central/rev/738b8462acf4
https://hg.mozilla.org/mozilla-central/rev/329620ea8909
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Flags: in-testsuite? → in-testsuite+
Resolution: --- → FIXED
Target Milestone: mozilla23 → mozilla24
Whiteboard: [webrtc][blocking-webrtc+][webrtc-uplift] → [webrtc][blocking-webrtc+][webrtc-uplift][qa-]
Attachment #748368 - Flags: approval-mozilla-beta?
Attachment #748368 - Flags: approval-mozilla-beta+
Attachment #748368 - Flags: approval-mozilla-aurora?
Attachment #748368 - Flags: approval-mozilla-aurora+
Attachment #748615 - Flags: approval-mozilla-beta?
Attachment #748615 - Flags: approval-mozilla-beta+
Attachment #748615 - Flags: approval-mozilla-aurora?
Attachment #748615 - Flags: approval-mozilla-aurora+
Comment on attachment 748649 [details] [diff] [review]
fix mediapipeline_unittests to match new assertion

IPeerConnection.idl won't be used anywhere else, approving
Attachment #748649 - Flags: approval-mozilla-beta?
Attachment #748649 - Flags: approval-mozilla-beta+
Attachment #748649 - Flags: approval-mozilla-aurora?
Attachment #748649 - Flags: approval-mozilla-aurora+
This has non-trivial conflicts on beta.
Whiteboard: [webrtc][blocking-webrtc+][qa-] → [webrtc][blocking-webrtc+][qa-][adv-main22-]
Group: core-security