Closed Bug 1182289 Opened 9 years ago Closed 9 years ago

WebrtcGmpVideoEncoder::InitEncode/InitDecode go reentrant on main, opening the door to all kinds of problems

Categories

(Core :: WebRTC: Audio/Video, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla42
Tracking Status
firefox39 --- unaffected
firefox40 + fixed
firefox41 + fixed
firefox42 + fixed
firefox-esr31 --- unaffected
firefox-esr38 --- unaffected
b2g-v2.0 --- unaffected
b2g-v2.0M --- unaffected
b2g-v2.1 --- unaffected
b2g-v2.1S --- unaffected
b2g-v2.2 --- unaffected
b2g-v2.2r --- unaffected
b2g-master --- fixed

People

(Reporter: bwc, Assigned: bwc)

References

Details

(Keywords: sec-critical, Whiteboard: [post-critsmash-triage][b2g-adv-main2.5-])

Attachments

(3 files, 8 obsolete files)

55.09 KB, patch
bwc
: review+
Details | Diff | Splinter Review
55.11 KB, patch
Details | Diff | Splinter Review
56.61 KB, patch
abillings
: sec-approval+
Details | Diff | Splinter Review
I've observed this happening from the following stack:

mozilla::PeerConnectionImpl::SetRemoteDescription(int, char const*)  (in XUL) + 3260  [0x101d02e1c]
mozilla::PeerConnectionImpl::SetSignalingState_m(mozilla::dom::PCImplSignalingState, bool)  (in XUL) + 166  [0x101d06266]
mozilla::PeerConnectionMedia::UpdateMediaPipelines(mozilla::JsepSession const&)  (in XUL) + 269  [0x101d12d5d]
mozilla::MediaPipelineFactory::CreateOrUpdateMediaPipeline(mozilla::JsepTrackPair const&, mozilla::JsepTrack const&)  (in XUL) + 301  mozilla::MediaPipelineFactory::GetOrCreateVideoConduit(mozilla::JsepTrackPair const&, mozilla::JsepTrack const&, mozilla::RefPtr<mozilla::MediaSessionConduit>*)  (in XUL) + 4922  [0x101cf3eaa]
mozilla::MediaPipelineFactory::GetOrCreateVideoConduit(mozilla::JsepTrackPair const&, mozilla::JsepTrack const&, mozilla::RefPtr<mozilla::MediaSessionConduit>*)  (in XUL) + 4922  [0x101cf3eaa]
mozilla::WebrtcVideoConduit::ConfigureSendMediaCodec(mozilla::VideoCodecConfig const*)  (in XUL) + 692  [0x101cd70e4]
webrtc::ViECodecImpl::SetSendCodec(int, webrtc::VideoCodec const&)  (in XUL) + 2438  [0x103720da6]                                  
webrtc::ViEEncoder::SetEncoder(webrtc::VideoCodec const&)  (in XUL) + 204  [0x10372168c]
webrtc::vcm::VideoSender::RegisterSendCodec(webrtc::VideoCodec const*, unsigned int, unsigned int)  (in XUL) + 96  [0x103691920]      
webrtc::VCMCodecDataBase::SetSendCodec(webrtc::VideoCodec const*, int, int, webrtc::VCMEncodedFrameCallback*)  (in XUL) + 552  [0x103678478]
webrtc::VCMGenericEncoder::InitEncode(webrtc::VideoCodec const*, int, unsigned int)  (in XUL) + 50  [0x1036788d2]                         
mozilla::WebrtcGmpVideoEncoder::InitEncode(webrtc::VideoCodec const*, int, unsigned int)  (in XUL) + 365  [0x101d2038d]

This allows some more content code to run, which can include things like closing the PeerConnection that is being operated on. Once the queue is flushed, execution will continue on potentially freed objects.

I believe this to be the root cause of bug 1182098, since the same test case triggers this, and bug 1182098 could definitely be caused by the closure of the PeerConnection in the reentrant call.

I am not 100% sure that this could be used to cause a UAF, but it is within the realm of possibility. This is risky enough, and hard enough to analyze, that I'm marking it sec-high. sec-crit could also be justified.
Depending on how long it takes for InitEncode_g to finish running, I see nothing that would stop the full teardown sequence for PeerConnectionImpl/PeerConnectionMedia from completing. If this occurs, we could definitely UAF inside CreateOrUpdateMediaPipeline, since things like |stream| are held as bare pointers.
Rank: 5
Keywords: sec-highsec-critical
Priority: -- → P1
There's a few options I see here:

1. Back out the part of bug 1057908 that replaced the sync dispatches to GMP with async dispatches + reentrant spin. Probably the fastest approach, if it works. Would likely be deadlock prone, but I'm unsure whether it would be any more deadlock prone than what's in release right now. I'll take a possibility of deadlock over a possibility of UAF any day.
2. Make the encoder/decoder init entirely async and give up on being able to return errors to the webrtc.org code. Maybe moderate in difficulty, since we'd need to make sure that when the webrtc.org code tried to use the encoder/decoder, nothing bad happens.
3. Go up a level and async init the GMP encoder/decoder, _then_ call SetSendCodec. We'd need to make sure that we get the lifecycle right, in case the PC gets torn down while these async steps are running.

Jesup, what's your take here? This bug is in Beta right now, and we need to get this fixed and uplifted before this hits release.
Flags: needinfo?(rjesup)
Summary: WebrtcGmpVideoEncoder::InitEncode goes reentrant on main, opening the door to all kinds of problems → WebrtcGmpVideoEncoder::InitEncode/InitDecode go reentrant on main, opening the door to all kinds of problems
It probably makes sense to try option 1... I'm concerned that we might introduce a deadlock late in the process however.  Otherwise, I think we'll need to look at option 3, or find some way to stall the recursion, which may be more "fun".  I don't think we can give up on handling InitEncode/Decode errors, though maybe we could handle them asynchronously (maybe).
Flags: needinfo?(rjesup)
backlog: --- → webRTC+
FWIW, I tried the same test case on release, and it did have a tendency to deadlock in that sync call. I think the thing to do is try doing #1 first and see if things aren't completely broken, uplift that, and then try to implement #3. We might be able to finish #3 in time for an uplift to 40, but I'm not totally sure.
Assignee: nobody → docfaraday
Trying #1 now, but seeing lots of rot. It doesn't look promising.
regarding bug 1181694 (tagged as possible fallout from this): wouldn't this only get triggered if we try to InitEncode/Decode a GMP codec i.e. h.264?  This isn't impossible on Android, but seems unlikely given the volume of crashes.

Also, the bug tagged here as a cause landed in 40 and did not uplift; the crashes definitely include 39, 39bN, and 38.0.5 in the last month.  Further spelunking in crash-stats may show it goes further back, but thus far I haven't seen earlier than 38.0.5 or later than 40b2 or so.  However, aurora/nightly usage on android is low, and I see some indications in the reports that most people hitting it are outside the US... and hints that it might be similar to a bug we hit a while ago where they were using peerconnections to serve some type of "hot women" videos in asia.  That's very soft info, however

Adding link to the bug that caused this and cc-ing peterv
Blocks: 1057908
Definitely goes back to 38.0b4, possibly further (still only looking at 28 days).  No 41 or 42 crashes, but volume on 40bN is low, so that's no real indication the problem has gone away.
Ok, goes back to 37.  I see no crashes in 37bN, just 37 and 37.0.2. It could be the site(s) triggering it didn't show up until this year; there are no crashes before April or so.
Work in progress. Encoder fixup is mostly done, and seems to work. There are a few loose ends though. Have not started on the decoder yet.
Attachment #8634759 - Attachment filename: . → dispatch_fixup.patch
Change of approach, seems cleaner now. Will apply this same approach to the decoder next.
Attachment #8634759 - Attachment is obsolete: true
Apply the same fixup to the decoder, and resolve some TODOs. I need to do some analysis on the mutex locking to be sure, and then improve the logging around error cases.
Attachment #8634964 - Attachment is obsolete: true
Attachment #8635587 - Attachment filename: . → dispatch_fixup.patch
Plumb errors in GMP init back to PeerConnectionImpl, and fix some leaks caused by WebrtcGmpVideoDecoder::InitDoneRunnable spawning spawning nsThreads on webrtc.org threads, and posting stuff to their event loop.
Comment on attachment 8636797 [details] [diff] [review]
Clean up dispatches in WebrtcGmpVideoEncoder/Decoder

There's an awful lot going on in this patch, I'll try to summarize:

1. Created a stub class for GMP-based encoders and decoders for webrtc.org to own, that passes stuff through to the real implementation (which has a refcount). This allows us to async destroy the GMP resources, which solves some deadlock problems.

2. Since the real encoder/decoder implementation's lifetime is decoupled from the webrtc.org stub, we need to be able to shut off Encoded and Decoded callbacks to the handler that webrtc.org registers in a threadsafe way, meaning that we need to protect |mCallback| with a mutex.

3. Made GMP init async.

4. Since GMP init is now async, added some logic to handle the case where webrtc.org called |Release| before init finished (ie; |mInitting|).

5. Fixed data races on |mGMP| and |mCodecParams|.

6. Plumbed GMP init errors back to PeerConnectionImpl (although not in the prettiest way)

7. Fixed cases where errors would cause an GMPVideoEncoder/DecoderProxy to never be closed.
Attachment #8636797 - Attachment filename: . → dispatch_fixup.patch
Attachment #8636797 - Flags: review?(rjesup)
Attachment #8635587 - Attachment is obsolete: true
Attachment #8637232 - Attachment filename: . → dispatch_fixup.patch
Attachment #8637232 - Flags: review?(rjesup)
Attachment #8636797 - Attachment is obsolete: true
Attachment #8636797 - Flags: review?(rjesup)
Comment on attachment 8637232 [details] [diff] [review]
Clean up dispatches in WebrtcGmpVideoEncoder/Decoder

Review of attachment 8637232 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp
@@ +251,5 @@
> +      // This could destroy us, since aGMP may be the last thing holding a ref
> +      // Return immediately.
> +      aGMP->Close();
> +    }
> +    *aErrorOut = "GMP Encode: Either init was aborted, "

Scary, but safe.
Please move to before if() (i.e. before "return immediately") though it's safe as-is.

@@ +297,5 @@
> +  mInitting = false;
> +
> +  if (gmp) {
> +    // Do this last, since this could cause us to be destroyed
> +    gmp->Close();

ditto scary but safe

@@ +727,5 @@
> +      // This could destroy us, since aGMP may be the last thing holding a ref
> +      // Return immediately.
> +      aGMP->Close();
> +    }
> +    *aErrorOut = "GMP Decode: Either init was aborted, "

Ditto move

@@ +769,5 @@
> +  mInitting = false;
> +
> +  if (gmp) {
> +    // Do this last, since this could cause us to be destroyed
> +    gmp->Close();

ditto scary

::: media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.h
@@ +81,4 @@
>    public:
> +    explicit GmpInitDoneRunnable(const std::string& aPCHandle)
> +      : mResult(WEBRTC_VIDEO_CODEC_OK),
> +      mPCHandle(aPCHandle)

Minor nit: initializer list style.  Match others in the file

@@ +105,5 @@
>        mResult = aResult;
> +      mError = aError;
> +      nsCOMPtr<nsIThread> mainThread(do_GetMainThread());
> +      if (mainThread) {
> +        mainThread->Dispatch(this, NS_DISPATCH_NORMAL);

NS_ASSERTION if we fail to get mainthread... This really shouldn't be happening after xpcom-thread-shutdown.
also: do_AddRef(this)

@@ +123,3 @@
>  
> +class WebrtcGmpVideoEncoder : public GMPVideoEncoderCallbackProxy
> +{

general comment on the class - 'override' statements?

@@ +293,5 @@
> +    virtual ~WebrtcVideoEncoderProxy()
> +    {
> +      // TODO: Is this necessary? Are we guaranteed to get a call to Release()
> +      // before this?
> +      Release();

I think if you need to call Release() on webrtc.org things and don't, they typically leak. I'm ok with leaking if that's not called, or MOZ_CRASH even if Release wasn't called.
Attachment #8637232 - Flags: review?(rjesup) → review+
(In reply to Randell Jesup [:jesup] from comment #18)

> @@ +123,3 @@
> >  
> > +class WebrtcGmpVideoEncoder : public GMPVideoEncoderCallbackProxy
> > +{
> 
> general comment on the class - 'override' statements?
> 

   This class doesn't implement WebrtcVideoEncoder, since that API has |Release| in it, and it doesn't mean the same thing as XPCOM uses it for. None of those virtual methods are overriding anything.

> @@ +293,5 @@
> > +    virtual ~WebrtcVideoEncoderProxy()
> > +    {
> > +      // TODO: Is this necessary? Are we guaranteed to get a call to Release()
> > +      // before this?
> > +      Release();
> 
> I think if you need to call Release() on webrtc.org things and don't, they
> typically leak. I'm ok with leaking if that's not called, or MOZ_CRASH even
> if Release wasn't called.

   Let me try removing this. If we see leaks, we can put it back.
Incorporate feedback.
Attachment #8637232 - Attachment is obsolete: true
Comment on attachment 8637474 [details] [diff] [review]
Clean up dispatches in WebrtcGmpVideoEncoder/Decoder

Carry forward r=jesup.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=83a3ca591cf3

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

   This patch is pretty complicated, and fixes lots of security problems. An astute reader could work out the reentrancy problem I suppose, and once that's done it is fairly easy to cause it. I am not sure how easy it would be to exploit, though.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

   Not exactly. It mainly looks like assorted cleanup, and is a fairly complex patch.

Which older supported branches are affected by this flaw?

   40 and later are affected by the reentrancy bug, it is possible that earlier versions are affected by other miscellaneous threadsafety bugs.

If not all supported branches, which bug introduced the flaw?

   Bug 1057908

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

   I have not tried backporting yet. Looking at the log on WebrtcGmpVideoCodec.cpp, it looks like we'll see some rot, but no radical changes have happened there since bug 1057908. I think it will be easy to deal with.

How likely is this patch to cause regressions; how much testing does it need?
Attachment #8637474 - Flags: sec-approval?
Attachment #8637474 - Flags: review+
As for how likely this patch is to cause regressions, there is some risk that a new deadlock has been introduced (although multiple deadlocks have been fixed). Since this also overhauls the lifecycle of WebrtcGmpVideoEncoder/Decoder, there is some risk that I made a mistake in a corner-case somewhere, but we're already dealing with a bug that is at least that bad.
Comment on attachment 8637474 [details] [diff] [review]
Clean up dispatches in WebrtcGmpVideoEncoder/Decoder

sec-approval+ for trunk. Let's get this on Aurora and Beta too if you can nominate or make new patches for those branches.
Attachment #8637474 - Flags: sec-approval? → sec-approval+
(In reply to Randell Jesup [:jesup] from comment #18)
> 
> @@ +105,5 @@
> >        mResult = aResult;
> > +      mError = aError;
> > +      nsCOMPtr<nsIThread> mainThread(do_GetMainThread());
> > +      if (mainThread) {
> > +        mainThread->Dispatch(this, NS_DISPATCH_NORMAL);
> 
> NS_ASSERTION if we fail to get mainthread... This really shouldn't be
> happening after xpcom-thread-shutdown.
> also: do_AddRef(this)

   This doesn't compile on CI. For some reason, the compiler used on CI thinks that |this| is a const pointer, even though we aren't in a const function. clang on my linux box compiles this just fine. Not sure what is going on here, but it seems like a compiler bug or something. Given that it doesn't matter what thread the runnable is destroyed on, does this matter?
(In reply to Byron Campen [:bwc] from comment #24)
> (In reply to Randell Jesup [:jesup] from comment #18)
> > 
> > @@ +105,5 @@
> > >        mResult = aResult;
> > > +      mError = aError;
> > > +      nsCOMPtr<nsIThread> mainThread(do_GetMainThread());
> > > +      if (mainThread) {
> > > +        mainThread->Dispatch(this, NS_DISPATCH_NORMAL);
> > 
> > NS_ASSERTION if we fail to get mainthread... This really shouldn't be
> > happening after xpcom-thread-shutdown.
> > also: do_AddRef(this)
> 
>    This doesn't compile on CI. For some reason, the compiler used on CI
> thinks that |this| is a const pointer, even though we aren't in a const
> function. clang on my linux box compiles this just fine. Not sure what is
> going on here, but it seems like a compiler bug or something. Given that it
> doesn't matter what thread the runnable is destroyed on, does this matter?

There are a couple odd cases where you need to use do_AddRef(static_cast<nsIRunnable*>(this));

That said, I have no idea why it does this - perhaps a unified-build thing (what other includes have been processed)
Attempt to work around compiler bug.
Attachment #8637474 - Attachment is obsolete: true
Comment on attachment 8637988 [details] [diff] [review]
Clean up dispatches in WebrtcGmpVideoEncoder/Decoder

Carry forward r=jesup.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1629c75e08e5
Attachment #8637988 - Attachment filename: . → dispatch_fixup.patch
Attachment #8637988 - Flags: review+
Remove sync dispatch in SetRates, since I caught it deadlocking during some testing.
Comment on attachment 8638066 [details] [diff] [review]
Clean up dispatches in WebrtcGmpVideoEncoder/Decoder

Carry forward r=jesup

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b6d0e08bdcb
Attachment #8638066 - Attachment filename: . → dispatch_fixup.patch
Attachment #8638066 - Flags: review+
Attachment #8637988 - Attachment is obsolete: true
Check back on try
Flags: needinfo?(docfaraday)
Comment on attachment 8638086 [details] [diff] [review]
(Aurora backport) Clean up dispatches in WebrtcGmpVideoEncoder/Decoder

Approval Request Comment
[Feature/regressing bug #]:

   Bug 1057908

[User impact if declined]:

   Malicious (or stupid) sites can cause crashes and perhaps even take advantage of this bug to do various nefarious things.

[Describe test coverage new/current, TreeHerder]:

   The current test coverage is not sufficient to trigger this. There is a test-case on bug 1182098 that can trigger this bug reliably, but it also has a tendency to cause crashes in other parts of the code. We are probably not yet at a point where things are stable enough in the face of this kind of test to automate, sadly.

[Risks and why]:

   There is some risk that a new deadlock has been introduced (although multiple deadlocks have been fixed). Since this also overhauls the lifecycle of WebrtcGmpVideoEncoder/Decoder, there is some risk that I made a mistake in a corner-case somewhere, but we're already dealing with a bug that is at least that bad.

[String/UUID change made/needed]:

   None.
Flags: needinfo?(docfaraday)
Attachment #8638086 - Flags: approval-mozilla-aurora?
Comment on attachment 8638088 [details] [diff] [review]
(Beta backport) Clean up dispatches in WebrtcGmpVideoEncoder/Decoder

See approval request for aurora backport.
Attachment #8638088 - Flags: approval-mozilla-beta?
Attachment #8638086 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8638088 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This already landed on inbound.
https://hg.mozilla.org/integration/mozilla-inbound/rev/daf1ef98f111

Byron, for future reference, pulsebot can't comment in s-s bugs, so please don't forget to mark them when you push.
Flags: needinfo?(docfaraday)
Keywords: checkin-needed
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Seeing MOZ_ASSERTs on Android mozilla-beta:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=ef641c2968c2

The specific assertion shows that test_dataChannel_bug1013809.html causes main to go reentrant when offer/answer concludes, but only on Android. This specific reentrancy was not causing intermittent crashes before because the reentrant callstack was operating on a separate PeerConnection. Sadly, the stack is not deep enough to determine where this happens. Additionally, this is not an H264 test, so chances are the bug does not lie in GMP land.

Jesup, where would be a good place to start looking?
Flags: needinfo?(docfaraday) → needinfo?(rjesup)
Flags: needinfo?(rjesup)
Disable test_dataChannel_bug1013809.html on android, so we can get this landed.
Attachment #8638088 - Attachment is obsolete: true
Comment on attachment 8640087 [details] [diff] [review]
(Beta backport) Clean up dispatches in WebrtcGmpVideoEncoder/Decoder

Asking for approval for disabling test_dataChannel_bug1013809.html, due to bug 1188590. Bug 1188590 is currently blocked by bug 1184117, and we really want to get this fix into beta.
Attachment #8640087 - Flags: sec-approval?
Attachment #8640087 - Flags: sec-approval? → sec-approval+
Whiteboard: [post-critsmash-triage]
Group: core-security → core-security-release
Group: core-security-release
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][b2g-adv-main2.5-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: