races with LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP and NotifyOutputData()

RESOLVED FIXED in Firefox -esr52

Status

()

defect
P1
normal
Rank:
7
RESOLVED FIXED
2 years ago
Last year

People

(Reporter: karlt, Assigned: karlt)

Tracking

({crash, csectype-wildptr, sec-high})

Trunk
mozilla59
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox-esr5258+ fixed, firefox56 wontfix, firefox57 wontfix, firefox58+ fixed, firefox59 fixed)

Details

(Whiteboard: [adv-main58+][adv-esr52.6+][post-critsmash-triage])

Attachments

(3 attachments, 4 obsolete attachments)

1) When MediaStreamGraphImpl::UpdateMainThreadState() determines that the
   graph is empty, it sets
   mLifecycleState = LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP and then
   releases the graph monitor.

Consider that GraphDriver is an AudioCallbackDriver.

Consider that a new message is subsequently received:

A2) If MediaStreamGraphImpl::RunInStableState() calls Revive() on the
    GraphDriver.

Typically there is a NextDriver() because there are no audio outputs when the
graph is empty.  In this case:

Aa3) Revive() calls ThreadedDriver::Start(), which triggers
     MediaStreamGraphInitThreadRunnable(), which schedules an async Stop() of
     the AudioCallbackDriver and calls RunThread()

Aa4) AudioCallbackDriver::DataCallback() has not yet completed and calls
     NotifyOutputData().  This accesses mAudioInputs which can race if new
     mAudioInputs are added to the graph from RunThread().

Aa5) AudioCallbackDriver::Stop() calls stop on the cubeb stream, and
     DataCallback() returns to indicate that the stream drain.

What if there is not a NextDriver()?
The only case I can currently see where UpdateStreamOrder() would not
SetNextDriver() on an empty graph would be if
!AudioCallbackDriver::IsStarted().

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d77478b6f51a810f54535929b5ad5699578bd3d9
adds assertions that this case is not happening, and the tests do not detect.

In this case:

Ab3) Revive() triggers AudioCallbackDriver::Start() which calls destroy() on
     the old stream and starts a new stream.  stop() has not been called on
     the old stream.

Ab4) AudioCallbackDriver::mBuffer is modified during DataCallback() for the
     old stream.  Presumably this races with mBuffer access from
     DataCallback() for the new stream.  NotifyOutputData() has a similar
     race.
   
Ab5) DataCallback() returns to indicate that the old cubeb stream drain.

Consider that no new message is received.

C2) MediaStreamGraphImpl::RunInStableState() dispatches
    MediaStreamGraphShutDownRunnable to main thread.

C3) MediaStreamGraphShutDownRunnable::Run() calls Destroy() on the graph which
    releases its self reference.

C4) AudioCallbackDriver::DataCallback() calls NotifyOutputData() on the graph
    which no longer exists.
This race would be hard to engineer.
Assignee: nobody → karlt
Status: NEW → ASSIGNED
Keywords: sec-high
Group: core-security → media-core-security
Amazing analysis, thanks, I wonder how you found this.

Also, I'm currently rewriting a bunch of things around mAudioInputs, currently thinking about the optimal lifetime stories for all this. We'll see how it goes, but thanks for the detailed scenario, I'll run it against my new designs.
Thanks, Paul.  Sorry, this path should be fine due to the NS_DISPATCH_SYNC in AudioCallbackDriver::Shutdown() before C3.

(In reply to Karl Tomlinson (:karlt) from comment #0)
> Consider that no new message is received.
> 
> C2) MediaStreamGraphImpl::RunInStableState() dispatches
>     MediaStreamGraphShutDownRunnable to main thread.
> 
> C3) MediaStreamGraphShutDownRunnable::Run() calls Destroy() on the graph
> which
>     releases its self reference.
> 
> C4) AudioCallbackDriver::DataCallback() calls NotifyOutputData() on the graph
>     which no longer exists.

I'm planning to add AsyncCubebOperation::SHUTDOWN for Ab3, which should also resolve Ab4.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6d715927e365367cf0dbb26350ed4088f27aab2f

The Revive() path is an edge case that is hard to test.

For Aa4, I'm planning to move the transition to LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP to the end of DataCallback(), and something similar for ThreadedDriver().  Perhaps that's unnecessary if moving NotifyOutputData() is part of your plans.  Even if it's not, then moving LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP may still help with maintaining the concept of a single graph thread at a time.
The races with pointers in mAudioInputs are still technically sec-critical and mitigating circumstances reduce this to only sec-high, even if very difficult to reproduce in this case, and so sec-approval will be required.

The patches are tested on try (without referencing this bug):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e62b7431c4b51ab96491180b861c813906a8867

I don't know the consequences of destroying a cubeb stream before stop.
This try run, with no stops, didn't find any obvious problems,
assuming "Assertion failure: [GFX1]: BlitImage(SurfaceTextureImage)" is not related.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8dba813a6f6c44442968e66c25bace2551b65dca

Should we backport the missing stop?
Attachment #8922609 - Attachment description: remove unused GraphDriver::Destroy() → part 2: stop cubeb stream before it gets destroyed when starting new stream on Revive()
Comment on attachment 8922610 [details] [diff] [review]
part 3: assert that an old stream is stopped before a new stream is created

Moving the implicit cubeb_stream_destroy() from Init() to Stop() makes the
assertion possible.
Attachment #8922610 - Attachment description: stop cubeb stream before it gets destroyed when starting new stream on Revive() → part 3: assert that an old stream is stopped before a new stream is created
Attachment #8922614 - Flags: review?(padenot)
The NotifyOutput race was introduced for 46 in
https://hg.mozilla.org/mozilla-central/rev/2d18cbea1800c23f06d78fc7c9c64503d2ab1ce3

The stop was also missing at that point.
Comment on attachment 8922613 [details] [diff] [review]
part 4: move to LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP at end of iteration

Review of attachment 8922613 [details] [diff] [review]:
-----------------------------------------------------------------

A little comment. r+ however.

::: dom/media/GraphDriver.cpp
@@ +332,5 @@
> +      LOG(LogLevel::Debug,
> +          ("MediaStreamGraph %p waiting for main thread cleanup", this));
> +      mGraphImpl->mLifecycleState =
> +        MediaStreamGraphImpl::LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP;
> +      mGraphImpl->EnsureStableStateEventPosted();

It feels weird to touch mLifecycleState and EnsureStableStateEventPosted from a driver. I'd push this into the MediaStreamGraph, at the end of OneIteration.
Attachment #8922613 - Flags: review?(padenot) → review+
Attachment #8922614 - Flags: review?(padenot) → review+
Comment on attachment 8922609 [details] [diff] [review]
part 2: stop cubeb stream before it gets destroyed when starting new stream on Revive()

Review of attachment 8922609 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/GraphDriver.cpp
@@ +797,5 @@
>          ("Starting audio threads for MediaStreamGraph %p from a new thread.",
>           mGraphImpl));
> +    // "`cubeb_stream_stop` MUST be called before destroying a stream."
> +    // The AsyncCubebTasks are performed sequentially and so shutdown
> +    // completes before init.

This is perfectly true. However, in the backends, code, we see:
- We join the thread on windows: https://github.com/kinetiknz/cubeb/blob/master/src/cubeb_wasapi.cpp#L1904 (including a little dance in a weird case)
- On OSX: https://github.com/kinetiknz/cubeb/blob/master/src/cubeb_audiounit.cpp#L2617. More over, we have the guarantee that AudioOutputUnitStop is synchronous. If called outside the callback, the callback will not be called after it has returned.
- On Linux/Pulse: https://github.com/kinetiknz/cubeb/blob/master/src/cubeb_pulse.c#L957 the callback is actually removed

Have you observed this being not true? Maybe I'm missing something.
Attachment #8922609 - Flags: review?(padenot) → review+
Attachment #8922610 - Flags: review?(padenot) → review+
(In reply to Paul Adenot (:padenot) from comment #12)
> This is perfectly true. However, in the backends, code, we see:
> - We join the thread on windows:
> https://github.com/kinetiknz/cubeb/blob/master/src/cubeb_wasapi.cpp#L1904
> (including a little dance in a weird case)
> - On OSX:
> https://github.com/kinetiknz/cubeb/blob/master/src/cubeb_audiounit.cpp#L2617.
> More over, we have the guarantee that AudioOutputUnitStop is synchronous. If
> called outside the callback, the callback will not be called after it has
> returned.
> - On Linux/Pulse:
> https://github.com/kinetiknz/cubeb/blob/master/src/cubeb_pulse.c#L957 the
> callback is actually removed
> 
> Have you observed this being not true? Maybe I'm missing something.

The issues reported here are just from analyzing the code because I wanted to
understand it so that I could move some things around for bug 1228226.

The requirement was added to the documentation in
https://github.com/kinetiknz/cubeb/commit/dcc90f40c89cd304a85ed687b59b77d0542dbd2c#diff-a2acb4d7d9909e5853f52f32b203283dR430
How do I find out what comments that is addressing?

Based on the requirement for stream_stop in cubeb documentation, I wondered
whether the stop call may be required to ensure that the callback has returned
before stream_destroy started deleting structures, or a new stream is started.
If stop is not required, then we don't need to backport this patch.

For the pulse backend, completion of the callback seems to depend on pulse
holding the mainloop lock during the entire duration of the write callback,
and on cubeb_stream_destroy() taking the lock before it does anything.  I can
verify the latter, and I assume the former is also true, as "Events that are
dispatched from the event loop thread are executed with this lock held."
https://freedesktop.org/software/pulseaudio/doxygen/thread-mainloop_8h.html#a2fd002f510ed24cabacb63a26bf4c977

Your reasoning for OSX and WASAPI is also convincing.

I haven't found appropriate documentation for OpenSL AUDIOPLAYER, but if
Destroy() is synchronous, then that should be good too.

That would mean that the only purpose of this patch is to comply with the
cubeb documentation.
(In reply to Paul Adenot (:padenot) from comment #11)
> ::: dom/media/GraphDriver.cpp
> @@ +332,5 @@
> > +      LOG(LogLevel::Debug,
> > +          ("MediaStreamGraph %p waiting for main thread cleanup", this));
> > +      mGraphImpl->mLifecycleState =
> > +        MediaStreamGraphImpl::LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP;
> > +      mGraphImpl->EnsureStableStateEventPosted();
> 
> It feels weird to touch mLifecycleState and EnsureStableStateEventPosted
> from a driver. I'd push this into the MediaStreamGraph, at the end of
> OneIteration.

Prior to this patch, because UpdateMainThreadState() returns immediately after
this, it is already at the end of OneIteration().

The problem is that doing this is handing over the graph to another thread,
but DataCallback() uses objects assuming that no other thread is running the
graph.  Perhaps NotifyOutputData() could be called from OneIteration(), but
moving mBuffer.BufferFilled() is more awkward.

http://searchfox.org/mozilla-central/rev/1ebd2eff44617df3b82eea7d2f3ca1b60cc591a0/dom/media/GraphDriver.cpp#1002

I was considering moving UpdateMainThreadState() out of OneIteration() and
calling UpdateMainThreadState() from the drivers.  That would keep all main
thread messages in one place.

What do you think?
Attachment #8923285 - Flags: review?(padenot)
Rank: 7
Priority: -- → P1
(In reply to Karl Tomlinson (:karlt) from comment #13)
> That would mean that the only purpose of this patch is to comply with the
> cubeb documentation.

Except that clearing mAudioStream after stop also makes the test added at
https://hg.mozilla.org/integration/mozilla-inbound/rev/1e4bb3a203a46f97ba0b5bb681bb1fb15e4474e4#l1.16
behave correctly in the case of errors during start from Revive(),
and so there is value in uplifting the patches related to stop.

Due to that need for null mAudioStream during cubeb_stream_init(), it would
be better to move the assert in Init() to before the cubeb_stream_init() call.
The assert could even be at the beginning of the method.
(In reply to Karl Tomlinson (:karlt) from comment #14)
> Created attachment 8923285 [details] [diff] [review]
> part 4 v2: move to LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP at end of
> iteration
> 
> (In reply to Paul Adenot (:padenot) from comment #11)
> > ::: dom/media/GraphDriver.cpp
> > @@ +332,5 @@
> > > +      LOG(LogLevel::Debug,
> > > +          ("MediaStreamGraph %p waiting for main thread cleanup", this));
> > > +      mGraphImpl->mLifecycleState =
> > > +        MediaStreamGraphImpl::LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP;
> > > +      mGraphImpl->EnsureStableStateEventPosted();
> > 
> > It feels weird to touch mLifecycleState and EnsureStableStateEventPosted
> > from a driver. I'd push this into the MediaStreamGraph, at the end of
> > OneIteration.
> 
> Prior to this patch, because UpdateMainThreadState() returns immediately
> after
> this, it is already at the end of OneIteration().
> 
> The problem is that doing this is handing over the graph to another thread,
> but DataCallback() uses objects assuming that no other thread is running the
> graph.  Perhaps NotifyOutputData() could be called from OneIteration(), but
> moving mBuffer.BufferFilled() is more awkward.
> 
> http://searchfox.org/mozilla-central/rev/
> 1ebd2eff44617df3b82eea7d2f3ca1b60cc591a0/dom/media/GraphDriver.cpp#1002
> 
> I was considering moving UpdateMainThreadState() out of OneIteration() and
> calling UpdateMainThreadState() from the drivers.  That would keep all main
> thread messages in one place.
> 
> What do you think?

I like this one better, yes. One of the job of a GraphDriver is to "drive" the graph (hence the name). `UpdateMainThreadState` decides whether or not to continue rendering, so it's conceptually valid.

Touching mLifecycleState and EnsureStateStateEventPosted from the driver felt like an encapsulation breakage.
Attachment #8923285 - Flags: review?(padenot) → review+
Comment on attachment 8922609 [details] [diff] [review]
part 2: stop cubeb stream before it gets destroyed when starting new stream on Revive()

Having noticed https://bugzilla.mozilla.org/show_bug.cgi?id=1382366#c19
I realize that stopping the stream while mAudioStream is set and there is no
scheduled next driver would add another path for similar issues, and so I'll
drop this patch from this bug.  Thanks for pointing out that a missing stop is not a security issue.  I'll post the patch elsewhere when this bug and bug 1382366 are sorted out.
Attachment #8922609 - Attachment is obsolete: true
Comment on attachment 8922610 [details] [diff] [review]
part 3: assert that an old stream is stopped before a new stream is created

This patch depended on attachment 8922609 [details] [diff] [review] and so I'll also drop this patch.
As failure to stop is not currently a security issue, the asserts may be
pointless.

(In reply to Karl Tomlinson (:karlt) from comment #15)
> Except that clearing mAudioStream after stop also makes the test added at
> https://hg.mozilla.org/integration/mozilla-inbound/rev/
> 1e4bb3a203a46f97ba0b5bb681bb1fb15e4474e4#l1.16
> behave correctly in the case of errors during start from Revive(),
> and so there is value in uplifting the patches related to stop.

I'll fix this differently together with fixing bug 1382366.
Attachment #8922610 - Attachment is obsolete: true
For bug 1382366, I want to set state on a graph driver member after finding
out that the graph is finishing but before handing over to the main thread
for clean up.  The approach in v2 precludes that.
I'm sorry about the churn here.

This v3 is conceptually similar to v1 but adds a graph method specifically for
this purpose, and so there are separate functions for UpdateMainThreadState()
and SignalMainThreadCleanup().  Like v2 however, the existing locking is left
unchanged.

There is also a documentation fix for mLifecycleState to indicate the real
threads involved.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b3d9e9ec6b167b079cb48e7f0bf705e90eea688f&selectedJob=142117514
The same c1 leaks occurred on the base m-c revision:
https://treeherder.mozilla.org/logviewer.html#?job_id=142033133&repo=mozilla-central
Attachment #8922613 - Attachment is obsolete: true
Attachment #8923285 - Attachment is obsolete: true
Attachment #8925420 - Flags: review?(padenot)
Blocks: 1382366
Comment on attachment 8925420 [details] [diff] [review]
part 4 v3: move to LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP at end of iteration

Review of attachment 8925420 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm, though I don't really have much opinion on where responsibilities should lie
Attachment #8925420 - Flags: feedback+
Attachment #8925420 - Flags: review?(padenot) → review+
Comment on attachment 8925420 [details] [diff] [review]
part 4 v3: move to LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP at end of iteration

[Security approval request comment]
> How easily could an exploit be constructed based on the patch?

A exploit would require the race-winning threads to run much faster than
another thread.  At this point, the thread that would need to be slow should
not be influenced by content (because this situation occurs when the graph is
empty), and so it is very difficult to rig the race.

An attacker could trigger many of these races, and trigger many threads so as
to increase the chance that some are starved, and so I guess have a chance of
triggering this situation.

> Do comments in the patch, the check-in comment, or tests included in the
> patch paint a bulls-eye on the security problem?

Comments in the patch describe how the code works and what is changed.
They describe a change in order of execution, from which it is not too
difficult to deduce a UaF.

> Which older supported branches are affected by this flaw?

Versions since 46.

> If not all supported branches, which bug introduced the flaw?

All supported.

> Do you have backports for the affected branches?
> If not, how different, hard to create, and risky will they be?

Backport for 52 is ready.  Others would not be difficult.
Code has not changed much.

> How likely is this patch to cause regressions; how much testing does it need?

There is a moderate risk here, limited to Web Audio and WebRTC.

There are many tests, but the interaction of threads means that it is not
possible to explicitly test all possible paths.  Some paths occur sometimes,
according to timing, and so it would be reassuring to have these tests running
on m-c for a few days before uplift.
Attachment #8925420 - Flags: sec-approval?
See Also: → 1415556
sec-approval+ for checkin on 11/28 or immediately after. Do not check this in before then as that is two weeks into the 58 cycle. This won't make the 57 release as it is too late for that.
Attachment #8925420 - Flags: sec-approval? → sec-approval+
https://hg.mozilla.org/mozilla-central/rev/37ce6e1e2115
https://hg.mozilla.org/mozilla-central/rev/1b27196751a8
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Please request Beta/ESR52 approval on the patches where applicable.
Flags: needinfo?(karlt)
Whiteboard: [checkin on 11/28]
Group: media-core-security → core-security-release
Comment on attachment 8925420 [details] [diff] [review]
part 4 v3: move to LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP at end of iteration

Approval Request Comment

Only this patch need land on beta 58.
The patch attached to the bug should be used.
The changeset on m-c required conflict resolution with 59 changes.

[Feature/Bug causing the regression]:
bug 1221587
[User impact if declined]: sec-high
[Is this code covered by automated tests?]:
There are many tests, but the interaction of threads means that it is not
possible to explicitly test all possible paths.  Some paths occur sometimes,
according to timing.
[Has the fix been verified in Nightly?]:
No.  We don't have a reproducing testcase.
[Needs manual test from QE? If yes, steps to reproduce]:
No.
[List of other uplifts needed for the feature/fix]:
None.
[Is the change risky?]:
There is a moderate risk here, limited to Web Audio and WebRTC.
[Why is the change risky/not risky?]:
Some paths occur sometimes, according to timing.
[String changes made/needed]:
None.
Flags: needinfo?(karlt)
Attachment #8925420 - Flags: approval-mozilla-beta?
Comment on attachment 8926259 [details] [diff] [review]
esr52 patch (part 4 only)

[Approval Request Comment]

Only this patch need land on ESR52.

If this is not a sec:{high,crit} bug, please state case for ESR consideration:
User impact if declined:
sec-high
Fix Landed on Version:
59
Risk to taking this patch (and alternatives if risky):
Moderate
String or UUID changes made by this patch:
None.
Attachment #8926259 - Flags: approval-mozilla-esr52?
Comment on attachment 8925420 [details] [diff] [review]
part 4 v3: move to LIFECYCLE_WAITING_FOR_MAIN_THREAD_CLEANUP at end of iteration

Fix a sec-high. Beta58+ & ESR52+.
Attachment #8925420 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8926259 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
See Also: → 1415755
Whiteboard: [adv-main58+][adv-esr52.6+]
Flags: qe-verify-
Whiteboard: [adv-main58+][adv-esr52.6+] → [adv-main58+][adv-esr52.6+][post-critsmash-triage]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.