Closed Bug 1002340 Opened 11 years ago Closed 11 years ago

Shutdown crash with many mozGetUserMedia calls

Categories

(Core :: WebRTC: Audio/Video, defect)

x86_64
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla32
Tracking Status
firefox29 --- wontfix
firefox30 + verified
firefox31 + fixed
firefox32 + fixed
firefox-esr24 --- wontfix
b2g-v1.2 --- fixed
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed
seamonkey2.26 --- fixed

People

(Reporter: jruderman, Assigned: jesup)

References

Details

(4 keywords, Whiteboard: [adv-main30+])

Attachments

(7 files)

Attached file testcase
1. Load the testcase in an ASan build of Firefox 2. Quit Firefox Result: it usually crashes, but not always in the same place
A track is clearly still in use after MSG has destroyed itself. MediaEngineWebRTCAudio holds the SourceMediaStreams in an nsTArray<SourceMediaStream *>, since it has to access it off-mainthread, but protects it with a monitor and shutting down a track should Stop the source, removing it from the array. My best guess is a race to shut down all the streams, and at a guess MSG isn't waiting for confirmation they're shut down, or MediaManager is shutting down the MediaEngineWebRTCAudio instance, and not waiting for that to stop. Code investigation should point us at one. I'll start by looking at MediaManager.
Group: media-core-security
Assignee: nobody → rjesup
Ok, so this appears to be all tied to MediaEngineDefault timer data insertion into MSG. It looks like fake sources aren't being stopped synchronously (enough) to avoid timer races.
ok, I know what's happening approximately now: a) the system is overloaded with fake media elements. Audio ones in particular - to the point where the event queues appear to overload. b) I dispatch 300 MEDIA_STOP runnables to the MediaManager thread (from GetUserMediaCallbackMediaStreamListener::Invalidate()) to stop the fake streams. These runnables never run.... even though MediaManager thread continues to process events. (And I added a test for Dispatch() failure for sanity.) c) Likewise, XPCOM Shutdown() tells the MediaManager thread to shut down, but it never finishes the thread Shutdown() call, which relies on a runnable to be added to the thread and complete before it shuts down. d) switching the fake audio streams from nsITimer::TYPE_REPEATING_PRECISE to TYPE_REPEATING_SLACK fixes things. With 300 repeating_precise timers, there's no way even my fast system can insert MSG data at that rate in less than 10ms. (Ok, no way to reliably do so; it might at times) So: a) I could make fake data SLACK and if it glitches, oh well. Probably not a horrible idea. b) I can make all the fake media sources use one timer. While more efficient, it's more code/locking/etc, and the primary overhead is almost certainly the MSG insertion and resampling. And for real-world use, I don't care about fake streams to begin with, let alone a large number of them. c) The fact that this can lead to asan failures in GetGeneration and in MSG implies strongly there are some timing holes in the timer code (or elsewhere), since I shouldn't be able to provoke ASAN failures there with this code (unless something is dropping excess events instead of just making a Big Queue). Adding some people who've actually touched the timer code, such as it is. I assume a) timer events, even PRECISE, can't "jump the queue" b) there is no limit to the number of queued events bsmedberg (or bz/ehsan, etc): is this a reasonable assumption? Any ideas?
Flags: needinfo?(benjamin)
Incorporates a version of the patch for bug 988877 (MediaManager thread shutdown, which seem to not matter here - fails with and without).
> a) timer events, even PRECISE, can't "jump the queue" In what sense? > b) there is no limit to the number of queued events I believe this is true.
Ok, switching to TYPE_REPEATING_PRECISE_CAN_SKIP seems to avoid the overload from eating us alive, and will work better than SLACK. We still have a clear sec issue here with the timers, in that this bug (with the pileup of events) can cause an ASAN hit in GetGeneration. This *shouldn't* be possible with timers. We should likely spin off a sec bug on that.
(In reply to Boris Zbarsky [:bz] from comment #7) > > a) timer events, even PRECISE, can't "jump the queue" > > In what sense? As in, a PRECISE timer can't get inserted at the head of the queue (which might be a sensible API for a PRECISE timer in an abstract sense, so long as there are no critical control events it jumps over (like Shutdown or something to turn off the timer)). With how we use event queues, jumping would be bad unless there was some limitation.
PRECISE_CAN_SKIP seems to work well. However, I caught a *different* ASAN failure in DOM Workers; I'll file a bug on that
See Also: → 1006700
Bug 1006700 - also note it's an assertion, not an ASAN issue
Attachment #8418193 - Flags: review?(bzbarsky)
Comment on attachment 8418193 [details] [diff] [review] Allow fake audio timers to skip if we overrun Switching review to bsmedberg as bz is swamped at the moment
Attachment #8418193 - Flags: review?(bzbarsky) → review?(benjamin)
Attachment #8418193 - Flags: review?(benjamin) → review+
Flags: needinfo?(benjamin)
Note: I didn't ask for sec-approval on this patch in the bug. Talked with abillings on IRC. https://hg.mozilla.org/integration/mozilla-inbound/rev/f9cecb6a3439 This affects pretty much all revs. I'd suggest WONTFIX on this patch for 29 unless there's a chemspill, and probably not even then. I'm ambivalent as to whether this is needed for 30 or ESR on it's own. Mostly this is just a DoS attack; I'm going to file a followup bug for the ASAN hit (without the patch here) in Timer, since that may be a serious separate bug triggered by this. The only reason I care about uplifting this patch at all is because it's a way to trigger the timer ASAN and MSG ASAN issues. I'm still not sure how this invokes the MSG ASAN issue, so leave-opening for that. With this patch it may be impossible to hit. Also the MSG issue might be only on 31 and 32, if it depends on padenot's changes.
Flags: in-testsuite?
Target Milestone: --- → mozilla32
Test for this won't be checked in until any resulting ASAN issues are dealt with and we're clean including release
> As in, a PRECISE timer can't get inserted at the head of the queue Correct. What it can do is add more events to the queue while its callback is running, but the queue is a queue.
Can we get uplift nominations here for Aurora/Beta?
Flags: needinfo?(rjesup)
Comment on attachment 8418193 [details] [diff] [review] Allow fake audio timers to skip if we overrun [Approval Request Comment] Bug caused by (feature/regressing bug #): N/A User impact if declined: DoS possible. Biggest impact is on test especially b2g emulator. May cause other security issues to be exposed by allowing a silly amount of load to be generated. Testing completed (on m-c, etc.): On m-c for some time. Risk to taking this patch (and alternatives if risky): extremely small String or IDL/UUID changes made by this patch: none
Attachment #8418193 - Flags: approval-mozilla-beta?
Attachment #8418193 - Flags: approval-mozilla-aurora?
Flags: needinfo?(rjesup)
Attachment #8418193 - Flags: approval-mozilla-beta?
Attachment #8418193 - Flags: approval-mozilla-beta+
Attachment #8418193 - Flags: approval-mozilla-aurora?
Attachment #8418193 - Flags: approval-mozilla-aurora+
I'm going to suggest WONTFIX for ESR24 for now. It's not a direct sec bug itself, just a pretty poor form of DOS attack.
Flags: needinfo?(rjesup)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [leave-open]
Whiteboard: [adv-main30+]
Group: media-core-security
QA Contact: kamiljoz
Attached file mcCrash.txt
I went through the verification process using the following build(s): - http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-asan/latest/ (Changeset: 187089:7146e89a7b83) - http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-linux64-asan/latest/ (Changeset: 192965:d9edba5338ff) - http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-linux64-asan/latest/ (Changeset: 192431:6da3dd0a2e7b) - M-C (Crashed within a minute of start up) - Aurora (Crashed within a minute of start up) - BETA (Didn't crash) (Used Ubuntu 13.10 on VM) - Opened the test case in a single tab and waited several minutes - Opened the test case in several different tabs - Opened the test case in several windows - Opened the test case in several e10s (m-c) - Opened the test case in several "Private Windows" - Opened the test case, waited a minute and then closed fx It looks like there's still a crash with m-c and aurora. When you open the test case from comment #0 and wait for about a minute, you'll end up receiving a crash. I'm not sure if this is the same crash as originally reported but it happens on both m-c and aurora 100% of the time. It appears like it doesn't affect beta as I couldn't reproduce the crash. I had about 20 instances of the test case opened in beta and never received a single crash. I've attached the crashes for both m-c (mcCrash.txt) and aurora (auroraCrash.txt). STR: - download/install either the m-c or aurora asan build (used linux) - once installed, run fx and open the test case from comment #0 - wait about one minute and it will automatically crash Let me know if there's anymore information that I can provide! Once I get home, I'll build the latest asan's in OSX and see if it happens there also.
Randell, mind taking a quick look at the crashes reported above? The latest linux ASan builds under fx crash within a minute of opening the POC. The only one that doesn't crash is the beta build.
Flags: needinfo?(rjesup)
I can repro the MediaStreamGraph crash on nightly with the testcase. Per above, we blocked the primary vector for this, but hadn't found the asan bugs exposed by it. We hadn't been able to repro the mediastream one well; now we seem to (or more likely there were some contributing changes which re-exposed it). It does *not* seem to repro on a no-opt ASAN build, only a debug-opt build. Best guess is that the GraphImpl() is going away during the resample, which makes sense. I'll file a new bug on this issue; let's leave this bug on the event-queue-runaway issue (we have one on the timer already)
Flags: needinfo?(rjesup)
Blocks: 1022945
Fixed in SeaMonkey 2.26.1 (based on Gecko 29) with https://hg.mozilla.org/releases/mozilla-release/rev/7a22ed0a3752
So, Randell or anyone, feel free to weigh in, but it appears that this bug was not fixed on Fx31 or Fx32. The related bug 1022945 is currently being tracked for Fx33 only. Shouldn't flags for one or both bugs indicate that Fx31 and Fx32 are affected? And that we are tracking as "affected" and/or "wontfix" for one or both?
Group: core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: