Closed
Bug 1002340
Opened 11 years ago
Closed 11 years ago
Shutdown crash with many mozGetUserMedia calls
Categories
(Core :: WebRTC: Audio/Video, defect)
Tracking
()
RESOLVED
FIXED
mozilla32
People
(Reporter: jruderman, Assigned: jesup)
References
Details
(4 keywords, Whiteboard: [adv-main30+])
Attachments
(7 files)
349 bytes,
text/html
|
Details | |
23.17 KB,
text/plain
|
Details | |
26.54 KB,
text/plain
|
Details | |
7.85 KB,
patch
|
Details | Diff | Splinter Review | |
1.32 KB,
patch
|
benjamin
:
review+
Sylvestre
:
approval-mozilla-aurora+
Sylvestre
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
8.36 KB,
text/plain
|
Details | |
8.34 KB,
text/plain
|
Details |
1. Load the testcase in an ASan build of Firefox
2. Quit Firefox
Result: it usually crashes, but not always in the same place
Reporter | ||
Comment 1•11 years ago
|
||
Reporter | ||
Comment 2•11 years ago
|
||
Assignee | ||
Comment 3•11 years ago
|
||
A track is clearly still in use after MSG has destroyed itself.
MediaEngineWebRTCAudio holds the SourceMediaStreams in an nsTArray<SourceMediaStream *>, since it has to access it off-mainthread, but protects it with a monitor and shutting down a track should Stop the source, removing it from the array.
My best guess is a race to shut down all the streams, and at a guess MSG isn't waiting for confirmation they're shut down, or MediaManager is shutting down the MediaEngineWebRTCAudio instance, and not waiting for that to stop. Code investigation should point us at one. I'll start by looking at MediaManager.
Updated•11 years ago
|
Group: media-core-security
Updated•11 years ago
|
Assignee: nobody → rjesup
Assignee | ||
Comment 4•11 years ago
|
||
Ok, so this appears to be all tied to MediaEngineDefault timer data insertion into MSG. It looks like fake sources aren't being stopped synchronously (enough) to avoid timer races.
Updated•11 years ago
|
status-firefox32:
--- → affected
tracking-firefox32:
--- → +
Assignee | ||
Comment 5•11 years ago
|
||
ok, I know what's happening approximately now:
a) the system is overloaded with fake media elements. Audio ones in particular - to the point where the event queues appear to overload.
b) I dispatch 300 MEDIA_STOP runnables to the MediaManager thread (from GetUserMediaCallbackMediaStreamListener::Invalidate()) to stop the fake streams. These runnables never run.... even though MediaManager thread continues to process events. (And I added a test for Dispatch() failure for sanity.)
c) Likewise, XPCOM Shutdown() tells the MediaManager thread to shut down, but it never finishes the thread Shutdown() call, which relies on a runnable to be added to the thread and complete before it shuts down.
d) switching the fake audio streams from nsITimer::TYPE_REPEATING_PRECISE to TYPE_REPEATING_SLACK fixes things.
With 300 repeating_precise timers, there's no way even my fast system can insert MSG data at that rate in less than 10ms. (Ok, no way to reliably do so; it might at times)
So:
a) I could make fake data SLACK and if it glitches, oh well. Probably not a horrible idea.
b) I can make all the fake media sources use one timer. While more efficient, it's more code/locking/etc, and the primary overhead is almost certainly the MSG insertion and resampling. And for real-world use, I don't care about fake streams to begin with, let alone a large number of them.
c) The fact that this can lead to asan failures in GetGeneration and in MSG implies strongly there are some timing holes in the timer code (or elsewhere), since I shouldn't be able to provoke ASAN failures there with this code (unless something is dropping excess events instead of just making a Big Queue).
Adding some people who've actually touched the timer code, such as it is.
I assume
a) timer events, even PRECISE, can't "jump the queue"
b) there is no limit to the number of queued events
bsmedberg (or bz/ehsan, etc): is this a reasonable assumption? Any ideas?
Flags: needinfo?(benjamin)
Assignee | ||
Comment 6•11 years ago
|
||
Incorporates a version of the patch for bug 988877 (MediaManager thread shutdown, which seem to not matter here - fails with and without).
![]() |
||
Comment 7•11 years ago
|
||
> a) timer events, even PRECISE, can't "jump the queue"
In what sense?
> b) there is no limit to the number of queued events
I believe this is true.
Assignee | ||
Comment 8•11 years ago
|
||
Ok, switching to TYPE_REPEATING_PRECISE_CAN_SKIP seems to avoid the overload from eating us alive, and will work better than SLACK.
We still have a clear sec issue here with the timers, in that this bug (with the pileup of events) can cause an ASAN hit in GetGeneration. This *shouldn't* be possible with timers. We should likely spin off a sec bug on that.
Assignee | ||
Comment 9•11 years ago
|
||
(In reply to Boris Zbarsky [:bz] from comment #7)
> > a) timer events, even PRECISE, can't "jump the queue"
>
> In what sense?
As in, a PRECISE timer can't get inserted at the head of the queue (which might be a sensible API for a PRECISE timer in an abstract sense, so long as there are no critical control events it jumps over (like Shutdown or something to turn off the timer)). With how we use event queues, jumping would be bad unless there was some limitation.
Assignee | ||
Comment 10•11 years ago
|
||
Assignee | ||
Comment 11•11 years ago
|
||
PRECISE_CAN_SKIP seems to work well. However, I caught a *different* ASAN failure in DOM Workers; I'll file a bug on that
Assignee | ||
Comment 12•11 years ago
|
||
Bug 1006700 - also note it's an assertion, not an ASAN issue
Assignee | ||
Updated•11 years ago
|
Attachment #8418193 -
Flags: review?(bzbarsky)
Assignee | ||
Comment 13•11 years ago
|
||
Comment on attachment 8418193 [details] [diff] [review]
Allow fake audio timers to skip if we overrun
Switching review to bsmedberg as bz is swamped at the moment
Attachment #8418193 -
Flags: review?(bzbarsky) → review?(benjamin)
Updated•11 years ago
|
Attachment #8418193 -
Flags: review?(benjamin) → review+
Flags: needinfo?(benjamin)
Assignee | ||
Comment 14•11 years ago
|
||
Note: I didn't ask for sec-approval on this patch in the bug. Talked with abillings on IRC.
https://hg.mozilla.org/integration/mozilla-inbound/rev/f9cecb6a3439
This affects pretty much all revs. I'd suggest WONTFIX on this patch for 29 unless there's a chemspill, and probably not even then. I'm ambivalent as to whether this is needed for 30 or ESR on it's own. Mostly this is just a DoS attack; I'm going to file a followup bug for the ASAN hit (without the patch here) in Timer, since that may be a serious separate bug triggered by this. The only reason I care about uplifting this patch at all is because it's a way to trigger the timer ASAN and MSG ASAN issues.
I'm still not sure how this invokes the MSG ASAN issue, so leave-opening for that. With this patch it may be impossible to hit. Also the MSG issue might be only on 31 and 32, if it depends on padenot's changes.
status-b2g-v1.2:
--- → affected
status-b2g-v1.3:
--- → affected
status-b2g-v1.3T:
--- → affected
status-b2g-v1.4:
--- → affected
status-firefox29:
--- → affected
status-firefox30:
--- → affected
status-firefox31:
--- → affected
status-firefox-esr24:
--- → affected
tracking-firefox30:
--- → ?
tracking-firefox31:
--- → ?
Whiteboard: [leave-open]
Updated•11 years ago
|
Comment 15•11 years ago
|
||
Assignee | ||
Comment 16•11 years ago
|
||
Test for this won't be checked in until any resulting ASAN issues are dealt with and we're clean including release
![]() |
||
Comment 17•11 years ago
|
||
> As in, a PRECISE timer can't get inserted at the head of the queue
Correct. What it can do is add more events to the queue while its callback is running, but the queue is a queue.
Comment 18•11 years ago
|
||
Can we get uplift nominations here for Aurora/Beta?
Flags: needinfo?(rjesup)
Assignee | ||
Comment 19•11 years ago
|
||
Comment on attachment 8418193 [details] [diff] [review]
Allow fake audio timers to skip if we overrun
[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A
User impact if declined: DoS possible. Biggest impact is on test especially b2g emulator. May cause other security issues to be exposed by allowing a silly amount of load to be generated.
Testing completed (on m-c, etc.): On m-c for some time.
Risk to taking this patch (and alternatives if risky): extremely small
String or IDL/UUID changes made by this patch: none
Attachment #8418193 -
Flags: approval-mozilla-beta?
Attachment #8418193 -
Flags: approval-mozilla-aurora?
Flags: needinfo?(rjesup)
Updated•11 years ago
|
Attachment #8418193 -
Flags: approval-mozilla-beta?
Attachment #8418193 -
Flags: approval-mozilla-beta+
Attachment #8418193 -
Flags: approval-mozilla-aurora?
Attachment #8418193 -
Flags: approval-mozilla-aurora+
Comment 20•11 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/0298a65d3875
https://hg.mozilla.org/releases/mozilla-beta/rev/6879a3064321
https://hg.mozilla.org/releases/mozilla-b2g28_v1_3/rev/f99735a7b8ed
https://hg.mozilla.org/releases/mozilla-b2g26_v1_2/rev/53e59fb3f5a4
Does this bug still need to be open? Also, does comment 14 still hold for esr24?
status-b2g-v2.0:
--- → fixed
Flags: needinfo?(rjesup)
Assignee | ||
Comment 21•11 years ago
|
||
I'm going to suggest WONTFIX for ESR24 for now. It's not a direct sec bug itself, just a pretty poor form of DOS attack.
Flags: needinfo?(rjesup)
Updated•11 years ago
|
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [leave-open]
Comment 22•11 years ago
|
||
Updated•11 years ago
|
Whiteboard: [adv-main30+]
Updated•11 years ago
|
Group: media-core-security
Updated•11 years ago
|
QA Contact: kamiljoz
Comment 23•11 years ago
|
||
I went through the verification process using the following build(s):
- http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-asan/latest/ (Changeset: 187089:7146e89a7b83)
- http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-linux64-asan/latest/ (Changeset: 192965:d9edba5338ff)
- http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-linux64-asan/latest/ (Changeset: 192431:6da3dd0a2e7b)
- M-C (Crashed within a minute of start up)
- Aurora (Crashed within a minute of start up)
- BETA (Didn't crash)
(Used Ubuntu 13.10 on VM)
- Opened the test case in a single tab and waited several minutes
- Opened the test case in several different tabs
- Opened the test case in several windows
- Opened the test case in several e10s (m-c)
- Opened the test case in several "Private Windows"
- Opened the test case, waited a minute and then closed fx
It looks like there's still a crash with m-c and aurora. When you open the test case from comment #0 and wait for about a minute, you'll end up receiving a crash. I'm not sure if this is the same crash as originally reported but it happens on both m-c and aurora 100% of the time.
It appears like it doesn't affect beta as I couldn't reproduce the crash. I had about 20 instances of the test case opened in beta and never received a single crash. I've attached the crashes for both m-c (mcCrash.txt) and aurora (auroraCrash.txt).
STR:
- download/install either the m-c or aurora asan build (used linux)
- once installed, run fx and open the test case from comment #0
- wait about one minute and it will automatically crash
Let me know if there's anymore information that I can provide! Once I get home, I'll build the latest asan's in OSX and see if it happens there also.
Comment 24•11 years ago
|
||
Comment 25•11 years ago
|
||
Randell, mind taking a quick look at the crashes reported above? The latest linux ASan builds under fx crash within a minute of opening the POC. The only one that doesn't crash is the beta build.
Flags: needinfo?(rjesup)
Assignee | ||
Comment 26•11 years ago
|
||
I can repro the MediaStreamGraph crash on nightly with the testcase. Per above, we blocked the primary vector for this, but hadn't found the asan bugs exposed by it. We hadn't been able to repro the mediastream one well; now we seem to (or more likely there were some contributing changes which re-exposed it).
It does *not* seem to repro on a no-opt ASAN build, only a debug-opt build. Best guess is that the GraphImpl() is going away during the resample, which makes sense.
I'll file a new bug on this issue; let's leave this bug on the event-queue-runaway issue (we have one on the timer already)
Flags: needinfo?(rjesup)
Updated•11 years ago
|
Comment 27•11 years ago
|
||
Fixed in SeaMonkey 2.26.1 (based on Gecko 29) with
https://hg.mozilla.org/releases/mozilla-release/rev/7a22ed0a3752
status-seamonkey2.26:
--- → fixed
Comment 28•11 years ago
|
||
So, Randell or anyone, feel free to weigh in, but it appears that this bug was not fixed on Fx31 or Fx32.
The related bug 1022945 is currently being tracked for Fx33 only.
Shouldn't flags for one or both bugs indicate that Fx31 and Fx32 are affected? And that we are tracking as "affected" and/or "wontfix" for one or both?
Updated•9 years ago
|
Group: core-security
You need to log in
before you can comment on or make changes to this bug.
Description
•