Closed Bug 1002340 Opened 11 years ago Closed 11 years ago

Shutdown crash with many mozGetUserMedia calls

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla32

Tracking Flags:

Tracking

Status

firefox29

---

wontfix

firefox30

verified

firefox31

fixed

firefox32

fixed

firefox-esr24

---

wontfix

b2g-v1.2

---

fixed

b2g-v1.3

---

fixed

b2g-v1.3T

---

fixed

b2g-v1.4

---

fixed

b2g-v2.0

---

fixed

seamonkey2.26

---

fixed

People

(Reporter: jruderman, Assigned: jesup)

References

Details

(4 keywords, Whiteboard: [adv-main30+])

Attachments

(7 files)

testcase 11 years ago Jesse Ruderman 349 bytes, text/html		Details
Crash #1 [@ nsTimerImpl::GetGeneration] 11 years ago Jesse Ruderman 23.17 KB, text/plain		Details
Crash #2 [@ mozilla::MediaStreamGraphImpl::AudioSampleRate] 11 years ago Jesse Ruderman 26.54 KB, text/plain		Details
Patches I'm using to test this NOT FOR CHECKIN 11 years ago Randell Jesup [:jesup] (needinfo me) 7.85 KB, patch		Details \| Diff \| Splinter Review
Allow fake audio timers to skip if we overrun 11 years ago Randell Jesup [:jesup] (needinfo me) 1.32 KB, patch	benjamin : review+ Sylvestre : approval-mozilla-aurora+ Sylvestre : approval-mozilla-beta+	Details \| Diff \| Splinter Review
mcCrash.txt 11 years ago Kamil Jozwiak [:kjozwiak] 8.36 KB, text/plain		Details
auroraCrash.txt 11 years ago Kamil Jozwiak [:kjozwiak] 8.34 KB, text/plain		Details

Jesse Ruderman

Reporter

Description

•

11 years ago

Attached file testcase — Details

1. Load the testcase in an ASan build of Firefox 2. Quit Firefox Result: it usually crashes, but not always in the same place

Jesse Ruderman

Reporter

Comment 1

•

11 years ago

Attached file Crash #1 [@ nsTimerImpl::GetGeneration] — Details

Jesse Ruderman

Reporter

Comment 2

•

11 years ago

Attached file Crash #2 [@ mozilla::MediaStreamGraphImpl::AudioSampleRate] — Details

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 3

•

11 years ago

A track is clearly still in use after MSG has destroyed itself. MediaEngineWebRTCAudio holds the SourceMediaStreams in an nsTArray<SourceMediaStream *>, since it has to access it off-mainthread, but protects it with a monitor and shutting down a track should Stop the source, removing it from the array. My best guess is a race to shut down all the streams, and at a guess MSG isn't waiting for confirmation they're shut down, or MediaManager is shutting down the MediaEngineWebRTCAudio instance, and not waiting for that to stop. Code investigation should point us at one. I'll start by looking at MediaManager.

Daniel Veditz [:dveditz]

Updated

•

11 years ago

Group: media-core-security

Maire Reavy [:mreavy]

Updated

•

11 years ago

Assignee: nobody → rjesup

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 4

•

11 years ago

Ok, so this appears to be all tied to MediaEngineDefault timer data insertion into MSG. It looks like fake sources aren't being stopped synchronously (enough) to avoid timer races.

Al Billings [:abillings - ex-MoCo]

Updated

•

11 years ago

status-firefox32: --- → affected

tracking-firefox32: --- → +

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 5

•

11 years ago

ok, I know what's happening approximately now: a) the system is overloaded with fake media elements. Audio ones in particular - to the point where the event queues appear to overload. b) I dispatch 300 MEDIA_STOP runnables to the MediaManager thread (from GetUserMediaCallbackMediaStreamListener::Invalidate()) to stop the fake streams. These runnables never run.... even though MediaManager thread continues to process events. (And I added a test for Dispatch() failure for sanity.) c) Likewise, XPCOM Shutdown() tells the MediaManager thread to shut down, but it never finishes the thread Shutdown() call, which relies on a runnable to be added to the thread and complete before it shuts down. d) switching the fake audio streams from nsITimer::TYPE_REPEATING_PRECISE to TYPE_REPEATING_SLACK fixes things. With 300 repeating_precise timers, there's no way even my fast system can insert MSG data at that rate in less than 10ms. (Ok, no way to reliably do so; it might at times) So: a) I could make fake data SLACK and if it glitches, oh well. Probably not a horrible idea. b) I can make all the fake media sources use one timer. While more efficient, it's more code/locking/etc, and the primary overhead is almost certainly the MSG insertion and resampling. And for real-world use, I don't care about fake streams to begin with, let alone a large number of them. c) The fact that this can lead to asan failures in GetGeneration and in MSG implies strongly there are some timing holes in the timer code (or elsewhere), since I shouldn't be able to provoke ASAN failures there with this code (unless something is dropping excess events instead of just making a Big Queue). Adding some people who've actually touched the timer code, such as it is. I assume a) timer events, even PRECISE, can't "jump the queue" b) there is no limit to the number of queued events bsmedberg (or bz/ehsan, etc): is this a reasonable assumption? Any ideas?

Flags: needinfo?(benjamin)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 6

•

11 years ago

Attached patch Patches I'm using to test this NOT FOR CHECKIN — Details — Splinter Review

Incorporates a version of the patch for bug 988877 (MediaManager thread shutdown, which seem to not matter here - fails with and without).

Boris Zbarsky [:bzbarsky]

Comment 7

•

11 years ago

> a) timer events, even PRECISE, can't "jump the queue" In what sense? > b) there is no limit to the number of queued events I believe this is true.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 8

•

11 years ago

Ok, switching to TYPE_REPEATING_PRECISE_CAN_SKIP seems to avoid the overload from eating us alive, and will work better than SLACK. We still have a clear sec issue here with the timers, in that this bug (with the pileup of events) can cause an ASAN hit in GetGeneration. This *shouldn't* be possible with timers. We should likely spin off a sec bug on that.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 9

•

11 years ago

(In reply to Boris Zbarsky [:bz] from comment #7) > > a) timer events, even PRECISE, can't "jump the queue" > > In what sense? As in, a PRECISE timer can't get inserted at the head of the queue (which might be a sensible API for a PRECISE timer in an abstract sense, so long as there are no critical control events it jumps over (like Shutdown or something to turn off the timer)). With how we use event queues, jumping would be bad unless there was some limitation.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 10

•

11 years ago

Attached patch Allow fake audio timers to skip if we overrun — Details — Splinter Review

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 11

•

11 years ago

PRECISE_CAN_SKIP seems to work well. However, I caught a *different* ASAN failure in DOM Workers; I'll file a bug on that

Randell Jesup [:jesup] (needinfo me)

Assignee

Updated

•

11 years ago

Comment 12

•

11 years ago

Bug 1006700 - also note it's an assertion, not an ASAN issue

Randell Jesup [:jesup] (needinfo me)

Assignee

Updated

•

11 years ago

Attachment #8418193 - Flags: review?(bzbarsky)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 13

•

11 years ago

Comment on attachment 8418193 [details] [diff] [review] Allow fake audio timers to skip if we overrun Switching review to bsmedberg as bz is swamped at the moment

Attachment #8418193 - Flags: review?(bzbarsky) → review?(benjamin)

Benjamin Smedberg

Updated

•

11 years ago

Attachment #8418193 - Flags: review?(benjamin) → review+

Flags: needinfo?(benjamin)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 14

•

11 years ago

Note: I didn't ask for sec-approval on this patch in the bug. Talked with abillings on IRC. https://hg.mozilla.org/integration/mozilla-inbound/rev/f9cecb6a3439 This affects pretty much all revs. I'd suggest WONTFIX on this patch for 29 unless there's a chemspill, and probably not even then. I'm ambivalent as to whether this is needed for 30 or ESR on it's own. Mostly this is just a DoS attack; I'm going to file a followup bug for the ASAN hit (without the patch here) in Timer, since that may be a serious separate bug triggered by this. The only reason I care about uplifting this patch at all is because it's a way to trigger the timer ASAN and MSG ASAN issues. I'm still not sure how this invokes the MSG ASAN issue, so leave-opening for that. With this patch it may be impossible to hit. Also the MSG issue might be only on 31 and 32, if it depends on padenot's changes.

status-b2g-v1.2: --- → affected

status-b2g-v1.3: --- → affected

status-b2g-v1.3T: --- → affected

status-b2g-v1.4: --- → affected

status-firefox29: --- → affected

status-firefox30: --- → affected

status-firefox31: --- → affected

status-firefox-esr24: --- → affected

tracking-firefox30: --- → ?

tracking-firefox31: --- → ?

Whiteboard: [leave-open]

Al Billings [:abillings - ex-MoCo]

Updated

•

11 years ago

status-firefox29: affected → wontfix

tracking-firefox30: ? → +

tracking-firefox31: ? → +

Carsten Book [:Tomcat]

Comment 15

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/f9cecb6a3439

status-firefox32: affected → fixed

Flags: in-testsuite?

Target Milestone: --- → mozilla32

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 16

•

11 years ago

Test for this won't be checked in until any resulting ASAN issues are dealt with and we're clean including release

Boris Zbarsky [:bzbarsky]

Comment 17

•

11 years ago

> As in, a PRECISE timer can't get inserted at the head of the queue Correct. What it can do is add more events to the queue while its callback is running, but the queue is a queue.

Lukas Blakk [:lsblakk] use ?needinfo

Comment 18

•

11 years ago

Can we get uplift nominations here for Aurora/Beta?

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 19

•

11 years ago

Comment on attachment 8418193 [details] [diff] [review] Allow fake audio timers to skip if we overrun [Approval Request Comment] Bug caused by (feature/regressing bug #): N/A User impact if declined: DoS possible. Biggest impact is on test especially b2g emulator. May cause other security issues to be exposed by allowing a silly amount of load to be generated. Testing completed (on m-c, etc.): On m-c for some time. Risk to taking this patch (and alternatives if risky): extremely small String or IDL/UUID changes made by this patch: none

Attachment #8418193 - Flags: approval-mozilla-beta?

Attachment #8418193 - Flags: approval-mozilla-aurora?

Flags: needinfo?(rjesup)

Sylvestre Ledru [:Sylvestre]

Updated

•

11 years ago

Attachment #8418193 - Flags: approval-mozilla-beta?

Attachment #8418193 - Flags: approval-mozilla-beta+

Attachment #8418193 - Flags: approval-mozilla-aurora?

Attachment #8418193 - Flags: approval-mozilla-aurora+

Ryan VanderMeulen [:RyanVM]

Comment 20

•

11 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/0298a65d3875 https://hg.mozilla.org/releases/mozilla-beta/rev/6879a3064321 https://hg.mozilla.org/releases/mozilla-b2g28_v1_3/rev/f99735a7b8ed https://hg.mozilla.org/releases/mozilla-b2g26_v1_2/rev/53e59fb3f5a4 Does this bug still need to be open? Also, does comment 14 still hold for esr24?

status-b2g-v1.2: affected → fixed

status-b2g-v1.3: affected → fixed

status-b2g-v2.0: --- → fixed

status-firefox30: affected → fixed

status-firefox31: affected → fixed

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 21

•

11 years ago

I'm going to suggest WONTFIX for ESR24 for now. It's not a direct sec bug itself, just a pretty poor form of DOS attack.

Flags: needinfo?(rjesup)

Ryan VanderMeulen [:RyanVM]

Updated

•

11 years ago

Status: NEW → RESOLVED

Closed: 11 years ago

status-firefox-esr24: affected → wontfix

Resolution: --- → FIXED

Whiteboard: [leave-open]

Ryan VanderMeulen [:RyanVM]

Comment 22

•

11 years ago

https://hg.mozilla.org/releases/mozilla-b2g30_v1_4/rev/6879a3064321

status-b2g-v1.3T: affected → fixed

status-b2g-v1.4: affected → fixed

Al Billings [:abillings - ex-MoCo]

Updated

•

11 years ago

Whiteboard: [adv-main30+]

Daniel Veditz [:dveditz]

Updated

•

11 years ago

Group: media-core-security

Matt Wobensmith [:mwobensmith][:matt:]

Updated

•

11 years ago

QA Contact: kamiljoz

Kamil Jozwiak [:kjozwiak]

Comment 23

•

11 years ago

Attached file mcCrash.txt — Details

I went through the verification process using the following build(s): - http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-asan/latest/ (Changeset: 187089:7146e89a7b83) - http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-linux64-asan/latest/ (Changeset: 192965:d9edba5338ff) - http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-linux64-asan/latest/ (Changeset: 192431:6da3dd0a2e7b) - M-C (Crashed within a minute of start up) - Aurora (Crashed within a minute of start up) - BETA (Didn't crash) (Used Ubuntu 13.10 on VM) - Opened the test case in a single tab and waited several minutes - Opened the test case in several different tabs - Opened the test case in several windows - Opened the test case in several e10s (m-c) - Opened the test case in several "Private Windows" - Opened the test case, waited a minute and then closed fx It looks like there's still a crash with m-c and aurora. When you open the test case from comment #0 and wait for about a minute, you'll end up receiving a crash. I'm not sure if this is the same crash as originally reported but it happens on both m-c and aurora 100% of the time. It appears like it doesn't affect beta as I couldn't reproduce the crash. I had about 20 instances of the test case opened in beta and never received a single crash. I've attached the crashes for both m-c (mcCrash.txt) and aurora (auroraCrash.txt). STR: - download/install either the m-c or aurora asan build (used linux) - once installed, run fx and open the test case from comment #0 - wait about one minute and it will automatically crash Let me know if there's anymore information that I can provide! Once I get home, I'll build the latest asan's in OSX and see if it happens there also.

Kamil Jozwiak [:kjozwiak]

Comment 24

•

11 years ago

Attached file auroraCrash.txt — Details

Kamil Jozwiak [:kjozwiak]

Comment 25

•

11 years ago

Randell, mind taking a quick look at the crashes reported above? The latest linux ASan builds under fx crash within a minute of opening the POC. The only one that doesn't crash is the beta build.

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 26

•

11 years ago

I can repro the MediaStreamGraph crash on nightly with the testcase. Per above, we blocked the primary vector for this, but hadn't found the asan bugs exposed by it. We hadn't been able to repro the mediastream one well; now we seem to (or more likely there were some contributing changes which re-exposed it). It does *not* seem to repro on a no-opt ASAN build, only a debug-opt build. Best guess is that the GraphImpl() is going away during the resample, which makes sense. I'll file a new bug on this issue; let's leave this bug on the event-queue-runaway issue (we have one on the timer already)

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Assignee

Updated

•

11 years ago

Blocks: 1022945

Kamil Jozwiak [:kjozwiak]

Updated

•

11 years ago

status-firefox30: fixed → verified

Justin Wood (:Callek)

Comment 27

•

11 years ago

Fixed in SeaMonkey 2.26.1 (based on Gecko 29) with https://hg.mozilla.org/releases/mozilla-release/rev/7a22ed0a3752

status-seamonkey2.26: --- → fixed

Matt Wobensmith [:mwobensmith][:matt:]

Comment 28

•

11 years ago

So, Randell or anyone, feel free to weigh in, but it appears that this bug was not fixed on Fx31 or Fx32. The related bug 1022945 is currently being tracked for Fx33 only. Shouldn't flags for one or both bugs indicate that Fx31 and Fx32 are affected? And that we are tracking as "affected" and/or "wontfix" for one or both?

Daniel Veditz [:dveditz]

Updated

•

9 years ago

Group: core-security

You need to log in before you can comment on or make changes to this bug.