Closed Bug 691096 Opened 13 years ago Closed 13 years ago

Non-main-thread stack exhaustion crashes with hundreds of <audio autoplay>

Categories

(Core :: Audio/Video, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla10
Tracking Status
firefox7 - wontfix
firefox8 - wontfix
firefox9 - affected
firefox10 - affected
firefox11 --- affected
firefox12 --- verified
firefox-esr10 - wontfix
status1.9.2 --- wontfix

People

(Reporter: jruderman, Assigned: cajbir)

References

Details

(Keywords: crash, testcase, Whiteboard: [sg:dos][qa!])

Attachments

(5 files, 2 obsolete files)

No description provided.
Attached file crash report #1
Attached file crash report #2
This testcase usually causes Firefox to crash after a few seconds, or during shutdown if you hit Cmd+Q fast enough. Once, it made the Pandora AIR app stop working. And *something* crashed my OS last night; I don't know if it was related to this.
If I Cmd+F in "crash report #1" for nsThread::ThreadFunc I notice there's often one random-looking function in the middle of each stack. Ted, is this a breakpad bug? It confused my "is this crash known" routine, making this bug harder to find.
Do you mean like: 20 libnspr4.dylib!_pt_root [ptthread.c:f25928e4847d : 187 + 0x6] ? If so, I think that's just an artifact of you not having symbols for your system libraries.
Also, ouch: "Thread 325"
Ted, the parts I'm asking about are like > 14 XUL!NS_ProcessNextEvent_P [nsThreadUtils.cpp:f25928e4847d : 245 + 0x1b] > 15 XUL!nsXULScrollFrame::DoLayout [nsGfxScrollFrame.cpp:f25928e4847d : 1276 + 0x9] > 16 XUL!nsThread::ThreadFunc [nsThread.cpp:f25928e4847d : 272 + 0xd] and > 14 XUL!NS_ProcessNextEvent_P [nsThreadUtils.cpp:f25928e4847d : 245 + 0x1b] > 15 XUL!nsRefPtr<TableRowsCollection>::assign_with_AddRef [nsAutoPtr.h : 878 + 0x6] > 16 XUL!nsThread::ThreadFunc [nsThread.cpp:f25928e4847d : 272 + 0xd] sometimes #15 happens to be something on my list of functions that are known to contribute to crashes.
I haven't looked at your test case yet (network issues where I am) but it's pretty trivial to crash firefox with creating large numbers of audio elements. For example, the following crashes my firefox immediately on page load: + <html> + <body> + <script type="text/javascript"> + function playaudio() { + var a = document.getElementById('a'); + for(var i = 0; i < 1000; ++i) { + var b = a.cloneNode(true); + b.play(); + } + } + </script> + <audio id='a' style='display:none' src='shot.ogg' preload></audio> + <button onclick='playaudio()'>Play</button> + </html> The crash caused by this is due to running out of address space. I've raised this issue within the media group before and the response has been that it's being addressed with things like de-threadification, improved audio backend, etc.
My crashes don't seem to be running-out-of-address-space crashes. They look more like wild reads and wild jumps. 64-bit ftw?
We could just create a pref which sets a limit on the number of media threads we allow at once, and when we hit that limit we delay thread creation until we drop below the limit again. That would increase playback latency in cases like these, but at least we'd not crash. The problem with this approach is that we currently can't destroy audio play threads once we've started playing until playback is finished (we can't destroy them when playback is paused). We can destroy decode threads when playback is paused however, so you could render audio unplayable by creating a lot of audio elements, playing them, and then pausing them. Eventually the paused audio threads would hog the thread limit, and no new decoder threads could be created, so we couldn't play anything new to finish playback and free up more threads. We'd need the new audio backend landed, and our audio thread/playback refactored to make the above suggestion feasible.
I think we can limit the number of decode threads and audio threads to, say, 100 *each*. When we fix the audio backend stuff the audio limit can go away. The limit on decode threads will stick around; if you have hundreds of non-paused actively decoding elements, too bad... That seems pretty extreme.
Setting an arbitrary limit might stop this testcase from crashing, but probably won't make it impossible to crash due to this bug. I'd be more comfortable with a limit if you debugged this first.
Needs an owner, guessing at which Chris (please reassign if I've guessed wrong).
Assignee: nobody → chris
Whiteboard: [sg:critical?]
In the short run can we add a simple global cap to the number of audio streams?
Chris, any chance of a simple limit here that's safe enough for beta and aurora?
Running the test case on 64 bit linux I get a crash due to out of memory - which I assume is due to virtual address space running out.
That's a relatively reasonable behavior. On Mac I still get wild crashes. If you decide to add a cap to the number of audio streams, please do it in a separate (and public) bug report.
I have a mac at home - I'll test on that and see what I get. While debugging I saw recursive calls into the event loop at times. StopDecodeThread calls mDecodeThread->Shutdown which spins the event loop, and StopDecodeThread gets called again. Possibly related?
(In reply to Chris Double (:doublec) from comment #18) > I have a mac at home - I'll test on that and see what I get. While debugging > I saw recursive calls into the event loop at times. StopDecodeThread calls > mDecodeThread->Shutdown which spins the event loop, and StopDecodeThread > gets called again. Possibly related? Yeah possibly. Maybe we should refactor the threads to post events to shut them selves down when their shutdown sentinel is set. Then we wouldn't be using up stack frames when shutting down threads.
I was able to get a crash with a stack similar to attachment 563990 [details] on 64 bit linux after a few tries. Main thread stack trace: #0 0x00007fcfe8452a9c in mozilla::BlockingResourceBase::CheckAcquire (this=0x0, aCallContext=...) at /src/mozilla/mcgit/obj-x86_64-unknown-linux-gnu/xpcom/build/BlockingResourceBase.cpp:129 #1 0x00007fcfe8453166 in mozilla::Mutex::Lock (this=0x7fcf33cc63c0) at /src/mozilla/mcgit/obj-x86_64-unknown-linux-gnu/xpcom/build/BlockingResourceBase.cpp:260 #2 0x00007fcfe6ccf1f6 in mozilla::MutexAutoLock::MutexAutoLock (this=0x7fcfddd22100, aLock=..., _notifier=...) at ../../dist/include/mozilla/Mutex.h:184 #3 0x00007fcfe84c019b in nsThread::Shutdown (this=0x7fcf33cc63a0) at /src/mozilla/mcgit/xpcom/threads/nsThread.cpp:470 #4 0x00007fcfe7a48ea2 in nsBuiltinDecoderStateMachine::StopDecodeThread (this=0x7fcfd8999f70) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1035 #5 0x00007fcfe7a4b3c5 in nsBuiltinDecoderStateMachine::RunStateMachine (this=0x7fcfd8999f70) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1540 #6 0x00007fcfe7a4ce06 in nsBuiltinDecoderStateMachine::CallRunStateMachine (this=0x7fcfd8999f70) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1914 #7 0x00007fcfe7a4cf59 in nsBuiltinDecoderStateMachine::TimeoutExpired (this=0x7fcfd8999f70) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1941 #8 0x00007fcfe7a4cea3 in TimeoutExpired (aTimer=0x7fcf34065820, aClosure=0x7fcfd8999f70) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1929 #9 0x00007fcfe84c824a in nsTimerImpl::Fire (this=0x7fcf34065820) at /src/mozilla/mcgit/xpcom/threads/nsTimerImpl.cpp:424 #10 0x00007fcfe84c86ed in nsTimerEvent::Run (this=0x7fcf86bb3970) at /src/mozilla/mcgit/xpcom/threads/nsTimerImpl.cpp:520 #11 0x00007fcfe84c07a1 in nsThread::ProcessNextEvent (this=0x7fcfcf1a0d00, mayWait=true, result=0x7fcfddd2250f) at /src/mozilla/mcgit/xpcom/threads/nsThread.cpp:631 #12 0x00007fcfe844fecf in NS_ProcessNextEvent_P (thread=0x7fcfcf1a0d00, mayWait=true) at /src/mozilla/mcgit/obj-x86_64-unknown-linux-gnu/xpcom/build/nsThreadUtils.cpp:245 #13 0x00007fcfe84c02a2 in nsThread::Shutdown (this=0x7fcf33cc6100) at /src/mozilla/mcgit/xpcom/threads/nsThread.cpp:494 #14 0x00007fcfe7a48ea2 in nsBuiltinDecoderStateMachine::StopDecodeThread (this=0x7fcfd899a600) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1035 #15 0x00007fcfe7a4b3c5 in nsBuiltinDecoderStateMachine::RunStateMachine (this=0x7fcfd899a600) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1540 #16 0x00007fcfe7a4ce06 in nsBuiltinDecoderStateMachine::CallRunStateMachine (this=0x7fcfd899a600) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1914 #17 0x00007fcfe7a4cf59 in nsBuiltinDecoderStateMachine::TimeoutExpired (this=0x7fcfd899a600) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1941 #18 0x00007fcfe7a4cea3 in TimeoutExpired (aTimer=0x7fcf34065700, aClosure=0x7fcfd899a600) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1929 #19 0x00007fcfe84c824a in nsTimerImpl::Fire (this=0x7fcf34065700) at /src/mozilla/mcgit/xpcom/threads/nsTimerImpl.cpp:424 #20 0x00007fcfe84c86ed in nsTimerEvent::Run (this=0x7fcf86b4eca0) at /src/mozilla/mcgit/xpcom/threads/nsTimerImpl.cpp:520 #21 0x00007fcfe84c07a1 in nsThread::ProcessNextEvent (this=0x7fcfcf1a0d00, mayWait=true, result=0x7fcfddd2294f) at /src/mozilla/mcgit/xpcom/threads/nsThread.cpp:631 #22 0x00007fcfe844fecf in NS_ProcessNextEvent_P (thread=0x7fcfcf1a0d00, mayWait=true) at /src/mozilla/mcgit/obj-x86_64-unknown-linux-gnu/xpcom/build/nsThreadUtils.cpp:245 #23 0x00007fcfe84c02a2 in nsThread::Shutdown (this=0x7fcf34dedf20) at /src/mozilla/mcgit/xpcom/threads/nsThread.cpp:494 .... #1154 0x00007fcfe7a48ea2 in nsBuiltinDecoderStateMachine::StopDecodeThread (this=0x7fcfc9a5e0c0) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1035 #1155 0x00007fcfe7a4b3c5 in nsBuiltinDecoderStateMachine::RunStateMachine (this=0x7fcfc9a5e0c0) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1540 #1156 0x00007fcfe7a4ce06 in nsBuiltinDecoderStateMachine::CallRunStateMachine (this=0x7fcfc9a5e0c0) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1914 #1157 0x00007fcfe7a4cf59 in nsBuiltinDecoderStateMachine::TimeoutExpired (this=0x7fcfc9a5e0c0) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1941 #1158 0x00007fcfe7a4cea3 in TimeoutExpired (aTimer=0x7fcf8b585da0, aClosure=0x7fcfc9a5e0c0) at /src/mozilla/mcgit/content/media/nsBuiltinDecoderStateMachine.cpp:1929 #1159 0x00007fcfe84c824a in nsTimerImpl::Fire (this=0x7fcf8b585da0) at /src/mozilla/mcgit/xpcom/threads/nsTimerImpl.cpp:424 #1160 0x00007fcfe84c86ed in nsTimerEvent::Run (this=0x7fcf86b5edc0) at /src/mozilla/mcgit/xpcom/threads/nsTimerImpl.cpp:520 #1161 0x00007fcfe84c07a1 in nsThread::ProcessNextEvent (this=0x7fcfcf1a0d00, mayWait=true, result=0x7fcfddd40dcf) at /src/mozilla/mcgit/xpcom/threads/nsThread.cpp:631 #1162 0x00007fcfe844fecf in NS_ProcessNextEvent_P (thread=0x7fcfcf1a0d00, mayWait=true) at /src/mozilla/mcgit/obj-x86_64-unknown-linux-gnu/xpcom/build/nsThreadUtils.cpp:245 #1163 0x00007fcfe84bf67c in nsThread::ThreadFunc (arg=0x7fcfcf1a0d00) at /src/mozilla/mcgit/xpcom/threads/nsThread.cpp:272 #1164 0x00007fcfeb4e568b in _pt_root (arg=0x7fcfcd2f6670) at /src/mozilla/mcgit/nsprpub/pr/src/pthreads/ptthread.c:187 #1165 0x00007fcfec620971 in start_thread (arg=<value optimized out>) at pthread_create.c:304 #1166 0x00007fcfeb9d992d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #1167 0x0000000000000000 in ?? () Note the deeply nested stack and the spinning of the event loop. Other things I noticed - when I limited the number of decoder threads to anything greater than about 30 I'd get an assertion in nsBuiltinDecoderStateMachine::~nsBuiltinDecoderStateMachine() that we're not on the main thread.
In what way is comment 20 similar to attachment 563990 [details]?
Refactoring things so that we don't have deeply nested stacks as comment 20 then I get the test case running to completion sometimes (vs always crashing). But I get the same assertion I noted at the end of the comment about nsBuiltinDecoderStateMachine::~nsBuiltinDecoderStateMachine() not being on the main thread and any crashes that happen happen in nsCOMPtr<nsBuiltinDecoder>::nsCOMPtr.
(resetting status flags which I'm guessing got changed by mistake)
Removing the setting of stack size of the gStateMachineThread in nsBuiltinDecoderStateMachine::nsBuiltinDecoderStateMachine stops the crash for me in this test case - no doubt a higher iteration number will still crash, which seems to point to the stack frame usage during the event loop spinning in in the Shutdown calls in StopAudioThread and StopDecodeThread being the issue. Initial attempts at limiting the number of Decode threads created still result in non-main thread destruction of nsBuiltinDecoderStateMachine. This seems to occur in the runnable method associated with nsBuiltinDecoderStateMachine::AudioLoop, created in StartAudioThread: nsCOMPtr<nsIRunnable> event = NS_NewRunnableMethod(this, &nsBuiltinDecoderStateMachine::AudioLoop); mAudioThread->Dispatch(event, NS_DISPATCH_NORMAL); I assume the AudioLoop is exiting after there are no references held to the state machine somehow.
Assignee: chris → chris.double
Attached patch Fix (obsolete) — Splinter Review
* Increases the stack size of the single global state machine thread back to the default instead of the minimal media thread size. This stops the crash for the test case due to nesting too deeply and blowing the stack. * Limits the number of concurrent decoder threads to 50, to prevent too many threads blowing the stack in the future. The '50' was an arbitrarily chosen number - I can change it to whatever if needed. Will attach a crashtest patch shortly.
Attachment #570588 - Flags: review?(chris)
Can you explain how that fixes the security hole? Were those "wild crashes" actually hitting a stack limit guard page? Or are we missing guard pages? I wish crash reporters would describe the nature of the crash address :/
The patch fixes the crashes I'm seeing when running your test case. The 'single global state machine thread' had a minimal thread stack size set when it was created. This thread ends up having a deep recursive call due to spinning the event loop when calling shutdown on decoder threads. The depth of the stack at this point is in proportion to the number of decoder threads being shutdown. With a large number of threads this is growing past the size of the minimal stack and causing corruption. I do not know if stack limit guard pages exist if the stack size is changed to the minimal. I'm working on the assumption that this is causing the "wild crashes" you are seeing. If I change the minimal stack size back to normal the issue does not happen which seems to fit the nature of the problem to me. Of course even this large default stack size can be blown due to the recursive calls if enough concurrent threads are waiting to shutdown. So I limit the number of threads that can be created to prevent this. If the "wild crashes" are the security hole you refer to in comment 26 then I think this patch fixes it but I still have to test on a Mac system like you originally encountered the problem. Unfortunately our Mac laptops were stolen from our office recently so I can't test on the hardware you had until I do the crashtest patch and run that on tryserver. Please let me know if that's enough for you to feel that the issue is being addressed. If not, please tell me what more you need to feel assured.
Thanks. It's plausible enough that it's the same issue I was hitting. I wish I knew what was up with the debugger and/or guard pages, but it's not fair for me to demand that from the audio team.
In "Crash report #1" the instruction pointer and crash address are different, so I suspect we're missing end-of-stack guard pages :(
The "Crash address" varies per-platform, but typically it means the address where the fault occurred, so the bad read/write location for a segfault etc.
> In "Crash report #1" the instruction pointer and crash address are different, so I > suspect we're missing end-of-stack guard pages :( Filed bug 698528.
Attached patch crashtestSplinter Review
Adds Jesse's test as a crash test.
Attachment #570901 - Flags: review?(chris)
Comment on attachment 570588 [details] [diff] [review] Fix Review of attachment 570588 [details] [diff] [review]: ----------------------------------------------------------------- Looks good. r=cpearce with minor changes. ::: content/media/nsBuiltinDecoderStateMachine.cpp @@ +185,5 @@ > +// only be one instance of this class. > +class StateMachineTracker > +{ > +public: > + StateMachineTracker() : Can constructor and destructor be private to enforce singleton-ness? @@ +217,5 @@ > + // Destroy global state machine thread if required. > + // Call on main thread only. > + void CleanupGlobalStateMachine(); > + > + // Return the global state machine thread. Call on main thread only (?) @@ +225,5 @@ > + // than this number are active the thread creation will fail. > + static const PRUint32 MAX_DECODE_THREADS = 50; > + > + // Returns the number of active decode threads. > + // Call o any thread. Holds the internal monitor so don't s/o/on/ @@ +1206,5 @@ > mDecoder->GetReentrantMonitor().AssertCurrentThreadIn(); > + PRUint32 count = 0; > + bool created = false; > + { > + ReentrantMonitorAutoExit mon(mDecoder->GetReentrantMonitor()); Indentation is too deep here. @@ +1218,5 @@ > + if (!mDecodeThread && count > StateMachineTracker::MAX_DECODE_THREADS) { > + mState = DECODER_STATE_SHUTDOWN; > + // Have to run one iteration of the state machine loop to ensure the > + // shutdown state is processed. > + CallRunStateMachine(); You should use ScheduleStateMachine() rather than calling CallStateMachine() directly. If something else has already called ScheduleStateMachine() before this code runs, you could end up having the state machine run again after this code runs to shut down the state machine. You need to call ScheduleStateMachine() before changing to SHUTDOWN state though. @@ +1230,5 @@ > if (NS_FAILED(rv)) { > mState = DECODER_STATE_SHUTDOWN; > + // Have to run one iteration of the state machine loop to ensure the > + // shutdown state is processed. > + CallRunStateMachine(); s/CallStateMachine/ScheduleStateMachine/ @@ +2155,5 @@ > } else if (!mDispatchedRunEvent) { > // We're not currently running this state machine on the state machine > // thread. Dispatch an event to run one cycle of the state machine. > mDispatchedRunEvent = true; > + return StateMachineTracker::Instance().GetGlobalStateMachineThread()->Dispatch(this, NS_DISPATCH_NORMAL); How about using GetStateMachineThread() instead of StateMachineTracker::Instance().GetGlobalStateMachineThread() to reduce line length?
Attachment #570588 - Flags: review?(chris) → review+
Attachment #570901 - Flags: review?(chris) → review+
Attached patch Fix + review comments addressed (obsolete) — Splinter Review
Addresses review comments, carries r+ forward. For the GetGlobalStateMachineThread I guarded it with the mutex instead of changing the comment to main thread only as it's used on other threads.
Attachment #570588 - Attachment is obsolete: true
Attachment #571193 - Flags: review+
Flags: in-testsuite+
OS: Mac OS X → All
Hardware: x86_64 → All
Target Milestone: --- → mozilla10
Looks like this was backed out of mozilla-inbound due to android crashtest orange: https://hg.mozilla.org/integration/mozilla-inbound/rev/c8b43b9b0398
I guess the limit of 50 is too large for android. I'll do some on-device testing tomorrow and see what's a reasonable number for that platform.
Attached patch FixSplinter Review
Testing on try server shows that 25 threads works fine, but 30 threads causes the crash test to fail on android so I've set the limit to 25.
Attachment #571193 - Attachment is obsolete: true
Attachment #572671 - Flags: review+
Status: NEW → ASSIGNED
Whiteboard: [sg:critical?] → [sg:critical]
Blocks: 700386
Chris, is this something we can take push for aurora, and even beta?
Whiteboard: [sg:critical] → [sg:critical][qa+]
Yes, I think it's safe for Aurora/Beta.
Attachment #572671 - Flags: approval-mozilla-beta?
Comment on attachment 572671 [details] [diff] [review] Fix [Triage Comment] Let's take this on both Aurora and Beta given the sg:crit status and bake time on m-c.
Attachment #572671 - Flags: approval-mozilla-beta?
Attachment #572671 - Flags: approval-mozilla-beta+
Attachment #572671 - Flags: approval-mozilla-aurora+
Does this affect 3.6? If so, is this patch applicable there as well? Thanks!
status1.9.2: --- → ?
Based on Glandium's analysis in bug 698528, I guess this isn't exploitable. Guard pages are sneaky, crash reporting software needs to be smarter.
Whiteboard: [sg:critical][qa+] → [qa+]
(In reply to Jesse Ruderman from comment #45) > Based on Glandium's analysis in bug 698528, I guess this isn't exploitable. > Guard pages are sneaky, crash reporting software needs to be smarter. Thanks Jesse, setting flags appropriately.
Attachment #572671 - Flags: approval-mozilla-beta-
Attachment #572671 - Flags: approval-mozilla-beta+
Attachment #572671 - Flags: approval-mozilla-aurora-
Attachment #572671 - Flags: approval-mozilla-aurora+
Whiteboard: [qa+] → [sg:dos][qa+]
I backed this out from beta and aurora in order to prevent a null-deref crash (bug 713381). I left this on central however, and I still need to fix bug 713381 in order for this to be stable. https://hg.mozilla.org/releases/mozilla-beta/rev/13c0a1cf6620 https://hg.mozilla.org/releases/mozilla-aurora/rev/e78e431b2354
I'm guessing this is a wontfix on ESR, but nominating for a second opinion.
(In reply to Daniel Veditz [:dveditz] from comment #48) > I'm guessing this is a wontfix on ESR, but nominating for a second opinion. Probably not worth the hassle; we've had this problem since the early days of <video>, and the sky hasn't fallen.
Summary: Wild crashes with hundreds of <audio autoplay> → Non-main-thread stack exhaustion crashes with hundreds of <audio autoplay>
Note the final fix for this bug requires both doublec' patch here, and my patches in bug 713381.
Group: core-security
I was able to reproduce the crash on FF 10.0.2. No crashes on FF 12. Verified fixed on Firefox 12b3: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0 Mozilla/5.0 (X11; Linux i686; rv:12.0) Gecko/20100101 Firefox/12.0 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:12.0) Gecko/20100101 Firefox/12.0
Status: RESOLVED → VERIFIED
Whiteboard: [sg:dos][qa+] → [sg:dos][qa!]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: