Closed Bug 713381 Opened 13 years ago Closed 13 years ago

Crash [@ nsRefPtr<nsBuiltinDecoder>::operator->() | nsBuiltinDecoderStateMachine::ScheduleStateMachine(__int64) ]

Categories

(Core :: Audio/Video, defect)

x86
All
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla12
Tracking Status
firefox10 + fixed
firefox11 + fixed
firefox12 --- verified

People

(Reporter: bc, Assigned: cpearce)

References

()

Details

(Keywords: crash, regression, reproducible, Whiteboard: [qa!][sg:dos])

Crash Data

Attachments

(5 files, 1 obsolete file)

1. http://js1k.com/2010-first/demo/730 2. Crash @ nsRefPtr<nsBuiltinDecoder>::operator->() | nsBuiltinDecoderStateMachine::ScheduleStateMachine(__int64) This crashes Windows Beta/10, Linux Beta/10, Aurora/11, Nightly/12 in Automation. I can reproduce Windows and Linux Beta/10 locally but not Linux Aurora/11, Nightly/12. In Windows XP, Firefox 9 doesn't crash but the process hangs after closing the browser, Beta/10 release hangs and Aurora/11 crashes bp-4a773a34-ef7f-44ac-b3cc-c68d72111224 and Nightly/12 release crashes bp-8f81c1c7-1cf8-43c1-8b34-bf99b2111224. On Linux Beta/10 release crashes bp-04882f02-55c7-45bb-b0c7-2961c2111224 Operating system: Windows NT 5.1.2600 Service Pack 3 CPU: x86 GenuineIntel family 6 model 37 stepping 1 1 CPU Crash reason: EXCEPTION_ACCESS_VIOLATION_READ Crash address: 0xc Thread 0 (crashed) 0 xul.dll!nsRefPtr<nsBuiltinDecoder>::operator->() [nsAutoPtr.h : 1055 + 0x3] eip = 0x0209820a esp = 0x0012bc50 ebp = 0x0012bc54 ebx = 0x0012beb4 esi = 0x058800b0 edi = 0xffffff87 eax = 0x0000000c ecx = 0x0000000c edx = 0x10e70c0c efl = 0x00010206 Found by: given as instruction pointer in context 1 xul.dll!nsBuiltinDecoderStateMachine::ScheduleStateMachine(__int64) [nsBuiltinDecoderStateMachine.cpp : 2134 + 0xa] eip = 0x02097934 esp = 0x0012bc5c ebp = 0x0012bca8 Found by: call frame info 2 xul.dll!nsBuiltinDecoderStateMachine::ScheduleStateMachine() [nsBuiltinDecoderStateMachine.cpp : 2130 + 0xb] eip = 0x02097913 esp = 0x0012bcb0 ebp = 0x0012bcbc Found by: call frame info 3 xul.dll!nsBuiltinDecoder::ScheduleStateMachineThread() [nsBuiltinDecoder.cpp : 252 + 0x7] eip = 0x0208e065 esp = 0x0012bcc4 ebp = 0x0012bcd4 Found by: call frame info 4 xul.dll!nsBuiltinDecoder::Play() [nsBuiltinDecoder.cpp : 259 + 0x7] eip = 0x0208e0d1 esp = 0x0012bcdc ebp = 0x0012bd00 Found by: call frame info 5 xul.dll!nsHTMLMediaElement::Play() [nsHTMLMediaElement.cpp : 1377 + 0x1a] eip = 0x01cbb467 esp = 0x0012bd08 ebp = 0x0012bd5c Found by: call frame info 6 xul.dll!nsHTMLAudioElement::Play() [nsHTMLAudioElement.h : 69 + 0xb] eip = 0x01cb687c esp = 0x0012bd64 ebp = 0x0012bd68 Found by: call frame info
I can't reproduce this crash locally on Win7 x64 or Linux x64. Are you doing anything other than just loading the page? (In reply to Bob Clary [:bc:] from comment #0) > 1. http://js1k.com/2010-first/demo/730 > 2. Crash @ nsRefPtr<nsBuiltinDecoder>::operator->() | > nsBuiltinDecoderStateMachine::ScheduleStateMachine(__int64) > > This crashes Windows Beta/10, Linux Beta/10, Aurora/11, Nightly/12 in > Automation. What does "in Automation" mean?
And can you find a regression range please?
(In reply to Chris Pearce, Mozilla Corporation (:cpearce) from comment #1) > I can't reproduce this crash locally on Win7 x64 or Linux x64. Are you doing > anything other than just loading the page? I just reproduced with a Nightly debug build from yesterday on Fedora 16 32 and 64 bit and 32 bit builds on Windows 7 32bit and 64 bit but not so far Windows XP or Mac 10.5 I started the browser on Linux with gdb --args and then reloaded until it crashed and on Windows 7 attached Visual Studio and did the same there. > What does "in Automation" mean? I run automated tests on the daily crash reports using windows xp, windows 7, linux and mac 10.6. > And can you find a regression range please? Perhaps Marcia can find someone to help?
https://crash-stats.mozilla.com/report/index/c778567f-bb07-454e-8ff5-ac3d82120107 http://sd.fortixonline.com/?level=11 Just starting to play by moving the figure towards north with a nightly I got this two times to shutdown however in my debug build which has a lot of debug printfs currently I could not reproduce it.
I can sometimes reproduce it with the fortixonline URL, thanks Bernd!
Assignee: nobody → chris
This is a regression from bug 703379.
Blocks: 703379
I can't reproduce with the fortixonline testcase. In fact, I can't figure out how that game is supposed to do anything at all; whatever I click on, nothing happens. Can't reproduce on the JS1K testcase either.
I can repro this on fortixonline, it can take a while though. I have a patch to fix this, just testing it and another patch. The object of the game it to expand your territory by drawing lines to add it to the border/existing territory, without the lines your drawing being hit by cannonballs, dragons etc. Took me a while to figure it out too.
I hit this on http://www.pirateslovedaisies.com/ after max. 2 minutes in the first level, two times in a row.
The root problem here is that we're starting the decoder and creating the state machine, but the state machine hits the thread limit when trying to create the decode thread and shuts down, annulling the decoder's ref to the state machine. The media element still has a ref to the decoder, so it tries to use it when calling nsBuiltinDecoder::Play(), and in Play() we call ScheduleStateMachineThread() which tries to deref the state machine pointer which has been annulled, and then we crash. The solution: don't move to the state machine to shutdown state when we hit the thread limit, instead queue up requests for threads and create new threads as old ones are destroyed. We then don't need to shutdown the state machine (and if we did need to shutdown the state machine, we should annull the media element's mDecoder ref). (The patch I mentioned in comment 8 didn't fix this root problem, I'm working on a new patch to do that.)
That means this bug must be a regression from the landing of bug 691096, which merged to m-c at approximately 2011-11-08 01:10:01 PST.
The patch from bug 691096 is shipping in Firefox 10 in 20 days (31 January). We should up the decode thread limit to make this less likely to occur and add a null check on beta branch to at least make this doesn't crash and then fix this properly on Aurora and central.
Upping the limit will result in triggering bug 691096 again though.
FWIW, I just got this crash trying the html5audio version of cut the rope with Nightly https://crash-stats.mozilla.com/report/index/bp-3731722c-47db-46f3-b508-5620c2120111
I couldn't reproduce this crash on beta channel (Firefox 10). I could only reproduce this bug in builds after the landing of bug 703379. The changes in bug 703379 made some media loads much faster, which may have caused this regression to show. Dão, were you able to reproduce this on Firefox 10?
I immediately reproduced this on beta/10 with a debug build from yesterday using Windows XP and http://js1k.com/2010-first/demo/730
Thanks Bob. Can you still reproduce the bug with the following builds: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cpearce@mozilla.com-530db384624c That's my attempt on beta to fix the bug, if this fixes it for you, we'll try to land this on beta.
(In reply to Chris Pearce, Mozilla Corporation (:cpearce) from comment #17) > Thanks Bob. Can you still reproduce the bug with the following builds: > http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cpearce@mozilla. > com-530db384624c I was able to reproduce a crash with my Try build on the JS1k page. I think I see how that can happen, I'll try another patch...
I was finally able to reproduce this crash reliably locally in an opt build on beta branch. This patch fixes the crash for me locally on beta. * Ensure we have null checks whenever we use nsBuiltinDecoder::mDecoderStateMachine in nsBuiltinDecoder. * Increase the decoder thread limit (except on Android, we had test failures on Android when this originally landed with a higher limit). The JS1K page no longer crashes for me. Without the increased thread limit, it doesn't work properly though, it must rely on the audio elements for timing.
Attachment #587851 - Flags: review?(roc)
Comment on attachment 587851 [details] [diff] [review] Patch 1 of 3: Add state machine null checks, increase decode thread limit [Approval Request Comment] Regression caused by (bug #): bug 691096 User impact if declined: Crashes when viewing pages with lots of audio/video elements, such as most HTLM5 games (see comments above). Testing completed (on m-c, etc.): Bug fixed in locally built copy of beta branch. Risk to taking this patch (and alternatives if risky): Low.
Attachment #587851 - Flags: approval-mozilla-beta?
(In reply to Chris Pearce, Mozilla Corporation (:cpearce) from comment #20) > Comment on attachment 587851 [details] [diff] [review] > Beta Patch - Add state machine null checks, increase decode thread limit > > [Approval Request Comment] > Testing completed (on m-c, etc.): Bug fixed in locally built copy of beta > branch. Tests greenish on Try (modulo the talos tests, which aren't working on a Beta -> Try push): https://tbpl.mozilla.org/?tree=Try&rev=846e9024ded0
Comment on attachment 587851 [details] [diff] [review] Patch 1 of 3: Add state machine null checks, increase decode thread limit We should take this on aurora as well, for all the same reasons as for beta.
Attachment #587851 - Flags: approval-mozilla-aurora?
Blocks: 691096
No longer blocks: 703379
Comment on attachment 587851 [details] [diff] [review] Patch 1 of 3: Add state machine null checks, increase decode thread limit Change of plan, let's backout bug 691096 instead...
Attachment #587851 - Flags: approval-mozilla-beta?
Attachment #587851 - Flags: approval-mozilla-aurora?
[Approval Request Comment] Regression caused by (bug #): 691096 User impact if declined: We will sporadically (and sometimes reliably) crash when loading many HTML5 game sites that load and/or play 25 or more HTML5 audio elements concurrently (Pirates Love Daisies, fortixonline, JS1K demos). Testing completed (on m-c, etc.): This is a backout, we worked fine before this regressing patch landed... Risk to taking this patch (and alternatives if risky): We should take this backout patch on aurora and beta, and fix the issue properly on central. Once we take this patch we'll be vulnerable to bug 691096 again on aurora/beta. This means if we load a very large number of media elements concurrently (200+), we'll crash anyway. However we've been vulnerable to that for years and the sky never fell, so I think it's OK to wait for us to fix this properly on central and wait for that fix to flow through to release. The alternative is to take my patch (attachment 587851 [details] [diff] [review]) which fixes the crash in bug 691096's patch by adding null checks. However with that patch, when we hit the enforced limit on decode threads (25 or whatever we set it to) we end up making media unplayable, which will break sites as well. So I think we should just back out the regressing patches and fix it properly.
Attachment #588169 - Flags: approval-mozilla-beta?
Attachment #588169 - Flags: approval-mozilla-aurora?
Comment on attachment 588169 [details] [diff] [review] Patch - backout bug 691096 [Triage Comment] Approved for aurora and beta since the crashes that we fixed prior is significantly less than the crashes the backout will fix.
Attachment #588169 - Flags: approval-mozilla-beta?
Attachment #588169 - Flags: approval-mozilla-beta+
Attachment #588169 - Flags: approval-mozilla-aurora?
Attachment #588169 - Flags: approval-mozilla-aurora+
This is based on top of my earlier patch "Add state machine null checks, increase decode thread limit". We should take that patch on central (I'm not sure if we need/should take the thread limit increase, but we should definitely take the addition of the null checks). *This* patch queues up requests for new decode threads when the limit is reached, rather than shutting down the decoder. I had to disable a spam-media-elements test on android, since it appears the platform sound APIs can't handle playing large numbers of sounds at once. Green on try: https://tbpl.mozilla.org/?tree=Try&rev=b39bf0a3543f
Attachment #588290 - Flags: review?(roc)
Backed out Bug 691096 from: beta: https://hg.mozilla.org/releases/mozilla-beta/rev/13c0a1cf6620 aurora: https://hg.mozilla.org/releases/mozilla-aurora/rev/e78e431b2354 I still need to land the null-checks patch and the queue thread requests patch on central to fix this on central.
Comment on attachment 588290 [details] [diff] [review] Patch: Queue requests for decode thread creation when limit reached Review of attachment 588290 [details] [diff] [review]: ----------------------------------------------------------------- ::: content/media/nsBuiltinDecoderStateMachine.cpp @@ +339,5 @@ > --mDecodeThreadCount; > + while (mDecodeThreadCount < MAX_DECODE_THREADS && !mPending.IsEmpty()) { > + nsBuiltinDecoderStateMachine* m = > + static_cast<nsBuiltinDecoderStateMachine*>(mPending.ElementAt(0)); > + mPending.RemoveElementAt(0); This is going to scale poorly when the array gets really large. Use nsDeque? ::: content/media/test/manifest.js @@ +4,5 @@ > > // These are small test files, good for just seeing if something loads. We > // really only need one test file per backend here. > var gSmallTests = [ > + { name:"small-shot.ogg", type:"audio/ogg", duration:0.276 }, Not sure why you move this.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #29) > Comment on attachment 588290 [details] [diff] [review] > Patch: Queue requests for decode thread creation when limit reached > > Review of attachment 588290 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: content/media/nsBuiltinDecoderStateMachine.cpp > @@ +339,5 @@ > > --mDecodeThreadCount; > > + while (mDecodeThreadCount < MAX_DECODE_THREADS && !mPending.IsEmpty()) { > > + nsBuiltinDecoderStateMachine* m = > > + static_cast<nsBuiltinDecoderStateMachine*>(mPending.ElementAt(0)); > > + mPending.RemoveElementAt(0); > > This is going to scale poorly when the array gets really large. Use nsDeque? Then how can we remove decoders from the middle of the nsDeque in order to cancel a pending request? I could change to using a list I suppose. > ::: content/media/test/manifest.js > @@ +4,5 @@ > > > > // These are small test files, good for just seeing if something loads. We > > // really only need one test file per backend here. > > var gSmallTests = [ > > + { name:"small-shot.ogg", type:"audio/ogg", duration:0.276 }, > > Not sure why you move this. It makes the test faster, without this change for the new test we'll use r11025_s16_c1.wav which is 1s long, whereas small-shot.ogg is only 0.276s long, and since we can only decode a limited number in parallel, we effectively play elements in a "sliding window" over the range of elements we're playing. So making the elements' resource shorter makes the test faster.
(In reply to Chris Pearce (:cpearce) from comment #30) > Then how can we remove decoders from the middle of the nsDeque in order to > cancel a pending request? I could change to using a list I suppose. The ability to remove an element could be added to nsDeque.
Adding the ability to remove an element from nsDeque seems like the best approach, I'll do that.
We'd like to use nsDeque to queue a list of decoders waiting for new threads once the media-decode-thread limit is reached, but we need to be able to cancel a request for a thread, and remove it from the deque. nsDeque doesn't have the ability to remove elements, so I've implemented it here, and re-enabled its tests.
Attachment #588290 - Attachment is obsolete: true
Attachment #588290 - Flags: review?(roc)
Attachment #589371 - Flags: review?(benjamin)
Using nsDeque instead of nsAutoTArray.
Attachment #589373 - Flags: review?(roc)
Attachment #589371 - Attachment description: Patch 1 of 2: Add nsDeque::RemoveObjectAt(index) → Patch 2 of 3: Add nsDeque::RemoveObjectAt(index)
Attachment #589373 - Attachment description: Patch 2 of 3: Queue requests for decode threads when the limit is reached → Patch 3 of 3: Queue requests for decode threads when the limit is reached
Attachment #587851 - Attachment description: Beta Patch - Add state machine null checks, increase decode thread limit → Patch 1 of 3: Add state machine null checks, increase decode thread limit
Attachment #589371 - Flags: review?(benjamin) → review+
Patches 1-3 inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/38271572005b https://hg.mozilla.org/integration/mozilla-inbound/rev/d5ebbc25b4b8 https://hg.mozilla.org/integration/mozilla-inbound/rev/6435f51dd10d Note I only landed the change to add null checks from patch 1, I didn't land the change to increase the thread limit, as we don't need that when we have patch 3, where requests for threads are queued.
Target Milestone: --- → mozilla12
Test timeout, and failure to run subsequent media tests: https://tbpl.mozilla.org/php/getParsedLog.php?id=8647352&tree=Mozilla-Inbound I'll back out...
The test failures are on Linux only, and appear to be caused by audio streams failing to close/shut down (can't quite tell, the call stacks don't have function names), which is causing the audio threads to not shutdown. This is a known issue on Linux, the fix really is to wait for libcucbeb to rescue us and to get rid of the audio threads. So in the meantime, this patch changes the test to not play, but still ensure we decode - since we can control the lifecycle of the decode threads, and that's what this bug was really all about anyway. We can deal with the audio threads problem after libcubeb lands.
Attachment #589796 - Flags: review?(roc)
Whiteboard: [qa+]
Mozilla/5.0 (Windows NT 6.0; rv:12.0a2) Gecko/20120217 Firefox/12.0a2 Mozilla/5.0 (Windows NT 6.0; rv:11.0) Gecko/20100101 Firefox/11.0 This still occurs on Windows Vista on Firefox 11 Beta 3, latest Firefox 12 Aurora, latest Firefox Nightly. Crash when loading http://js1k.com/2010-first/demo/730 https://crash-stats.mozilla.com/report/index/bp-57aae22f-6829-44f4-a908-001a12120217 https://crash-stats.mozilla.com/report/index/bp-072977ad-9739-4697-b7ee-145e72120217 https://crash-stats.mozilla.com/report/index/bp-d473ab17-9822-43af-a768-f485b2120217 Doesn't occur on Windows 7 x86. Tried with 11 Beta 3 and Nightly. Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0
Firefox 11.0 Crash Report [@ TList<CMixer>::RemoveAt(void*) ] Firefox 13.0a1 Crash Report [@ TList<CMixer>::RemoveAt(void*) ] looks like bug 556434 UNCONFIRMED Firefox 3.6.2 Crash [@ TList<CMixer>::RemoveAt(void*) ] Looking at the "more reports" links, these do appear to be primarily Vista crashes. :-/
Keywords: reproducible
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0 Mozilla/5.0 (X11; Linux i686; rv:11.0) Gecko/20100101 Firefox/11.0 Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0 In addition to verification on Windows 7 as in comment 42, verified on Windows XP, Mac OS 10.7 and Ubuntu 11.10 (both 32 bits). No crashes occurred when loading the page. While looking at crash stats though, I can still see 2 crashes for Windows XP related to this signature for F11Beta: https://crash-stats.mozilla.com/report/index/7a219723-7d68-47b0-b7a9-926952120228 https://crash-stats.mozilla.com/report/index/276f39e6-756b-4197-91a4-75dc62120207 Is this expected in any way?
(In reply to Virgil Dicu [:virgil] [QA] from comment #44) > https://crash-stats.mozilla.com/report/index/7a219723-7d68-47b0-b7a9- > 926952120228 This is 11.0b3 > https://crash-stats.mozilla.com/report/index/276f39e6-756b-4197-91a4- > 75dc62120207 This is 11.0b1 No reports on 11.0b4. Any reason we can't conclude this is fixed and just reopen or file a new if the signature becomes explosive?
I see a report from Beta 8: https://crash-stats.mozilla.com/report/index/f72c1e99-7dd9-49fe-b179-98e792120312 in the first signature. But this is a low enough volume crash that I don't think there is too much cause for worry.
Whiteboard: [qa+] → [qa+][sg:dos]
Mozilla/5.0 (X11; Linux i686; rv:12.0) Gecko/20100101 Firefox/12.0 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:12.0) Gecko/20100101 Firefox/12.0 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0 Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0 No crash with F 12 beta 3 with the link from comment 0 on Windows XP, 7, Mac OS 10.6 and Ubuntu 11.10. No crash reports for F 12.
Status: RESOLVED → VERIFIED
Whiteboard: [qa+][sg:dos] → [qa!][sg:dos]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: