744836 - crash @ RtlEnterCriticalSection | PR_Lock | nsThread::Dispatch | nsTimerImpl::PostTimerEvent

Reporter

Description

•

13 years ago

It's #34 top browser crasher in 11.0, #31 in 12.0b4, #46 in 13.0a2 and #113 in 14.0a1.

On comment says on April 12th: "This crash happened after the latest mcrosoft update" bp-13f916c4-c46f-47ee-83d2-92cf22120412

Signature 	RtlEnterCriticalSection | PR_Lock | nsThread::Dispatch(nsIRunnable*, unsigned int) More Reports Search
UUID	3cc13d95-0344-4e93-854e-3840a2120412
Date Processed	2012-04-12 11:45:27
Uptime	98
Install Age	8.1 hours since version was first installed.
Install Time	2012-04-12 03:41:16
Product	Firefox
Version	13.0a2
Build ID	20120411042011
Release Channel	aurora
OS	Windows NT
OS Version	5.1.2600 Service Pack 2
Build Architecture	x86
Build Architecture Info	GenuineIntel family 15 model 2 stepping 4
Crash Reason	EXCEPTION_ACCESS_VIOLATION_READ
Crash Address	0x30
App Notes 	
AdapterVendorID: 0x10de, AdapterDeviceID: 0x002d, AdapterSubsysID: 001e10de, AdapterDriverVersion: 5.6.7.3
D3D10 Layers? D3D10 Layers- D3D9 Layers? D3D9 Layers- 
EMCheckCompatibility	True	
Total Virtual Memory	2147352576
Available Virtual Memory	1935941632
System Memory Use Percentage	31
Available Page File	1744617472
Available Physical Memory	553472000

Frame 	Module 	Signature 	Source
0 	ntdll.dll 	RtlEnterCriticalSection 	
1 	nspr4.dll 	PR_Lock 	nsprpub/pr/src/threads/combined/prulock.c:233
2 	xul.dll 	nsThread::Dispatch 	xpcom/threads/nsThread.cpp:440
3 	xul.dll 	nsTimerImpl::PostTimerEvent 	xpcom/threads/nsTimerImpl.cpp:628
4 	xul.dll 	TimerThread::Run 	xpcom/threads/TimerThread.cpp:319
5 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:657
6 	xul.dll 	nsThread::ThreadFunc 	xpcom/threads/nsThread.cpp:289
7 	nspr4.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:426
8 	nspr4.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:122
9 	msvcr100.dll 	_callthreadstartex 	f:\dd\vctools\crt_bld\self_x86\crt\src\threadex.c:314
10 	msvcr100.dll 	_threadstartex 	f:\dd\vctools\crt_bld\self_x86\crt\src\threadex.c:292
11 	kernel32.dll 	BaseThreadStart 	

More reports at:
https://crash-stats.mozilla.com/report/list?signature=RtlEnterCriticalSection+|+PR_Lock+|+nsThread%3A%3ADispatch%28nsIRunnable*%2C+unsigned+int%29

laceywww

Comment 1

•

12 years ago

Attached audio test audio file for crash testcase. — Details

This is a test audio file for a testcase that I am uploading. There is nothing special about the audio file itself, it's only useful since it's short and can be looped quickly to help find an intermittent error.

laceywww

Updated

•

12 years ago

Attachment #660574 - Attachment mime type: application/octet-stream → audio/ogg

laceywww

Comment 2

•

12 years ago

Attached file Crash testcase. loops audio file leadin to a crash. — Details

I hope this is the right bug to attach this testcase to(the signature matches the crash I'm getting: https://crash-stats.mozilla.com/report/index/bp-b1c8c6b4-a521-4c87-8be9-051ec2120912)

The crash should occur soon after pressing "play" on the <audio> element. It will loop repeatedly until it crashes; sometimes this occurs quickly, sometimes it takes several minutes or more. This will crash on two computers I've tested, both running FF 15.0.1, another running the current Aurora build; one is under windows 7, the other windows XP. The crash occurs as the song ends and a "new" source is set. It only occurs when setting the src element after playing; other tests looping without setting source had no problem, nor did setting source repeatedly without playing cause a problem.

laceywww

Updated

•

12 years ago

Attachment #660578 - Attachment mime type: text/plain → text/html

Scoobidiver (away)

Reporter

Comment 3

•

12 years ago

It doesn't crash for me.

laceywww

Comment 4

•

12 years ago

Interesting. How long did you let it run? I've had one going for ~20 minutes now and it hasn't crashed(though it did earlier several times); but on another machine it just crashed within probably less than a minute. I should also note that I've only tested the crash under single-core systems(one a laptop running Windows 7, another a vmware instance running Windows XP on top of Ubuntu 10.04, restricted to one core). Testing it now under a dual-core vmware instance.

laceywww

Comment 5

•

12 years ago

The instance mentioned in my prior comment that had run ~20 minutes, crashed at ~35 minutes; so clearly, my earlier assertion that it should fail "soon" isn't always true(though so far it does seem to fail soon most of the time for me: after failing at ~35 minutes, it failed again several times in a row in just seconds). A dual-core WindowsXP instance under vmware crashed at around 38 minutes, then again at 14 minutes.

Tyler Downer [He/Him]

Comment 6

•

12 years ago

I have a user still experiencing this in 18, https://crash-stats.mozilla.com/report/index/ccdd4bbe-2f5c-4096-b506-2448d2130116, any progress or should we do 1:1?

Version: 11 Branch → 18 Branch

Robert Kaiser

Comment 7

•

12 years ago

This signature is now entering the top 10 crashes on Aurora 21.

Benjamin Smedberg

Comment 8

•

12 years ago

Ehsan, do you have any clues about this one? It appears that we're posting an event to a dead thread, but that would mean that we have a refcounting error (because TimerImpl::mEventTarget is a strong ref). Otherwise this means that we're dispatching from a dead nsTimerImpl?

(No longer employed by Mozilla) Aaron Klotz

Assignee

Updated

•

12 years ago

Assignee: nobody → aklotz

(no longer active)

Comment 9

•

12 years ago

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> Ehsan, do you have any clues about this one? It appears that we're posting
> an event to a dead thread, but that would mean that we have a refcounting
> error (because TimerImpl::mEventTarget is a strong ref). Otherwise this
> means that we're dispatching from a dead nsTimerImpl?

NS_GetCurrentThread() can return null, right?  What if mEventTarget is just null?  (That doesn't seem that implausible given the crash address of 0x20.)

Benjamin Smedberg

Comment 10

•

12 years ago

Yeah, I somehow missed that clue.

NS_GetCurrentThread should only be able to return null before nsThreadManager::Init or after nsThreadManager::Shutdown. So really mEventTarget should never have the opportunity to be null in normal operation. Perhaps somebody is trying to create a timer very early in startup?

So I guess we can wallpaper this by just null-checking mEventTarget in nsTimerImpl::PostTimerEvent, but we should probably also check it nsTimerImpl::InitCommon.

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 11

•

12 years ago

Attached patch null checking for mEventTarget (obsolete) — Details — Splinter Review

This theory makes perfect sense to me as well.

Attachment #724498 - Flags: review?(benjamin)

(no longer active)

Comment 12

•

12 years ago

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> Yeah, I somehow missed that clue.
> 
> NS_GetCurrentThread should only be able to return null before
> nsThreadManager::Init or after nsThreadManager::Shutdown. So really
> mEventTarget should never have the opportunity to be null in normal
> operation. Perhaps somebody is trying to create a timer very early in
> startup?

That seems implausible given the main thread stack in the crash report in comment 6, IINM:

0 	ntdll.dll 	NtWaitForSingleObject 	
1 	KERNELBASE.dll 	WaitForSingleObjectEx 	
2 	kernel32.dll 	WaitForSingleObjectExImplementation 	
3 	kernel32.dll 	WaitForSingleObject 	
4 	nspr4.dll 	PR_Wait 	nsprpub/pr/src/threads/prmon.c:152
5 	xul.dll 	mozilla::ReentrantMonitor::Wait 	obj-firefox/dist/include/mozilla/ReentrantMonitor.h:90
6 	xul.dll 	nsThreadStartupEvent::Wait 	xpcom/threads/nsThread.cpp:174
7 	xul.dll 	nsThread::Init 	xpcom/threads/nsThread.cpp:342
8 	xul.dll 	nsThreadManager::NewThread 	xpcom/threads/nsThreadManager.cpp:215
9 	xul.dll 	NS_NewThread_P 	obj-firefox/xpcom/build/nsThreadUtils.cpp:49
10 	xul.dll 	NS_NewNamedThread_P<12> 	obj-firefox/dist/include/nsThreadUtils.h:88
11 	xul.dll 	StateMachineTracker::EnsureGlobalStateMachine 	content/media/nsBuiltinDecoderStateMachine.cpp:267
12 	xul.dll 	nsBuiltinDecoderStateMachine::nsBuiltinDecoderStateMachine 	content/media/nsBuiltinDecoderStateMachine.cpp:413
13 	xul.dll 	nsOggDecoder::CreateStateMachine 	content/media/ogg/nsOggDecoder.cpp:13
14 	xul.dll 	nsBuiltinDecoder::Load 	content/media/nsBuiltinDecoder.cpp:328
15 	xul.dll 	nsHTMLMediaElement::FinishDecoderSetup 	content/html/content/src/nsHTMLMediaElement.cpp:2665
16 	xul.dll 	nsHTMLMediaElement::InitializeDecoderForChannel 	content/html/content/src/nsHTMLMediaElement.cpp:2640
17 	xul.dll 	nsHTMLMediaElement::MediaLoadListener::OnStartRequest 	content/html/content/src/nsHTMLMediaElement.cpp:345
18 	xul.dll 	mozilla::net::nsHttpChannel::CallOnStartRequest 	netwerk/protocol/http/nsHttpChannel.cpp:962
19 	xul.dll 	mozilla::net::nsHttpChannel::ContinueOnStartRequest2 	netwerk/protocol/http/nsHttpChannel.cpp:4859
20 	xul.dll 	mozilla::net::nsHttpChannel::OnStartRequest 	netwerk/protocol/http/nsHttpChannel.cpp:4832
21 	xul.dll 	nsInputStreamPump::OnStateStart 	netwerk/base/src/nsInputStreamPump.cpp:417
22 	xul.dll 	nsInputStreamReadyEvent::Run 	xpcom/io/nsStreamUtils.cpp:82
23 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:620
24 	xul.dll 	mozilla::ipc::MessagePump::Run 	ipc/glue/MessagePump.cpp:117
25 	xul.dll 	MessageLoop::RunHandler 	ipc/chromium/src/base/message_loop.cc:208
26 	xul.dll 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:182
27 	xul.dll 	nsBaseAppShell::Run 	widget/xpwidgets/nsBaseAppShell.cpp:163
28 	xul.dll 	nsAppShell::Run 	widget/windows/nsAppShell.cpp:232
29 	xul.dll 	nsAppStartup::Run 	toolkit/components/startup/nsAppStartup.cpp:290
30 	xul.dll 	XREMain::XRE_mainRun 	toolkit/xre/nsAppRunner.cpp:3794
31 	xul.dll 	XREMain::XRE_main 	toolkit/xre/nsAppRunner.cpp:3860
32 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp:3935
33 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:105

I think we should look at the crash dump and try to see what we're missing.  This _could_ be a memory corruption or something like that too.  I don't think we should be too quick to take the wallpaper patch.

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 13

•

12 years ago

Fair enough, I'll keep digging.

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 14

•

12 years ago

I might be completely out to lunch on this, but I offer some food for thought:

As we can see in the dump from comment 6, the main thread is in the midst of creating the state machine thread in StateMachineTracker::EnsureGlobalStateMachine. 

OTOH, there is a timer being set in nsBuiltinDecoderStateMachine::ScheduleStateMachine. nsITimer::SetTarget is being called with the return value from GetStateMachineThread. Since the state machine thread hasn't been created yet, GetStateMachineThread would be nullptr, thus feeding a mythical NULL mEventTarget into the timer.

There are numerous checks to ensure that this doesn't happen but they are all macros that are no-ops in non-debug builds AFAICT.

The one problem with this theory is that it's not clear to me who could be calling ScheduleStateMachine at this point. It looks like it's so early in the decoder initialization that this can't happen.

(no longer active)

Comment 15

•

12 years ago

(In reply to Aaron Klotz [:aklotz] from comment #14)
> I might be completely out to lunch on this, but I offer some food for
> thought:
> 
> As we can see in the dump from comment 6, the main thread is in the midst of
> creating the state machine thread in
> StateMachineTracker::EnsureGlobalStateMachine. 
> 
> OTOH, there is a timer being set in
> nsBuiltinDecoderStateMachine::ScheduleStateMachine. nsITimer::SetTarget is
> being called with the return value from GetStateMachineThread. Since the
> state machine thread hasn't been created yet, GetStateMachineThread would be
> nullptr, thus feeding a mythical NULL mEventTarget into the timer.
> 
> There are numerous checks to ensure that this doesn't happen but they are
> all macros that are no-ops in non-debug builds AFAICT.
> 
> The one problem with this theory is that it's not clear to me who could be
> calling ScheduleStateMachine at this point. It looks like it's so early in
> the decoder initialization that this can't happen.

With a quick glance, you might be on to something here.  I've looked a bit at the guts of our decoder infrastructure before, but I wouldn't dare say that I understand much of it.  Chris however should have a good grasp on this code.

Chris, do you see how this can happen?  I'd appreciate if you can either validate or discredit our theories here.

Flags: needinfo?(cpearce)

Benjamin Smedberg

Comment 16

•

12 years ago

Comment on attachment 724498 [details] [diff] [review]
null checking for mEventTarget

I don't allow NS_ENSURE_* in my modules, and in this case I think it should assert, so please use

if (!mEventTarget) {
  NS_ERROR("...";
  return NS_ERROR_NOT_INITIALIZED;
}

Attachment #724498 - Flags: review?(benjamin) → review-

Chris Pearce [:cpearce (Not reading bugmail)]

Comment 17

•

12 years ago

(In reply to :Ehsan Akhgari (needinfo? me!) from comment #15)
> (In reply to Aaron Klotz [:aklotz] from comment #14)
> > I might be completely out to lunch on this, but I offer some food for
> > thought:
> > 
> > As we can see in the dump from comment 6, the main thread is in the midst of
> > creating the state machine thread in
> > StateMachineTracker::EnsureGlobalStateMachine. 
> > 
> > OTOH, there is a timer being set in
> > nsBuiltinDecoderStateMachine::ScheduleStateMachine. nsITimer::SetTarget is
> > being called with the return value from GetStateMachineThread. Since the
> > state machine thread hasn't been created yet, GetStateMachineThread would be
> > nullptr, thus feeding a mythical NULL mEventTarget into the timer.
> > 
> > There are numerous checks to ensure that this doesn't happen but they are
> > all macros that are no-ops in non-debug builds AFAICT.

We can make some of the checks non-no-ops in non-debug builds if it would help to narrow the cause of this crash.

> > 
> > The one problem with this theory is that it's not clear to me who could be
> > calling ScheduleStateMachine at this point. It looks like it's so early in
> > the decoder initialization that this can't happen.

I agree, I don't see how ScheduleStateMachine() could be called this early on during a load.

We only publish a reference to the decoder and thus make it/ScheduleStateMachine() callable by the HTMLMediaElement later on in nsHTMLMediaElement::FinishDecoderSetup(), after MediaDecoder/nsBuiltinDecoder::Load() has finished, but in the stack above we're still working our way through Load().

And the only code that calls ScheduleStateMachine() should either be run on the state machine thread, or be initiated by calls to the MediaDecoder reference we publish after Load() has finished.

We're also quite careful to cancel the timer before we destroy the state machine and state machine thread, and are careful to drop the global reference to the thread before destroying the thread and so forth.

So I don't see how this could happen, and if it could happen I'd expect our tests to catch it.

Having said that, we've been stung enough times by this area of code that I won't rule it out. ;)

Flags: needinfo?(cpearce)

Benjamin Smedberg

Updated

•

12 years ago

Blocks: 842004

Scoobidiver (away)

Reporter

Updated

•

12 years ago

No longer blocks: 842004

Scoobidiver (away)

Reporter

Updated

•

12 years ago

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 19

•

12 years ago

Attached patch null checking for mEventTarget, v2 — Details — Splinter Review

Incorporated the changes requested by Benjamin.

Attachment #724498 - Attachment is obsolete: true

Attachment #740903 - Flags: review?(benjamin)

Benjamin Smedberg

Updated

•

12 years ago

Attachment #740903 - Flags: review?(benjamin) → review+

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 20

•

12 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/f8077e8edd44

Ryan VanderMeulen [:RyanVM]

Comment 21

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/f8077e8edd44

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla23

Scoobidiver (away)

Reporter

Comment 22

•

12 years ago

There are three crashes after the patch landing: bp-09cdb49e-e81a-4f1a-a68f-fc6e92130426, bp-2c2bc678-b4e7-4634-a222-211d62130426 and bp-be603078-be7c-4de0-bbcb-9608a2130426.

(no longer active)

Updated

•

12 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Scoobidiver (away)

Reporter

Comment 23

•

12 years ago

It's #8 top browser crasher in 20.0.1, #7 in 21.0b3, #10 in 22.0a2, and #76 in 23.0a1.

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 24

•

12 years ago

I've done some more digging as to what is going on and why many of these crashes are at address 0x20:

The mLock field is a mozilla::Mutex. That object uses a PRLock, which is allocated off the heap. The CRITICAL_SECTION structure is located at offset 0x1c from the beginning of the PRLock structure, and EnterCriticalSection first attempts to modify CRITICAL_SECTION::LockCount, which is at offset 4 from the beginning of the CS, hence our offset of 0x20.

Since mozilla::Mutex sets the PRLock pointer to 0 upon destruction, we're looking at a use-after-free situation with respect to the nsThread object.

(no longer active)

Comment 25

•

12 years ago

Can it not be a null dereference?

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 26

•

12 years ago

It is a dereference of a null PRLock pointer contained within the nsThread, but pointer to the nsThread object itself would be non-null.

Geoff Brown [:gbrown]

Comment 27

•

12 years ago

Bug 761987 is a long-standing Android crash bug with thousands of failure reports. Some of those logs appear to be mis-filed and contain stacks very similar to this bug. 

For example: 

https://tbpl.mozilla.org/php/getParsedLog.php?id=22968490&tree=Mozilla-Inbound#error0 contains

Crash reason:  SIGSEGV
Crash address: 0x3c

Thread 19 (crashed)
 0  libnss3.so!PR_Lock [ptsynch.c : 184 + 0x2]
     r4 = 0x00000000    r5 = 0x603ffdec    r6 = 0x8007000e    r7 = 0x603ffdd8
     r8 = 0x0d43053f    r9 = 0x00000094   r10 = 0x603ffe9f    fp = 0x5f81d9cc
     sp = 0x603ffdd0    lr = 0x5f3eb91d    pc = 0x5f3eb91e
    Found by: given as instruction pointer in context
 1  libxul.so!nsThread::GetObserver(nsIThreadObserver**) [Mutex.h : 74 + 0x3]
     sp = 0x603ffdd8    pc = 0x63354023
    Found by: stack scanning
 2  libxul.so!nsThread::PutEvent(nsIRunnable*) [nsThread.h : 70 + 0x7]
     sp = 0x603ffde8    pc = 0x633547ab
    Found by: stack scanning
 3  libxul.so!nsTimerImpl::PostTimerEvent() [nsTimerImpl.cpp:1d3b60ab634a : 676 + 0x9]
     sp = 0x603ffe00    pc = 0x63356993
    Found by: stack scanning
 4  libxul.so!TimerThread::Run() [TimerThread.cpp : 232 + 0x5]
     sp = 0x603ffe18    pc = 0x63356e8b
    Found by: stack scanning
 5  libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp : 627 + 0x5]
     sp = 0x603ffe50    pc = 0x63354495
    Found by: stack scanning
 6  libxul.so!NS_ProcessNextEvent(nsIThread*, bool) [nsThreadUtils.cpp : 238 + 0x9]
     sp = 0x603ffe98    pc = 0x6333274b
    Found by: stack scanning
 7  libxul.so!nsTransportEventSinkProxy::QueryInterface(nsID const&, void**) [nsTransportUtils.cpp : 92 + 0x11]
     sp = 0x603ffe9c    pc = 0x629d14dd
    Found by: stack scanning
 8  libxul.so!nsThread::ThreadFunc(void*) [nsThread.cpp : 265 + 0x7]
     sp = 0x603ffeb0    pc = 0x63354261
    Found by: stack scanning

I'm not sure if this is the Android expression of the problem under investigation here or a slightly different issue.

Tracy Walker [:tracy]

Comment 28

•

12 years ago

Jumped to #5 on Fx21 topcrash list

Scoobidiver (away)

Reporter

Comment 29

•

11 years ago

Here are interesting correlations in 21.0 on May 24:
  RtlEnterCriticalSection | PR_Lock | nsThread::Dispatch(nsIRunnable*, unsigned int)|EXCEPTION_ACCESS_VIOLATION_WRITE (641 crashes)
     79% (504/641) vs.   1% (1014/69132) msmpeg2adec.dll (Microsoft DTV-DVD Audio Decoder)
          0% (0/641) vs.   0% (3/69132) 11.0.6001.7000
         79% (504/641) vs.   1% (1011/69132) 6.1.7140.0
     98% (628/641) vs.  21% (14805/69132) mfreadwrite.dll (Media Foundation ReadWrite)
          3% (19/641) vs.   2% (1430/69132) 12.0.7600.16385
          5% (32/641) vs.   2% (1640/69132) 12.0.7600.16597
         89% (571/641) vs.  15% (10331/69132) 12.0.7601.17514
          1% (6/641) vs.   0% (285/69132) 12.0.7601.17596
     98% (628/641) vs.  21% (14820/69132) mf.dll (Media Foundation)
          3% (19/641) vs.   2% (1407/69132) 12.0.7600.16385
          5% (32/641) vs.   2% (1630/69132) 12.0.7600.16597
         90% (576/641) vs.  15% (10529/69132) 12.0.7601.17514
          0% (1/641) vs.   0% (50/69132) 12.0.7601.21769
     98% (628/641) vs.  22% (14897/69132) mfplat.dll (Media Foundation Platform)
         98% (628/641) vs.  20% (13774/69132) 12.0.7600.16385

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 30

•

11 years ago

I've been able to consistently repro this issue using my old Windows XP box when forcing it to boot with a single core. I believe that I have identified the problem and am working on a fix.

Andrew Truong [:feer56] [:andrewtruong] [:atruong]

Comment 31

•

11 years ago

Reported on forums: https://support.mozilla.org/en-US/questions/960644

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 32

•

11 years ago

Attached patch nsTimerEvent fix — Details — Splinter Review

This was quite the adventure (and exactly the kind of story that I expect to hear from our stability engineer candidates)! I've been able to stamp this out in my repro envrionment. This is _really_ subtle, so please bear with me:

The problem is that nsTimerEvent releases its reference to its timer object too soon. The event has successfully dispatched to the target thread, however in this case the timer thread has not yet returned out of Dispatch() when the nsTimerEvent runs on the target thread.

In the case of MediaDecoderStateMachine, when the timer fires during its shutdown sequence, it triggers the dispatching of some nsRunnables that eventually result in the MediaDecoderStateMachine's destruction on the main thread. As part of its destruction, MediaDecoderStateMachine releases its mTimer object. Also note that the Media State thread shared by all instances of MediaDecoderStateMachine might also be released at this point.

Since nsTimerEvent releases its reference to the same timer when nsTimerEvent::Run() completes, the timer is destroyed and its reference to mEventTarget is also released. Unfortunately mEventTarget is the Media State thread, which was also released above. The timer's efforts to Dispatch() (or finish Dispatch()ing) to that thread will now crash, as the mEventTarget is non-null but its object has been destroyed.

I believe that the fix is to ensure that nsTimerEvent holds a reference to the timer until the nsTimerEvent itself is destroyed, ensuring that the timer and its event target remain valid until dispatching has completely finished.

Attachment #755600 - Flags: review?(benjamin)

Chris Pearce [:cpearce (Not reading bugmail)]

Comment 33

•

11 years ago

Nice catch Aaron!

Benjamin Smedberg

Comment 34

•

11 years ago

Comment on attachment 755600 [details] [diff] [review]
nsTimerEvent fix

I'd like a second-review on this.

Attachment #755600 - Flags: review?(ehsan)

Attachment #755600 - Flags: review?(benjamin)

Attachment #755600 - Flags: review+

(no longer active)

Comment 35

•

11 years ago

Comment on attachment 755600 [details] [diff] [review]
nsTimerEvent fix

Review of attachment 755600 [details] [diff] [review]:
-----------------------------------------------------------------

Great analysis!

Attachment #755600 - Flags: review?(ehsan) → review+

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 36

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/c51e0af2a043

Try build:
https://tbpl.mozilla.org/?tree=Try&rev=1e67d2d0b9dd

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 37

•

11 years ago

Comment on attachment 755600 [details] [diff] [review]
nsTimerEvent fix

[Approval Request Comment]
Regression caused by (bug #): Unknown, affected code showed revision 1 in hg, so it's pretty old. This bug relies on very subtle thread timings that have been occurring more frequently as of late.
User impact if declined: Browser crashes, particularly on slower machines. #5 topcrasher on release, #10 on beta, #11 on aurora. Crash-stats comments suggest that this often happens while listening to music on pandora.com.
Testing completed (on m-c, etc.): Ran patched nightly and release try builds on test machine that could reproduce issue, no longer crashes
Risk to taking this patch (and alternatives if risky): Very low, a simple change
String or IDL/UUID changes made by this patch: None

Attachment #755600 - Flags: approval-mozilla-release?

Attachment #755600 - Flags: approval-mozilla-beta?

Attachment #755600 - Flags: approval-mozilla-aurora?

Ryan VanderMeulen [:RyanVM]

Comment 38

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/c51e0af2a043

Status: REOPENED → RESOLVED

Closed: 12 years ago → 11 years ago

Resolution: --- → FIXED

Target Milestone: mozilla23 → mozilla24

Alex Keybl [:akeybl]

Updated

•

11 years ago

status-firefox21: --- → wontfix

status-firefox22: --- → affected

status-firefox23: --- → affected

status-firefox24: --- → fixed

tracking-firefox22: --- → +

tracking-firefox23: --- → +

tracking-firefox24: --- → +

Alex Keybl [:akeybl]

Updated

•

11 years ago

Attachment #755600 - Flags: approval-mozilla-release?

Attachment #755600 - Flags: approval-mozilla-release-

Attachment #755600 - Flags: approval-mozilla-beta?

Attachment #755600 - Flags: approval-mozilla-beta+

Attachment #755600 - Flags: approval-mozilla-aurora?

Attachment #755600 - Flags: approval-mozilla-aurora+

Ryan VanderMeulen [:RyanVM]

Comment 39

•

11 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/dc00e08878b8
https://hg.mozilla.org/releases/mozilla-beta/rev/f1a215b86459

status-firefox22: affected → fixed

status-firefox23: affected → fixed

Ryan VanderMeulen [:RyanVM]

Comment 42

•

11 years ago

Backed out from Aurora due to frequent Windows XP mochitest leaks. No idea why m-c or beta don't seem to be leaking.
https://hg.mozilla.org/releases/mozilla-aurora/rev/5d2958ada047

https://tbpl.mozilla.org/php/getParsedLog.php?id=23747986&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=23744096&tree=Mozilla-Aurora

status-firefox23: fixed → affected

(No longer employed by Mozilla) Aaron Klotz

Assignee

Updated

•

11 years ago

Depends on: 885095

Ryan VanderMeulen [:RyanVM]

Comment 43

•

11 years ago

Comment on attachment 755600 [details] [diff] [review]
nsTimerEvent fix

Missed the Aurora train. If this still needs to land on Fx23, please request beta approval for whatever patch needs to eventually land.

(Clearing the old approval flags so this doesn't turn up in my queries in the mean time)

Attachment #755600 - Flags: approval-mozilla-beta+

Attachment #755600 - Flags: approval-mozilla-aurora+

Robert Kaiser

Comment 44

•

11 years ago

Ugh, wait, this has landed in 22 and 24 but was backed out from 23?

Ryan VanderMeulen [:RyanVM]

Comment 45

•

11 years ago

Correct

Lukas Blakk [:lsblakk] use ?needinfo

Comment 46

•

11 years ago

Can we get a Beta uplift nomination?

Flags: needinfo?(aklotz)

Lukas Blakk [:lsblakk] use ?needinfo

Comment 47

•

11 years ago

Discussed with Aaron in IRC and since bug 885095 has been filed to track the intermittent leak in mochitests that was uncovered by this bug fix we do not believe this bug to be the cause so must take it on branch and track investigation of the leak in bug 885095.

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 48

•

11 years ago

Comment on attachment 755600 [details] [diff] [review]
nsTimerEvent fix

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unknown, affected code showed revision 1 in hg, so it's pretty old. This bug relies on very subtle thread timings that have been occurring more often lately.
User impact if declined: Browser crashes, particularly on slower machines. Previously a #5 topcrasher on Firefox 21 release. Currently #19 on beta. Crash-stats comments suggest that this often happens while listening to music on pandora.com.
Testing completed (on m-c, etc.): Currently landed on nightly, aurora, and release, eliminating this crash on those channels.
Risk to taking this patch (and alternatives if risky): Low. There is a known intermittent leak being tracked in bug 885095 that occurs after landing this patch. As commented in that bug, this patch reveals those symptoms but is not actually the cause.
String or IDL/UUID changes made by this patch: None.

Attachment #755600 - Flags: approval-mozilla-beta?

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 49

•

11 years ago

Clearing needinfo.

Flags: needinfo?(aklotz)

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

11 years ago

Attachment #755600 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 50

•

11 years ago

https://hg.mozilla.org/releases/mozilla-beta/rev/b75aa2e0a5a8

Here goes...

status-firefox23: affected → fixed

u279076

Updated

•

11 years ago

Keywords: verifyme

Cornel Ionce [:noni] [Hubs QA]

Comment 51

•

11 years ago

Verified on Firefox 23 beta 5 (build ID: 20130711122148).
No crash occured in more than one hour using the attached testcases.

status-firefox23: fixed → verified

Keywords: verifyme

Cornel Ionce [:noni] [Hubs QA]

Updated

•

11 years ago

QA Contact: cornel.ionce

Cornel Ionce [:noni] [Hubs QA]

Comment 52

•

11 years ago

Using the attached testcases no crash occured for more than 90 minutes.
Verified on Firefox 24 beta 4 (build ID: 20130819170952)

status-firefox24: fixed → verified

test audio file for crash testcase. 12 years ago laceywww 7.01 KB, audio/ogg		Details
Crash testcase. loops audio file leadin to a crash. 12 years ago laceywww 582 bytes, text/html		Details
null checking for mEventTarget 12 years ago (No longer employed by Mozilla) Aaron Klotz 1.38 KB, patch	benjamin : review-	Details \| Diff \| Splinter Review
null checking for mEventTarget, v2 12 years ago (No longer employed by Mozilla) Aaron Klotz 1.50 KB, patch	benjamin : review+	Details \| Diff \| Splinter Review
nsTimerEvent fix 11 years ago (No longer employed by Mozilla) Aaron Klotz 2.46 KB, patch	benjamin : review+ ehsan.akhgari : review+ lsblakk : approval-mozilla-beta+ akeybl : approval-mozilla-release-	Details \| Diff \| Splinter Review