Closed Bug 739142 Opened 12 years ago Closed 12 years ago

crash in nsTimerImpl::PostTimerEvent

Categories

(Core :: Networking: HTTP, defect)

14 Branch
All
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla14
Tracking Status
firefox14 + fixed

People

(Reporter: scoobidiver, Unassigned)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

There's a spike in crashes from 14.0a1/20120324.
The regression range for the spike is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ab2ff3b5611f&tochange=df1f94b2bdee
It might be a regression from bug 738334.

Signature 	nsTimerImpl::PostTimerEvent() More Reports Search
UUID	fc9cad76-9dc4-4873-8d98-a17392120324
Date Processed	2012-03-24 16:53:20
Uptime	86
Last Crash	1.5 minutes before submission
Install Age	2.1 hours since version was first installed.
Install Time	2012-03-24 14:47:50
Product	Firefox
Version	14.0a1
Build ID	20120324031100
Release Channel	nightly
OS	Windows NT
OS Version	6.1.7601 Service Pack 1
Build Architecture	x86
Build Architecture Info	AuthenticAMD family 15 model 36 stepping 2
Crash Reason	EXCEPTION_ACCESS_VIOLATION_READ
Crash Address	0x1000020
User Comments	it appears going onto an https site silently crashes firefox
App Notes 	
Cisco VPN
AdapterVendorID: 0x1002, AdapterDeviceID: 0x5955, AdapterSubsysID: 10921734, AdapterDriverVersion: 8.561.0.0
D3D10 Layers? D3D10 Layers- D3D9 Layers? D3D9 Layers- 
EMCheckCompatibility	True	
Total Virtual Memory	2147352576
Available Virtual Memory	1911676928
System Memory Use Percentage	83
Available Page File	759312384
Available Physical Memory	163864576

Frame 	Module 	Signature [Expand] 	Source
0 	xul.dll 	nsTimerImpl::PostTimerEvent 	xpcom/threads/nsTimerImpl.cpp:605
1 	xul.dll 	TimerThread::Run 	xpcom/threads/TimerThread.cpp:319
2 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:657
3 	xul.dll 	nsThread::ThreadFunc 	xpcom/threads/nsThread.cpp:289
4 	nspr4.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:426
5 	nspr4.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:122
6 	msvcr100.dll 	_callthreadstartex 	f:\\dd\\vctools\\crt_bld\\self_x86\\crt\\src\\threadex.c:314
7 	msvcr100.dll 	_threadstartex 	f:\\dd\\vctools\\crt_bld\\self_x86\\crt\\src\\threadex.c:292
8 	kernel32.dll 	BaseThreadInitThunk 	
9 	ntdll.dll 	__RtlUserThreadStart 	
10 	ntdll.dll 	_RtlUserThreadStart 	

More reports at:
https://crash-stats.mozilla.com/report/list?signature=nsTimerImpl%3A%3APostTimerEvent%28%29
There's no web worker running here as far as I can tell, so this probably isn't a regression from 738334.
With combined signatures, it's #8 top crasher in 14.0a1 over the last day.
Crash Signature: [@ nsTimerImpl::PostTimerEvent()] → [@ nsTimerImpl::PostTimerEvent()] [@ RtlEnterCriticalSection | PR_Lock | nsTimerImpl::PostTimerEvent()]
Keywords: topcrash
Some disasembly:

1004E290  mov         eax,dword ptr [nsTimerEvent::sAllocator (10DEEC60h)]
1004E295  push        ebx
1004E296  mov         ebx,dword ptr [esp+8]  
1004E29A  push        ebp  
1004E29B  push        esi  
1004E29C  push        edi  
1004E29D  lea         edi,[eax+20h]  
1004E2A0  mov         esi,eax  
1004E2A2  mov         eax,dword ptr [edi]  
1004E2A4  push        eax  
1004E2A5  call        dword ptr [__imp__PR_Lock (109B9A40h)]

Crash is @ 0x1004E2A2. 
EAX = 0x00000000
EDI = 0x00000020
ESI = 0x00000000
ESP = 0x07a2f6a8
(In reply to ben turner [:bent] from comment #4)
> Perhaps related to bug 733277?
It's not in the regression range.
It messes with timer locks and allocators, which is exactly where this crash is.
From comment 3, I bet money that what happens is that sAllocator is null here.  Investigating.
Assignee: nobody → ehsan
Should this be Trunk?

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:14.0) Gecko/20120403 Firefox/14.0a1 ID:20120403063158 just crashed like bp-449f0d79-66c2-4332-8dd3-610292120403 on a random Google search results page.
OK, this is really weird.  I grabbed the minidumps for two crashes from two different builds.  They both get blamed for the nsTimerImpl::PostTimerEvent thread, but the crash is happening on another thread in fact!  Here's the stack from the minidump corresponding to https://crash-stats.mozilla.com/report/index/fc9cad76-9dc4-4873-8d98-a17392120324:

 	ntdll.dll!_KiFastSystemCallRet@0() 	
 	ntdll.dll!_ZwWaitForSingleObject@12()  + 0xc bytes	
 	ntdll.dll!_RtlpWaitOnCriticalSection@8()  + 0xc4 bytes	
 	ntdll.dll!_RtlEnterCriticalSection@4()  - 0x155c4 bytes	
>	xul.dll!google_breakpad::AutoExceptionHandler::AutoExceptionHandler()  Line 461	C++
 	xul.dll!google_breakpad::ExceptionHandler::HandleException(_EXCEPTION_POINTERS * exinfo)  Line 500	C++
 	msvcr71.dll!__CxxUnhandledExceptionFilter(_EXCEPTION_POINTERS * pPtrs)  + 0x46 bytes	C++
 	kernel32.dll!_UnhandledExceptionFilter@4()  + 0xf5 bytes	
 	ntdll.dll!___RtlUserThreadStart@8()  + 0x147ae bytes	
 	ntdll.dll!@_EH4_CallFilterFunc@8()  + 0x12 bytes	
 	ntdll.dll!ExecuteHandler2@20()  + 0x26 bytes	
 	ntdll.dll!ExecuteHandler@20()  + 0x24 bytes	
 	ntdll.dll!_RtlDispatchException@8()  + 0xd3 bytes	
 	ntdll.dll!_KiUserExceptionDispatcher@8()  + 0xf bytes	
 	msvcr100.dll!memmove(unsigned char * dst, unsigned char * src, unsigned long count)  Line 185	Asm
 	xul.dll!nsTArray_base<nsTArrayDefaultAllocator>::ShiftData(unsigned int start, unsigned int oldLen, unsigned int newLen, unsigned int elemSize, unsigned int elemAlign)  Line 277 + 0xd bytes	C++
 	xul.dll!nsHttpPipeline::WriteSegments(nsAHttpSegmentWriter * writer, unsigned int count, unsigned int * countWritten)  Line 775	C++
 	xul.dll!nsHttpConnection::OnSocketReadable()  Line 1369	C++
 	xul.dll!nsHttpConnection::OnInputStreamReady(nsIAsyncInputStream * in)  Line 1489	C++
 	xul.dll!nsSocketInputStream::OnSocketReady(unsigned int condition)  Line 257	C++
 	xul.dll!nsSocketTransport::OnSocketReady(PRFileDesc * fd, short outFlags)  Line 1578	C++
 	xul.dll!nsSocketTransportService::DoPollIteration(bool wait)  Line 757	C++
 	xul.dll!nsSocketTransportService::Run()  Line 648	C++
 	xul.dll!nsThread::ProcessNextEvent(bool mayWait, bool * result)  Line 658	C++
 	xul.dll!nsThread::ThreadFunc(void * arg)  Line 289 + 0xf bytes	C++
 	nspr4.dll!_PR_NativeRunThread(void * arg)  Line 448	C
 	nspr4.dll!pr_root(void * arg)  Line 122 + 0xd bytes	C
 	msvcr100.dll!_callthreadstartex()  Line 314 + 0x6 bytes	C
 	msvcr100.dll!_threadstartex(void * ptd)  Line 292 + 0x5 bytes	C
 	kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes	
 	ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes	
 	ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes	

In both cases, these threads were listed right before the timer thread.  These threads don't appear on the breakpad site at all.  Looking at exinfo->ExceptionRecord in ExceptionHandler::HandleException, this is indeed an access violation, and the ExceptionAddress field points to the memmove function on the stack.

I don't know why Breakpad blames this on the wrong thread.  Should this be moved to Necko, and another bug be filed to investigate this on the breakpad side?
Please, fix this quickly. I like the changes on Firefox 14 but it crash per minute and is very inestable
This version is super faster and the loading times are amazing (at least for me). Do you have introduced some changes to Gecko?
When you load the minidump into MSVC does it get the correct crashing thread?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #12)
> When you load the minidump into MSVC does it get the correct crashing thread?

No, but I think it just grabs the crashing thread from the minidump, which is written by breakpad.
Yunier, can you reproduce this crash reliably on Nightly?
The minidump is not written by breakpad, it's written by the Windows debughlp function MinidumpWriteDump. We do pass in the exception info from the crash site which ends up here: http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc#977
Hmm, OK.  Is the crash stats report also generated from the minidump?  The weird thing is that the crash report doesn't have any mention of the thread running nsHttpPipeline::WriteSegments.

Here's another question: is it possible that a crash on a thread is handled on a different thread by breakpad?
The crash report is generated from the minidump using minidump_stackwalk (in the Breakpad tree). There are a few tricky bits:
1) Breakpad uses a background thread that does the actual dump writing. It saves the ID of that thread and tries to ignore it in the output. Perhaps that's what's failing htere?
2) We use SetUnhandledExceptionFilter to catch crashes, so whatever the behavior of that is. As bsmedberg says, we simply pass on the EXCEPTION_POINTERS that we get from that callback to MinidumpWriteDump.
(In reply to Ted Mielczarek [:ted] from comment #17)
> The crash report is generated from the minidump using minidump_stackwalk (in
> the Breakpad tree). There are a few tricky bits:
> 1) Breakpad uses a background thread that does the actual dump writing. It
> saves the ID of that thread and tries to ignore it in the output. Perhaps
> that's what's failing htere?
> 2) We use SetUnhandledExceptionFilter to catch crashes, so whatever the
> behavior of that is. As bsmedberg says, we simply pass on the
> EXCEPTION_POINTERS that we get from that callback to MinidumpWriteDump.

SetUnhandledExceptionFilter causes the exception handler to be called on the thread raising the exception, always.

As I said, when I was examining the exception being handled in the debugger, it was an access violation caused by the memmove call on the thread which ended up getting removed from the crash report, which is why I'm pretty sure that this crash is being misblamed at nsTimerImpl::PostTimerEvent for some reason...
I discussed this with Ted on IRC.  One theory is that the memmove call is writing over the address holding sAllocator, which causes the timer thread to crash while that memove is happening on another thread (and then for memmove to crash as well).

I'm pretty sure that this crash is also hiding under other seemingly unrelated signatures as well.  Moving over to Necko.
Assignee: ehsan → nobody
Component: XPCOM → Networking: HTTP
QA Contact: xpcom → networking.http
I recognize the stack trace in comment 9 as one from bug 741056. The patch for 741056 went to inbound a few hours ago.
bug 738812 also has an odd timer thing going on.
(In reply to Patrick McManus [:mcmanus] from comment #20)
> I recognize the stack trace in comment 9 as one from bug 741056. The patch
> for 741056 went to inbound a few hours ago.

That is very plausible!
Depends on: 741056
(In reply to Patrick McManus [:mcmanus] from comment #21)
> bug 738812 also has an odd timer thing going on.

That doesn't look relevant to me.  The timer code run there runs on the main thread, not on the timer thread.
Ehsan, can you point me to any info on how to do the magic you talk about in comment 9? "
I grabbed the minidumps for two crashes from two different builds.  They both get blamed for the nsTimerImpl::PostTimerEvent thread, but the crash is happening on another thread in fact! "
(In reply to Patrick McManus [:mcmanus] from comment #24)
> Ehsan, can you point me to any info on how to do the magic you talk about in
> comment 9? "
> I grabbed the minidumps for two crashes from two different builds.  They
> both get blamed for the nsTimerImpl::PostTimerEvent thread, but the crash is
> happening on another thread in fact! "

You need special privileges to download minidumps from crash-stats (or know someone who does!) because they can contain user's private data on the stack.  I'll forward the minidump for you using email.  :-)
(In reply to Ehsan Akhgari [:ehsan] from comment #25)

> You need special privileges to download minidumps from crash-stats (or know
> someone who does!) because they can contain user's private data on the
> stack.  I'll forward the minidump for you using email.  :-)

ah. I'm actually curious if any of the crash data for 740319 shows the same thing..
I don't have access myself.  Ping joe (or #developers).
Interesting crash, should be tracked by bug 739762 too.

(In reply to Yunier José Sosa Vázquez from comment #11)
> This version is super faster and the loading times are amazing (at least for
> me). Do you have introduced some changes to Gecko?

Pipelining patches has landed.
Blocks: 739762
(In reply to Ehsan Akhgari [:ehsan] from comment #14)
> Yunier, can you reproduce this crash reliably on Nightly?

When I open Nightly and start to browser this crash per minute. In my PC Nightly during opened more or less one minute only, after this crash. Some times the Windows Error it sohw me, anothers times the Mozilla Crash Reporter informs me about this.
(In reply to Honza Bambas (:mayhemer) from comment #28)
> Interesting crash, should be tracked by bug 739762 too.
> 
> (In reply to Yunier José Sosa Vázquez from comment #11)
> > This version is super faster and the loading times are amazing (at least for
> > me). Do you have introduced some changes to Gecko?
> 
> Pipelining patches has landed.

Great !!! this feature many users will note. I have commented with my friends this and have said the same
FYI, if having access to minidumps would be useful, you can file a bug and get that access. It's just an LDAP bit.
Yunier, it would be great if you can test this again in tonight's Nightly and let us know if the crash has been fixed.  Thanks!
(In reply to Ehsan Akhgari [:ehsan] from comment #32)
> Yunier, it would be great if you can test this again in tonight's Nightly
> and let us know if the crash has been fixed.  Thanks!

Ehsan the problem has been fixed in the last Nightly (2012-04-05) and all work perfect. Congratulations for you work for fix this bug. I see a great interest by the developers to know more about the bug and resolve it quickly
Well, Patrick deserves all of the credit here!  :-)

Thanks for verifying this, Yunier!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to Ehsan Akhgari [:ehsan] from comment #34)
> Well, Patrick deserves all of the credit here!  :-)
> 
> Thanks for verifying this, Yunier!

Well, congratulations to Patrick too :-). The people of Mozilla is great.
Target Milestone: --- → mozilla14
(In reply to Swarnava Sengupta (:Swarnava) from comment #36)
> also there are 28% crash in 18.0.1
With so few crashes, it isn't worth a new bug.
the low volume of leftovers could be a dup of bug 792920
You need to log in before you can comment on or make changes to this bug.