Closed Bug 694344 Opened 13 years ago Closed 12 years ago

crash WaitForSingleObjectEx with invalid parameter handler called from rand_s

Product/Component: Firefox :: General
Type: defect
Platform: x86, Windows 7
Priority: Not set
Severity: critical
Status: RESOLVED WORKSFORME
Tracking Status: firefox10 - ---
Reporter: marcia
Assigned: marcia
Keywords: crash

Crash Data

This bug was filed from the Socorro interface and is report bp-af2b41cf-b71c-4799-9977-980e52111013.
============================================================= 

This showed up in the explosive report - there have been a few spikes recently. https://crash-stats.mozilla.com/report/list?signature=WaitForSingleObjectEx%20|%20WaitForSingleObject%20|%20google_breakpad%3A%3AExceptionHandler%3A%3AWriteMinidumpOnHandlerThread%28_EXCEPTION_POINTERS*%2C%20MDRawAssertionInfo*%29

There were 220 crashes with the 2011101200 build, and there have been spikes of over 100 crashes each with the 2011092800 and 2011092900 builds.

Frame 	Module 	Signature 	Source
0 	ntdll.dll 	KiFastSystemCallRet 	
1 	ntdll.dll 	ZwWaitForSingleObject 	
2 	kernel32.dll 	WaitForSingleObjectEx 	
3 	kernel32.dll 	WaitForSingleObject 	
4 	xul.dll 	google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread 	toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc:764
5 	xul.dll 	google_breakpad::ExceptionHandler::HandleInvalidParameter 	toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc:619
6 	msvcr80.dll 	rand_s 	f:\dd\vctools\crt_bld\self_x86\crt\src\rand_s.c:86
7 	xul.dll 	`anonymous namespace'::RandUint32 	ipc/chromium/src/base/rand_util_win.cc:16
8 	xul.dll 	base::RandUint64 	ipc/chromium/src/base/rand_util_win.cc:25
9 	xul.dll 	base::RandInt 	ipc/chromium/src/base/rand_util.cc:20
10 	xul.dll 	ChildProcessInfo::GenerateRandomChannelID 	ipc/chromium/src/chrome/common/child_process_info.cc:58
11 	xul.dll 	ChildProcessHost::CreateChannel 	ipc/chromium/src/chrome/common/child_process_host.cc:78
12 	xul.dll 	mozilla::ipc::GeckoChildProcessHost::InitializeChannel 	ipc/glue/GeckoChildProcessHost.cpp:350
13 	xul.dll 	MessageLoop::RunTask 	ipc/chromium/src/base/message_loop.cc:318
14 	xul.dll 	MessageLoop::DeferOrRunPendingTask 	ipc/chromium/src/base/message_loop.cc:326
15 	xul.dll 	MessageLoop::DoWork 	ipc/chromium/src/base/message_loop.cc:426
16 	xul.dll 	base::MessagePumpForIO::DoRunLoop 	ipc/chromium/src/base/message_pump_win.cc:462
17 	xul.dll 	base::MessagePumpWin::RunWithDispatcher 	ipc/chromium/src/base/message_pump_win.cc:53
18 	xul.dll 	base::MessagePumpWin::Run 	ipc/chromium/src/base/message_pump_win.h:78
19 	xul.dll 	MessageLoop::RunHandler 	ipc/chromium/src/base/message_loop.cc:201
20 	xul.dll 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:175
21 	xul.dll 	base::Thread::ThreadMain 	ipc/chromium/src/base/thread.cc:156
22 	xul.dll 	`anonymous namespace'::ThreadFunc 	ipc/chromium/src/base/platform_thread_win.cc:26
23 	kernel32.dll 	BaseThreadStart
This is by far the #1 crash signature on trunk in the last few days.

Ted, I see breakpad in there, which makes me wonder whether another failure to process the stack correctly is involved here.
So the actual error site is rand_s, which should be the signature starting point: the CRT invokes our invalid-parameter callback, which accounts for the frames at the top of this stack, and those frames are ignorable. Can we get data on whether the entire spike is the same stack?

I must admit I don't see how we can possibly be *causing* this invalid parameter error: the callsite in question is http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/rand_util_win.cc#15 and we can't be passing NULL or anything like that.
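The idea of starting the signature at rand_s rather than at the handler frames can be sketched as a simple prefix skip list. This is only an illustration in C++ of the general technique, not Socorro's actual signature code; the prefix list and the names kSkipPrefixes, IsSkippable, and SignatureStart are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical skip list: frames that only describe the crash-reporting
// machinery and say nothing about where the fault happened.
static const std::vector<std::string> kSkipPrefixes = {
    "KiFastSystemCallRet",
    "ZwWaitForSingleObject",
    "WaitForSingleObject",  // also matches WaitForSingleObjectEx
    "google_breakpad::ExceptionHandler::",
};

// True if the frame name begins with any prefix on the skip list.
static bool IsSkippable(const std::string& frame) {
  for (const std::string& prefix : kSkipPrefixes) {
    if (frame.compare(0, prefix.size(), prefix) == 0) return true;
  }
  return false;
}

// Returns the first frame that is not on the skip list, or an empty
// string if every frame is skippable.
std::string SignatureStart(const std::vector<std::string>& frames) {
  for (const std::string& frame : frames) {
    if (!IsSkippable(frame)) return frame;
  }
  return std::string();
}
```

Fed frames 0 through 7 of the stack above, SignatureStart skips the wait and breakpad frames and returns "rand_s".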
Reading the VC8 source code to rand_s, I'm pretty sure we're hitting an invalid-parameter error where we can load advapi32.dll (it is loaded dynamically) but the following line fails:

pfnRtlGenRandom = ( PGENRANDOM ) GetProcAddress( hAdvApi32, _TO_STR( RtlGenRandom ) );

If this is the case, we're either hitting an odd Windows configuration or we might have problems with library loading. Does this signature perhaps coincide with bug 677797 (mandatory ASLR)? Does it happen only with certain versions/SP levels of Windows?
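To make the suspected failure path easier to follow, here is a rough, portable model of it in C++. It is not the CRT source: FakeLoader, GenRandomFn, InvalidParameterHandler, and RandSModel are all hypothetical stand-ins, and the error codes are assumptions chosen for illustration. The only behavior taken from the comment above is that a failed export lookup ends up in the invalid-parameter handler.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <functional>

// Stand-in for the resolved random-number export (hypothetical signature).
using GenRandomFn = std::function<bool(void* buffer, unsigned long length)>;

// Stand-in for the Windows loader, so the control flow can run anywhere.
struct FakeLoader {
  bool library_loads;     // models loading advapi32.dll succeeding
  GenRandomFn export_fn;  // models GetProcAddress; empty means "not found"
};

static int g_invalid_parameter_calls = 0;

// Models the CRT invalid-parameter callback. In the stack above, this is
// the point where breakpad's HandleInvalidParameter writes a minidump.
static void InvalidParameterHandler() { ++g_invalid_parameter_calls; }

// Simplified model of the path described in the comment: the caller passes
// a valid pointer, yet the handler still fires when the export is missing.
int RandSModel(const FakeLoader& loader, uint32_t* out) {
  if (out == nullptr) { InvalidParameterHandler(); return EINVAL; }
  *out = 0;
  if (!loader.library_loads) { InvalidParameterHandler(); return EINVAL; }
  if (!loader.export_fn) { InvalidParameterHandler(); return EINVAL; }
  // Assumption: a generator failure is reported as a plain error code.
  if (!loader.export_fn(out, sizeof(*out))) return ENOMEM;
  return 0;
}
```

In this model, a loader whose library_loads is true but whose export_fn is empty reproduces the hypothesis: the caller did nothing wrong, and the handler is invoked anyway.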
Looking at crash stats, it happens on XP across different service packs (SP2 and SP3). The same is true for Windows 7: some reports have SP1 and some do not.
This has definitely started happening more frequently since bug 677797 landed.

advapi32.dll should already be loaded when this code is run, so this should just be a failure in GetProcAddress, which _should_ be unaffected by the mandatory ASLR patch...
Blocks: 677797
I'm going to make this bug specific to rand_s and give it to Ehsan as the potential regressor.
Assignee: nobody → ehsan
Summary: crash WaitForSingleObjectEx → crash WaitForSingleObjectEx with invalid parameter handler called from rand_s
bug 695791 covers fixing the skiplist to get useful signatures out of these.
I think this should track/block Firefox 10 and bug 677797 should be backed out if we don't understand the issue.
It looks like this happens on some systems for the first call to rand_s (uptimes are low and we are creating a channel). advapi32.dll has the ASLR bit enabled though, so it doesn't look like the code in bug 677797 would directly play a role in its loading.

Certainly seems to be related to something that landed on the 11th though. Maybe the best next step would be to back out or disable bug 677797 to confirm it was the cause.
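For reference, the "ASLR bit" on a DLL is the IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE flag (0x0040) in the DllCharacteristics field of the PE optional header. A minimal C++ sketch of checking it on raw file bytes follows; the helper names ReadLE and HasAslrBit are hypothetical, and this is not the code from bug 677797.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE
constexpr uint16_t kDynamicBase = 0x0040;

// Reads an n-byte little-endian integer (n <= 4) at the given offset.
static uint32_t ReadLE(const std::vector<uint8_t>& d, size_t off, size_t n) {
  uint32_t v = 0;
  for (size_t i = 0; i < n; ++i) {
    v |= static_cast<uint32_t>(d[off + i]) << (8 * i);
  }
  return v;
}

// Returns true if the bytes look like a PE image whose DllCharacteristics
// field has the DYNAMIC_BASE bit set; false otherwise or on short input.
bool HasAslrBit(const std::vector<uint8_t>& image) {
  if (image.size() < 0x40 || image[0] != 'M' || image[1] != 'Z') return false;
  const size_t pe = ReadLE(image, 0x3C, 4);  // e_lfanew: offset of "PE\0\0"
  const size_t opt = pe + 4 + 20;            // skip signature and COFF header
  const size_t field = opt + 70;             // DllCharacteristics (PE32 and PE32+)
  if (field + 2 > image.size()) return false;
  if (image[pe] != 'P' || image[pe + 1] != 'E' ||
      image[pe + 2] != 0 || image[pe + 3] != 0) {
    return false;
  }
  return (ReadLE(image, field, 2) & kDynamicBase) != 0;
}
```

Reading the bytes of advapi32.dll from disk into the vector and calling HasAslrBit would confirm the observation above on a given machine.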
Is this also related?

875ecc34-c978-4208-96bc-1ccdf2111015

[@ WaitForMultipleObjectsEx | WaitForMultipleObjects | google_breakpad::CrashGenerationClient::SignalCrashEventAndWait() ]
(In reply to JK from comment #10)
> Is this also related?
> 
> 875ecc34-c978-4208-96bc-1ccdf2111015
> 
> [@ WaitForMultipleObjectsEx | WaitForMultipleObjects |
> google_breakpad::CrashGenerationClient::SignalCrashEventAndWait() ]

Doesn't look like it. It may have been caused by a third-party DLL, znsprnui.dll, which according to the internet:

znsprnui.dll is a ZNSPRNUI.DLL belonging to Zeon (Beijing) Corp. PDF Driver from Zeon Corp. Non-system processes like znsprnui.dll originate from software you installed on your system.
(In reply to Jim Mathies [:jimm] from comment #9)
> It looks like this happens on some systems for the first call to rand_s
> (uptimes are low and we are creating a channel). advapi32.dll has the ASLR
> bit enabled though so it doesn't look like the code in bug 677797 would
> directly play a role in its loading.
> 
> Certainly seems to be related to something that landed on the 11th though.
> Maybe the best next step would be to backout or disable bug 677797 to
> confirm it was the cause.

I can do that if you want me to.
(In reply to Ehsan Akhgari [:ehsan] from comment #12)
> I can do that if you want me to.

I won't be able to look into this further until later in the week, so if we want to run this as an experiment in one nightly we might as well do it. Maybe we get lucky and find out it's not the cause.
(In reply to Jim Mathies [:jimm] from comment #13)
> I won't be able to look into this further until later in the week, so if we
> want to run this as an experiment in one nightly we might as well do it.
> Maybe we get lucky and find out it's not the cause.

Backed out.  Tomorrow's nightly should not have mandatory ASLR any more.
Did this fix the issue?
Marcia, can you verify that this spike went away? There may be other WaitForSingleObjectEx crashes, but this bug was specifically about the spike from the ASLR patch.
Assignee: ehsan → mozillamarcia.knous
Things don't seem to be quite as explosive as they were in October. Here are some crash counts from recent build IDs:

2011111000 	1  (Trunk)
2011110900 	38 (Firefox Beta)
2011110800 	6
2011110700 	2
2011110400 	15 (Firefox 8)
2011110300 	22
2011110200 	17
[Triage Comment]
Given that this is no longer explosive, and this should have made the Aurora cutover, minusing tracking-firefox10.
Marcia, I'm more asking whether this particular version of the crash (from rand_s) is completely gone, in which case this bug can be marked FIXED.
Marcia: ping?
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME
See Also: → 951827
See Also: → 1167248