Closed Bug 709209 Opened 13 years ago Closed 12 years ago

Fix browser-side hang signatures for 11.0a1 trunk

Categories

(Socorro :: Infra, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: kairo, Assigned: lars)

References

Details

All browser signatures show up as https://crash-stats.mozilla.com/report/list?signature=hang%20|%20WaitForSingleObjectEx%20|%20WaitForSingleObject%20|%20google_breakpad%3A%3AExceptionHandler%3A%3AWriteMinidumpOnHandlerThread%28_EXCEPTION_POINTERS*%2C%20MDRawAssertionInfo*%29 since yesterday, but they should show us what's below breakpad in the stack.

I don't really have an idea what to look for there in the signature generation, I hope ted or bsmedberg can shed some light on where we need to start the stack we actually want.

This has changed due to a code change between 2011-12-07 and -08, as all other branches show the correct signatures, and hangs from the 7th also do (and Socorro hasn't changed).
As I mentioned on IRC, this was originally implemented in bug 559174 and was "refixed" once in bug 630233. The spec was contained in those bugs, and you should check whether these new reports don't meet the original spec or whether something else is happening.
Laura, can this be assigned? It's a very high priority since the hang report signatures are mostly useless right now.
The SignatureSentinel rule referenced in bug 559174 and bug 630233 works like this:

'mozilla::ipc::RPCChannel::Call(IPC::Message*, IPC::Message*)' is a 
     signature sentinel iff 'CrashReporter::CreatePairedMinidumps(void*, 
     unsigned long, nsAString_internal*, nsILocalFile**, nsILocalFile**)'
     is found somewhere in the stack frame.

the signature 'mozilla::ipc::RPCChannel::Call(IPC::Message*, IPC::Message*)' does not appear in the crashing thread stack frames for any of these crashes, so the rule does not apply.  

Please look through the example crashes and tell me what you want the signature to be.  I will devise an additional rule to make it happen.
Assignee: nobody → lars
Crap. You see from https://crash-stats.mozilla.com/report/index/2ec0ccb1-449d-4ec0-b220-3a5012111218 that where the ::Call frame should be breakpad isn't walking the stack very well and reports nspr4.dll@0x26bf

Is this then a breakpad or symbol issue?
Ted will have to weigh in on whether it is a breakpad or symbol issue.  

I was about to volunteer to make a new rule that used 'nspr4.dll@0x26bf' as the sentinel, but surveying the crashes, I can see that it is not consistent.  I'm seeing at least four different hex offsets: 0x26bf, 0x265f, 0x270f, 0x26df.
Component: Socorro → General
Product: Webtools → Socorro
Blocks: 676458
Component: General → Infra
We also have a lot of https://crash-stats.mozilla.com/report/list?signature=hang%20%7C%20_SEH_epilog4 - I guess that's the same problem.
Yeah, it's the same core problem, we aren't stackwalking correctly right at the ::Call symbol. The top of the stack is going to be pretty random.
I pulled down the minidump for https://crash-stats.mozilla.com/report/index/34ec04ea-75e3-4aaf-927c-d62812120129

And MSVC gives the following stack:
>	xul.dll!google_breakpad::ExceptionHandler::WriteMinidump(dump_path=<Bad Ptr>, write_exception_stream=true, callback=0x00000000, callback_context=0x00000000)  Line 797	C++
 	ntdll.dll!_NtWaitForSingleObject@12() 	
 	KERNELBASE.dll!_WaitForSingleObjectEx@12() 	
 	kernel32.dll!_WaitForSingleObjectExImplementation@12() 	
 	kernel32.dll!_WaitForSingleObject@8() 	
 	xul.dll!google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread(exinfo=0x0024c1c8, assertion=0x00000000)  Line 764	C++
 	xul.dll!google_breakpad::ExceptionHandler::WriteMinidumpForException(exinfo=0x0024c1c8)  Line 780	C++
 	xul.dll!google_breakpad::ExceptionHandler::WriteMinidump(dump_path=<Bad Ptr>, write_exception_stream=true, callback=0x555d5487, callback_context=0x0024c5dc)  Line 821	C++
 	xul.dll!CrashReporter::CreatePairedMinidumps(childPid=0x000007d0, childBlamedThread=5696, pairGUID=0x16a76844, childDump=0x0024c6a0, parentDump=0x0024c69c)  Line 2209	C++
 	xul.dll!mozilla::dom::CrashReporterParent::GeneratePairedMinidump<mozilla::plugins::PluginModuleParent>(t=0x00000000)  Line 159	C++
 	xul.dll!mozilla::plugins::PluginModuleParent::ShouldContinueFromReplyTimeout()  Line 225	C++
 	xul.dll!mozilla::ipc::SyncChannel::ShouldContinueFromTimeout()  Line 264	C++
 	xul.dll!mozilla::ipc::RPCChannel::Call(_msg=0x011abe40, reply=0x0024c870)  Line 211	C++
 	xul.dll!mozilla::plugins::PPluginScriptableObjectParent::CallHasProperty(aId=0x153c5600, aHasProperty=0x0024c8bb)  Line 288	C++
 	xul.dll!mozilla::plugins::PluginScriptableObjectParent::ScriptableHasProperty(aObject=0x1695a3c0, aName=0x06b9c5e0)  Line 291	C++
 	xul.dll!NPObjWrapper_NewResolve(cx=0x0efd3ee0, obj=0x182609a0, id=112838112, flags=1, objp=0x0024c934)  Line 1651	C++
 	mozjs.dll!CallResolveOp(cx=0x0efd3ee0, start=0x182c7700, obj=0x00000000, id=112838112, flags=65535, objp=0x0024c95c, propp=0x0024c960, recursedp=0x0024c980)  Line 5488	C++
 	mozjs.dll!js_GetPropertyHelper(cx=0x0000ffff, obj=0x182c7700, id=112837888, getHow=1, vp=0x0024c9ac)  Line 5999	C++
 	mozjs.dll!js::mjit::stubs::GetProp(f={...})  Line 1508	C++
 	mozjs.dll!js::mjit::EnterMethodJIT(cx=0x0efd3ee0, fp=0x05980060, code=0x03e0ceac, stackLimit=0x059a0000, partial=false)  Line 1064	C++

So it appears that the PDB at least has enough information to process this fully. I'm going to poke around with processing the .dmp directly next.
Blocks: 722394
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
>  	xul.dll!mozilla::ipc::RPCChannel::Call(_msg=0x011abe40, reply=0x0024c870) 
> Line 211	C++

Interesting, Socorro has this instead:

nspr4.dll 	nspr4.dll@0x26cf


This is even a different DLL. Is that expected? I *guess* this frame is pretty much where the sentinel should match - or somewhere near there.
There is nothing even remotely interesting about the regression range: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=658fad825c36&tochange=6785d3003414

No build-config changes, no significant makefile changes. The only thing we're doing is maybe adding a bit more code which might cause PGO to reorder something. But even then there's nothing we'd back out to fix this.

I'm debugging minidump-stackwalk further to figure out why the frame pointer isn't what it expects.
We're hitting a bug in the MSVC2005 linker, mentioned here: http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/stackwalker_x86.cc#254

In this case because we don't have the correct frame information, we're using ScanForReturnAddress. The stack memory is:

0x0024C7B4  1e1abe40 1b3c2588 677a26d0 555fa609

We start our scan at the "1b3c..." value, which is definitely incorrect. We then hit the "677a..." value, which is incorrect but seemingly reasonable (it's in a non-code section of NSPR). So we don't both to keep scanning for the correct "555fa..." value.

I don't know how much we can hack around this in the processor. We could try to prefer a close definitely-correct value over a possibly-correct value, but that might introduce other unusual behaviors. This is heuristically gross.
We're going to switch compilers tomorrow, so this might not be worth worrying about.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #13)
> We're going to switch compilers tomorrow, so this might not be worth
> worrying about.

I disagree there. We are not switching for 11 and 12, and we need reasonable signatures for those.
For 13, we actually might not switch right away tomorrow, but we can live another week or two on trunk with broken hang signatures if we know they'll be fixed for _the majority_ of the trunk cycle.
Still, that doesn't help for 12 and even more 11, for which we continue to miss reasonable signatures, and where we seem to have a really serious problem with bug 722394.

In short, we really need a fix for this, and we need it "yesterday on Aurora" and not "tomorrow on trunk".
If anything, this appears to have gotten worse with the switch to VC2010: https://crash-stats.mozilla.com/report/index/492d83a5-dbe8-4c81-b6e7-5c4342120203
Fixed by bug 726570
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Actually, it looks like this is not fixed for 11 and 12, which are still on MSVC2005 - it looks fixed on 13 with MSVC2010 though.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
bsmedberg, so from what you said on IRC, this should be considered WONTFIX then and we'll just have to live with having no reasonable browser-side hang signatures for 11 and 12?
Well, I don't know of any way we *can* fix this without completely disabling PGO/LTCG, which I don't think is an option we'd consider. So CANTFIX or INCOMPLETE, I guess.
So, all you are saying is that this is a bug but we won't fix it (because we don't know how and it doesn't make too much sense right now). Marking as such.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.