Closed Bug 631036 Opened 14 years ago Closed 14 years ago

Update irrelevantSignatureRegEx filter to better break down hang detector stacks

Categories

(Socorro :: General, task)

x86
Windows Vista
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jimm, Assigned: rhelmer)

References

()

Details

Attachments

(3 files, 1 obsolete file)

Spun off from bug 630230. 'KiFastSystemCallRet' lands on the top of the signature for hang detector hangs we record in crash-stats. This makes breaking these stacks down based on different causes really hard. As a result we've not spent much time digging into these, and vendors like Adobe have largely passed these hangs over. For example, we currently have 7724 flash hangs lumped together as the #1 top crasher in beta 10 for the last week. There are hangs in here that are minor and there are hangs in here that occur regularly. What we need to do is separate these out and group them together, so we can try and fix the worst offenders. A typical example: https://crash-stats.mozilla.com/report/index/9c1f63d7-ab81-4025-bbe4-5e4132110201 0 ntdll.dll KiFastSystemCallRet 1 user32.dll NtUserSetFocus 2 user32.dll MB_DlgProc 3 user32.dll InternalCallWinProc preferred - hang | NtUserSetFocus https://crash-stats.mozilla.com/report/index/692a6f94-0359-4515-a383-4c39a2110201 0 ntdll.dll KiFastSystemCallRet 1 user32.dll NtUserWaitMessage 2 user32.dll InternalDialogBox preferred - hang | InternalDialogBox 0 ntdll.dll KiFastSystemCallRet 1 ntdll.dll NtWaitForSingleObject 2 ntdll.dll RtlpWaitOnCriticalSection 3 ntdll.dll RtlpDeCommitFreeBlock 4 NPSWF32.dll F_1706698279__________________________________ preferred - hang | F_1706698279__________________________________ ntdll.dll KiFastSystemCallRet 1 user32.dll NtUserCallHwndLock 2 xul.dll mozilla::plugins::PluginInstanceChild::AnswerUpdateWindow preferred - hang | AnswerUpdateWindow Working up a list of these common frames should be easy. The list would hold most of the common win api calls currently in prefixSignatureRegEx (bug 630230). The key here is that we should only ignore these calls at the top of the stack, once something unrecognized shows up, we start recording the signature. Also, when listing the stack in the crash report page, the entire stack should be provided. These apis should be hidden from developers looking at the crash.
* 'should be hidden' -> '*shouldn't* be hidden'
Is there any way to set something like this up and run some staged runs to see what the results look like? I can work up an initial list of calls to ignore.
Attached file hang list (obsolete) —
This is based on a survey of the first 60 hangs under the current hang detector crash list.
Attached file updated list
Attachment #509283 - Attachment is obsolete: true
here is what a sample of 100 different KiFastSystemCallRet stacks looks like. after the filtering we should get a similar redistribution of stacks, but with new signatures.
I really don't think we want to hide most of these frames, just append them as we're doing in bug 630230. So the signature would be "hang | KiFastSystemCallRet | NtUserSetFocus" which more accurately identifies that NtUserSetFocus is blocked on a system call. You still get the unique signatures. Note that this still won't help a whole lot with hang analysis, because what really matters is the combination of the parent and child stacks. That's what the hang report used to do before it broke with the socorro upgrade, and something we should probably do again as a map/reduce job of some sort.
Severity: normal → critical
What is the specific request being made here? It's kinda trivial to make a patch once we have a list of frames which should be ignored or appended.
Assignee: nobody → lars
Assignee: lars → rhelmer
I think we should skip KiFastSystemCallRet, it doesn't add any value. The next function on the stack is the actual API you're calling, knowing that you're in a syscall doesn't help that much, I don't think (plus you can see it by looking at the actual report if you care anyway).
Is the list in attachment 509284 [details] ready to go, or do we need more discussion? I am standing by to review, help test in stage, and get this into production; just let me know.
No, we definitely do not want to add everything in the attachment to the skip list. Many of those were already added to the append list, although I'm going through the data from today and might move a few of those from the skiplist to the append list.
Argh, I meant from the append list to the skip list.
(In reply to comment #11) > No, we definitely do not want to add everything in the attachment to the skip > list. Many of those were already added to the append list, although I'm going > through the data from today and might move a few of those from the skiplist to > the append list. Skipping that entire list would get us down to the real call we're interested in. My understanding of the append list is a bit hazy, but it seems we should move most of the append list to the skip list. What we want to avoid is multiple hangs grouped under these generic calls on crash-stats.
We really do need to know what's at the top. The whole point of the append list is to let you do that. See crashes from what was deployed yesterday, e.g. https://crash-stats.mozilla.com/report/index/00560eab-4c5e-458e-b646-456282110203 which has a signature of hang | KiFastSystemCallRet | NtWaitForMultipleObjects | CreateFileMappingA Now it's probably safe to remove KiFastSystemCallRet, but knowing that we're stopped in NtWaitForMultipleObjects is IMO important. We should put as few things as we can in the skiplist, and as much as possible in the appendlist.
(In reply to comment #14) > We really do need to know what's at the top. The whole point of the append list > is to let you do that. See crashes from what was deployed yesterday, e.g. > > https://crash-stats.mozilla.com/report/index/00560eab-4c5e-458e-b646-456282110203 > > which has a signature of > > hang | KiFastSystemCallRet | NtWaitForMultipleObjects | CreateFileMappingA Ok, this sounds great. On this crash signature, would this be grouped / ranked on crash-stats as a crash in 'hang | KiFastSystemCallRet', 'hang | KiFastSystemCallRet | NtWaitForMultipleObjects | CreateFileMappingA' or 'hang | CreateFileMappingA'? ('hang | KiFastSystemCallRet' is what we want to avoid.)
(In reply to comment #15) > (In reply to comment #14) > > We really do need to know what's at the top. The whole point of the append list > > is to let you do that. See crashes from what was deployed yesterday, e.g. > > > > https://crash-stats.mozilla.com/report/index/00560eab-4c5e-458e-b646-456282110203 > > > > which has a signature of > > > > hang | KiFastSystemCallRet | NtWaitForMultipleObjects | CreateFileMappingA > > Ok, this sounds great. > > On this crash signature, would this be grouped / ranked on crash-stats as a > crash in 'hang | KiFastSystemCallRet', 'hang | KiFastSystemCallRet | > NtWaitForMultipleObjects | CreateFileMappingA' or 'hang | CreateFileMappingA'? > > ('hang | KiFastSystemCallRet' is what we want to avoid.) Ah, nm, I'm seeing these broken down now on crash stats since the update.
I think we need to tweek these settings though, for example: [@ hang | KiFastSystemCallRet | ZwWaitForSingleObject | WaitForSingleObjectEx | WaitForSingleObject | google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread(_EXCEPTION_POINTERS*, MDRawAssertionInfo*) ] the stack is: 0 ntdll.dll KiFastSystemCallRet 1 ntdll.dll ZwWaitForSingleObject 2 kernel32.dll WaitForSingleObjectEx 3 kernel32.dll WaitForSingleObject 4 xul.dll google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc:754 5 xul.dll google_breakpad::ExceptionHandler::WriteMinidumpForException toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc:769 6 xul.dll google_breakpad::ExceptionHandler::WriteMinidump toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc:811 7 xul.dll CrashReporter::CreatePairedMinidumps toolkit/crashreporter/nsExceptionHandler.cpp:2085 8 xul.dll mozilla::plugins::PluginModuleParent::ShouldContinueFromReplyTimeout dom/plugins/PluginModuleParent.cpp:254 9 xul.dll mozilla::ipc::SyncChannel::ShouldContinueFromTimeout ipc/glue/SyncChannel.cpp:261 10 xul.dll mozilla::ipc::RPCChannel::Call ipc/glue/RPCChannel.cpp:210 11 xul.dll mozilla::plugins::PPluginInstanceParent::CallUpdateWindow obj-firefox/ipc/ipdl/PPluginInstanceParent.cpp:793 It would be great if we could skip some of this, like the crash reporter, and get CallUpdateWindow in the signature.
On the flip side, this is great to see: #27 hang | KiFastSystemCallRet | NtUserWaitMessage | InternalDialogBox 0 ntdll.dll KiFastSystemCallRet 1 user32.dll NtUserWaitMessage 2 user32.dll InternalDialogBox 3 user32.dll SoftModalMessageBox 4 user32.dll MessageBoxWorker 5 user32.dll MessageBoxTimeoutW 6 user32.dll MessageBoxExW 7 user32.dll MessageBoxW 8 NPSWF32.dll F840896513__________________________________________________ We had some discussions about the use of message box this week but we had little data to show.
The parent-side hang signature issue is a regression, already covered by bug 630233.
This is the patch I think we want.
Attachment #509810 - Flags: review?(jmathies)
Comment on attachment 509810 [details] [diff] [review] Hide KiFastSystemCallRet and a few others, rev. 1 This look good to me. I think there are few additional apis we'll want in the append list but lets see how things look after we run with this config for a while. Robert is there any chance we can get this landed today so we'll have some results by monday? Also, is this code in hg?
Attachment #509810 - Flags: review?(jmathies) → review+
This is socorro SVN.
It's config, so we can get it into prod anytime. Rob, can you take care of this?
Just landed attachment 509810 [details] [diff] [review] in trunk: Committed revision 2913. This will go live on staging in the next 15 minutes. If I pull crashes out of production (e.g. the crash IDs in comment 0) and resubmit them to staging, can someone look on stage to tell if this is working (or tell me what to look for)? I am pretty new to this part of Socorro :)
Status: NEW → ASSIGNED
(In reply to comment #24) > Just landed attachment 509810 [details] [diff] [review] in trunk: > Committed revision 2913. > > This will go live on staging in the next 15 minutes. > > If I pull crashes out of production (e.g. the crash IDs in comment 0) and > resubmit them to staging, can someone look on stage to tell if this is working > (or tell me what to look for)? I am pretty new to this part of Socorro :) Sure. For a set of crashes, I'd just pull the top twenty or so from the current 'hang | KiFastSystemCallRet' list for beta 10: https://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A4.0b10&platform=windows&query_search=signature&query_type=exact&query=&date=02%2F04%2F2011%2013%3A47%3A10&range_value=1&range_unit=weeks&hang_type=hang&process_type=plugin&plugin_field=filename&plugin_query_type=exact&plugin_query=&do_query=1&admin=&signature=hang%20|%20KiFastSystemCallRet
(In reply to comment #25) > (In reply to comment #24) > > Just landed attachment 509810 [details] [diff] [review] in trunk: > > Committed revision 2913. > > > > This will go live on staging in the next 15 minutes. > > > > If I pull crashes out of production (e.g. the crash IDs in comment 0) and > > resubmit them to staging, can someone look on stage to tell if this is working > > (or tell me what to look for)? I am pretty new to this part of Socorro :) > > Sure. For a set of crashes, I'd just pull the top twenty or so from the current > 'hang | KiFastSystemCallRet' list for beta 10: > > https://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A4.0b10&platform=windows&query_search=signature&query_type=exact&query=&date=02%2F04%2F2011%2013%3A47%3A10&range_value=1&range_unit=weeks&hang_type=hang&process_type=plugin&plugin_field=filename&plugin_query_type=exact&plugin_query=&do_query=1&admin=&signature=hang%20|%20KiFastSystemCallRet Thanks, this is done. Here's a couple notes on resubmitting crashes, in case this helps someone in the future: # pull crashes from hbase for type in json dump; do cat bug631036-ooids.txt | while read ooid; do python /data/socorro/application/socorro/storage/hbaseClient.py -h $hbaseHost get_${type} "$ooid" > bug631036/${ooid}.${type}; done; done # submit to staging python /data/socorro/application/scripts/submitter.py -u https://crash-reports.stage.mozilla.com/submit -s bug631036/ > bug631036.log Since these were resubmitted to stage through collector they get new OOIDs; here they are. Please let me know if these look ok and I'll get IT to push the config change to prod: https://crash-stats.stage.mozilla.com/report/index/de217081-34c4-41bd-8bcc-143802110204 https://crash-stats.stage.mozilla.com/report/index/209934af-a230-47b1-b7b3-64def2110204 https://crash-stats.stage.mozilla.com/report/index/0ca9c888-0d9f-422b-91ad-5e0c32110204 https://crash-stats.stage.mozilla.com/report/index/c1c493a8-14c9-4eb6-90b6-b117a2110204 https://crash-stats.stage.mozilla.com/report/index/0ac8679f-368d-4e11-95e2-00e2f2110204 https://crash-stats.stage.mozilla.com/report/index/cf31c181-0b57-42a0-b6f2-b75002110204 https://crash-stats.stage.mozilla.com/report/index/1c3efbd3-3e57-474b-b01c-7fb6f2110204 https://crash-stats.stage.mozilla.com/report/index/dc5b9a38-83cd-422e-b44d-a8f0e2110204 https://crash-stats.stage.mozilla.com/report/index/1480041b-ec67-40fe-b620-24aa62110204 https://crash-stats.stage.mozilla.com/report/index/2b640294-ae31-4f34-b914-1c0712110204 https://crash-stats.stage.mozilla.com/report/index/7939dfe4-6cff-4dac-adba-cd0002110204 https://crash-stats.stage.mozilla.com/report/index/137ede53-89cf-48e3-abf6-da54c2110204 https://crash-stats.stage.mozilla.com/report/index/931f6aa3-6d2a-4c0a-ba9d-909792110204 https://crash-stats.stage.mozilla.com/report/index/1c7aa7a3-d510-426b-8bdf-cda672110204 https://crash-stats.stage.mozilla.com/report/index/f922ed29-a828-4318-87eb-c23332110204 https://crash-stats.stage.mozilla.com/report/index/186416de-261c-4cb5-b8db-8751c2110204 https://crash-stats.stage.mozilla.com/report/index/9e87f4fd-e3e4-459d-aa62-b87f02110204 https://crash-stats.stage.mozilla.com/report/index/9b9e29e6-02ab-468f-be17-d5f072110204
Hmm, these signatures are all showing up as Signature: hang | KiFastSystemCallRet Something isn't quite right because we've added KiFastSystemCallRet to the skip list.
(In reply to comment #27) > Hmm, these signatures are all showing up as > > Signature: hang | KiFastSystemCallRet > > Something isn't quite right because we've added KiFastSystemCallRet > to the skip list. Ah, we were actually overriding this on staging, I am fixing the staging config to be like prod now. Sorry about that, will resubmit and we can look again once that's done.
Comment on attachment 509810 [details] [diff] [review] Hide KiFastSystemCallRet and a few others, rev. 1 I noticed on the Socorro config dump that processor does at startup, this didn't look right: |RaiseException|_CxxThrowException|mozilla::ipc::RPCChannel::Call\(IPC::Message\*, IPC::Message\*\)KiFastSystemCallRet| >Index: scripts/config/processorconfig.py.dist >=================================================================== >--- scripts/config/processorconfig.py.dist (revision 2912) >+++ scripts/config/processorconfig.py.dist (working copy) >@@ -105,6 +105,11 @@ > 'RaiseException', > '_CxxThrowException', > 'mozilla::ipc::RPCChannel::Call\(IPC::Message\*, IPC::Message\*\)' missing comma ^ >+ 'KiFastSystemCallRet',
Attachment #509810 - Flags: review-
Seems to work with that in place on stage: https://crash-stats.stage.mozilla.com/report/index/63d2e411-34c0-44b1-9084-c6f2d2110204 I went ahead and landed this typo fix: Committed revision 2914. Please stop me if I'm going off the rails here :)
Attached patch typo fixSplinter Review
(In reply to comment #30) > Seems to work with that in place on stage: > > https://crash-stats.stage.mozilla.com/report/index/63d2e411-34c0-44b1-9084-c6f2d2110204 > > I went ahead and landed this typo fix: > Committed revision 2914. > > Please stop me if I'm going off the rails here :) That looks perfect!
Depends on: 631690
(In reply to comment #32) > (In reply to comment #30) > > Seems to work with that in place on stage: ... > That looks perfect! Filed bug 631690 to push to prod.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Blocks: 632209
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: