Closed
Bug 631036
Opened 14 years ago
Closed 14 years ago
Update irrelevantSignatureRegEx filter to better break down hang detector stacks
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jimm, Assigned: rhelmer)
References
()
Details
Attachments
(3 files, 1 obsolete file)
485 bytes,
text/plain
|
Details | |
917 bytes,
patch
|
jimm
:
review+
rhelmer
:
review-
|
Details | Diff | Splinter Review |
565 bytes,
patch
|
Details | Diff | Splinter Review |
Spun off from bug 630230.
'KiFastSystemCallRet' lands on the top of the signature for hang detector hangs we record in crash-stats. This makes breaking these stacks down based on different causes really hard. As a result we've not spent much time digging into these, and vendors like Adobe have largely passed these hangs over. For example, we currently have 7724 flash hangs lumped together as the #1 top crasher in beta 10 for the last week. There are hangs in here that are minor and there are hangs in here that occur regularly. What we need to do is separate these out and group them together, so we can try and fix the worst offenders.
A typical example:
https://crash-stats.mozilla.com/report/index/9c1f63d7-ab81-4025-bbe4-5e4132110201
0 ntdll.dll KiFastSystemCallRet
1 user32.dll NtUserSetFocus
2 user32.dll MB_DlgProc
3 user32.dll InternalCallWinProc
preferred - hang | NtUserSetFocus
https://crash-stats.mozilla.com/report/index/692a6f94-0359-4515-a383-4c39a2110201
0 ntdll.dll KiFastSystemCallRet
1 user32.dll NtUserWaitMessage
2 user32.dll InternalDialogBox
preferred - hang | InternalDialogBox
0 ntdll.dll KiFastSystemCallRet
1 ntdll.dll NtWaitForSingleObject
2 ntdll.dll RtlpWaitOnCriticalSection
3 ntdll.dll RtlpDeCommitFreeBlock
4 NPSWF32.dll F_1706698279__________________________________
preferred - hang | F_1706698279__________________________________
ntdll.dll KiFastSystemCallRet
1 user32.dll NtUserCallHwndLock
2 xul.dll mozilla::plugins::PluginInstanceChild::AnswerUpdateWindow
preferred - hang | AnswerUpdateWindow
Working up a list of these common frames should be easy. The list would hold most of the common win api calls currently in prefixSignatureRegEx (bug 630230). The key here is that we should only ignore these calls at the top of the stack, once something unrecognized shows up, we start recording the signature. Also, when listing the stack in the crash report page, the entire stack should be provided. These apis should be hidden from developers looking at the crash.
![]() |
Reporter | |
Comment 1•14 years ago
|
||
* 'should be hidden' -> '*shouldn't* be hidden'
![]() |
Reporter | |
Comment 2•14 years ago
|
||
Is there any way to set something like this up and run some staged runs to see what the results look like? I can work up an initial list of calls to ignore.
![]() |
Reporter | |
Updated•14 years ago
|
![]() |
Reporter | |
Comment 3•14 years ago
|
||
This is based on a survey of the first 60 hangs under the current hang detector crash list.
![]() |
Reporter | |
Comment 4•14 years ago
|
||
Attachment #509283 -
Attachment is obsolete: true
Comment 5•14 years ago
|
||
here is what a sample of 100 different KiFastSystemCallRet stacks looks like.
after the filtering we should get a similar redistribution of stacks, but with new signatures.
Comment 6•14 years ago
|
||
Comment 7•14 years ago
|
||
I really don't think we want to hide most of these frames, just append them as we're doing in bug 630230. So the signature would be "hang | KiFastSystemCallRet | NtUserSetFocus" which more accurately identifies that NtUserSetFocus is blocked on a system call. You still get the unique signatures.
Note that this still won't help a whole lot with hang analysis, because what really matters is the combination of the parent and child stacks. That's what the hang report used to do before it broke with the socorro upgrade, and something we should probably do again as a map/reduce job of some sort.
Comment 8•14 years ago
|
||
What is the specific request being made here? It's kinda trivial to make a patch once we have a list of frames which should be ignored or appended.
Updated•14 years ago
|
Assignee: nobody → lars
Updated•14 years ago
|
Assignee: lars → rhelmer
Comment 9•14 years ago
|
||
I think we should skip KiFastSystemCallRet, it doesn't add any value. The next function on the stack is the actual API you're calling, knowing that you're in a syscall doesn't help that much, I don't think (plus you can see it by looking at the actual report if you care anyway).
Assignee | ||
Comment 10•14 years ago
|
||
Is the list in attachment 509284 [details] ready to go, or do we need more discussion?
I am standing by to review, help test in stage, and get this into production; just let me know.
Comment 11•14 years ago
|
||
No, we definitely do not want to add everything in the attachment to the skip list. Many of those were already added to the append list, although I'm going through the data from today and might move a few of those from the skiplist to the append list.
Comment 12•14 years ago
|
||
Argh, I meant from the append list to the skip list.
![]() |
Reporter | |
Comment 13•14 years ago
|
||
(In reply to comment #11)
> No, we definitely do not want to add everything in the attachment to the skip
> list. Many of those were already added to the append list, although I'm going
> through the data from today and might move a few of those from the skiplist to
> the append list.
Skipping that entire list would get us down to the real call we're interested in. My understanding of the append list is a bit hazy, but it seems we should move most of the append list to the skip list. What we want to avoid is multiple hangs grouped under these generic calls on crash-stats.
Comment 14•14 years ago
|
||
We really do need to know what's at the top. The whole point of the append list is to let you do that. See crashes from what was deployed yesterday, e.g.
https://crash-stats.mozilla.com/report/index/00560eab-4c5e-458e-b646-456282110203
which has a signature of
hang | KiFastSystemCallRet | NtWaitForMultipleObjects | CreateFileMappingA
Now it's probably safe to remove KiFastSystemCallRet, but knowing that we're stopped in NtWaitForMultipleObjects is IMO important. We should put as few things as we can in the skiplist, and as much as possible in the appendlist.
![]() |
Reporter | |
Comment 15•14 years ago
|
||
(In reply to comment #14)
> We really do need to know what's at the top. The whole point of the append list
> is to let you do that. See crashes from what was deployed yesterday, e.g.
>
> https://crash-stats.mozilla.com/report/index/00560eab-4c5e-458e-b646-456282110203
>
> which has a signature of
>
> hang | KiFastSystemCallRet | NtWaitForMultipleObjects | CreateFileMappingA
Ok, this sounds great.
On this crash signature, would this be grouped / ranked on crash-stats as a crash in 'hang | KiFastSystemCallRet', 'hang | KiFastSystemCallRet | NtWaitForMultipleObjects | CreateFileMappingA' or 'hang | CreateFileMappingA'?
('hang | KiFastSystemCallRet' is what we want to avoid.)
![]() |
Reporter | |
Comment 16•14 years ago
|
||
(In reply to comment #15)
> (In reply to comment #14)
> > We really do need to know what's at the top. The whole point of the append list
> > is to let you do that. See crashes from what was deployed yesterday, e.g.
> >
> > https://crash-stats.mozilla.com/report/index/00560eab-4c5e-458e-b646-456282110203
> >
> > which has a signature of
> >
> > hang | KiFastSystemCallRet | NtWaitForMultipleObjects | CreateFileMappingA
>
> Ok, this sounds great.
>
> On this crash signature, would this be grouped / ranked on crash-stats as a
> crash in 'hang | KiFastSystemCallRet', 'hang | KiFastSystemCallRet |
> NtWaitForMultipleObjects | CreateFileMappingA' or 'hang | CreateFileMappingA'?
>
> ('hang | KiFastSystemCallRet' is what we want to avoid.)
Ah, nm, I'm seeing these broken down now on crash stats since the update.
![]() |
Reporter | |
Comment 17•14 years ago
|
||
I think we need to tweek these settings though, for example:
[@ hang | KiFastSystemCallRet | ZwWaitForSingleObject | WaitForSingleObjectEx | WaitForSingleObject | google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread(_EXCEPTION_POINTERS*, MDRawAssertionInfo*) ]
the stack is:
0 ntdll.dll KiFastSystemCallRet
1 ntdll.dll ZwWaitForSingleObject
2 kernel32.dll WaitForSingleObjectEx
3 kernel32.dll WaitForSingleObject
4 xul.dll google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc:754
5 xul.dll google_breakpad::ExceptionHandler::WriteMinidumpForException toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc:769
6 xul.dll google_breakpad::ExceptionHandler::WriteMinidump toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc:811
7 xul.dll CrashReporter::CreatePairedMinidumps toolkit/crashreporter/nsExceptionHandler.cpp:2085
8 xul.dll mozilla::plugins::PluginModuleParent::ShouldContinueFromReplyTimeout dom/plugins/PluginModuleParent.cpp:254
9 xul.dll mozilla::ipc::SyncChannel::ShouldContinueFromTimeout ipc/glue/SyncChannel.cpp:261
10 xul.dll mozilla::ipc::RPCChannel::Call ipc/glue/RPCChannel.cpp:210
11 xul.dll mozilla::plugins::PPluginInstanceParent::CallUpdateWindow obj-firefox/ipc/ipdl/PPluginInstanceParent.cpp:793
It would be great if we could skip some of this, like the crash reporter, and get CallUpdateWindow in the signature.
![]() |
Reporter | |
Comment 18•14 years ago
|
||
On the flip side, this is great to see:
#27 hang | KiFastSystemCallRet | NtUserWaitMessage | InternalDialogBox
0 ntdll.dll KiFastSystemCallRet
1 user32.dll NtUserWaitMessage
2 user32.dll InternalDialogBox
3 user32.dll SoftModalMessageBox
4 user32.dll MessageBoxWorker
5 user32.dll MessageBoxTimeoutW
6 user32.dll MessageBoxExW
7 user32.dll MessageBoxW
8 NPSWF32.dll F840896513__________________________________________________
We had some discussions about the use of message box this week but we had little data to show.
Comment 19•14 years ago
|
||
The parent-side hang signature issue is a regression, already covered by bug 630233.
Comment 20•14 years ago
|
||
This is the patch I think we want.
Attachment #509810 -
Flags: review?(jmathies)
![]() |
Reporter | |
Comment 21•14 years ago
|
||
Comment on attachment 509810 [details] [diff] [review]
Hide KiFastSystemCallRet and a few others, rev. 1
This look good to me.
I think there are few additional apis we'll want in the append list but lets see how things look after we run with this config for a while.
Robert is there any chance we can get this landed today so we'll have some results by monday?
Also, is this code in hg?
Attachment #509810 -
Flags: review?(jmathies) → review+
Comment 22•14 years ago
|
||
This is socorro SVN.
Comment 23•14 years ago
|
||
It's config, so we can get it into prod anytime. Rob, can you take care of this?
Assignee | ||
Comment 24•14 years ago
|
||
Just landed attachment 509810 [details] [diff] [review] in trunk:
Committed revision 2913.
This will go live on staging in the next 15 minutes.
If I pull crashes out of production (e.g. the crash IDs in comment 0) and resubmit them to staging, can someone look on stage to tell if this is working (or tell me what to look for)? I am pretty new to this part of Socorro :)
Status: NEW → ASSIGNED
![]() |
Reporter | |
Comment 25•14 years ago
|
||
(In reply to comment #24)
> Just landed attachment 509810 [details] [diff] [review] in trunk:
> Committed revision 2913.
>
> This will go live on staging in the next 15 minutes.
>
> If I pull crashes out of production (e.g. the crash IDs in comment 0) and
> resubmit them to staging, can someone look on stage to tell if this is working
> (or tell me what to look for)? I am pretty new to this part of Socorro :)
Sure. For a set of crashes, I'd just pull the top twenty or so from the current 'hang | KiFastSystemCallRet' list for beta 10:
https://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A4.0b10&platform=windows&query_search=signature&query_type=exact&query=&date=02%2F04%2F2011%2013%3A47%3A10&range_value=1&range_unit=weeks&hang_type=hang&process_type=plugin&plugin_field=filename&plugin_query_type=exact&plugin_query=&do_query=1&admin=&signature=hang%20|%20KiFastSystemCallRet
Assignee | ||
Comment 26•14 years ago
|
||
(In reply to comment #25)
> (In reply to comment #24)
> > Just landed attachment 509810 [details] [diff] [review] in trunk:
> > Committed revision 2913.
> >
> > This will go live on staging in the next 15 minutes.
> >
> > If I pull crashes out of production (e.g. the crash IDs in comment 0) and
> > resubmit them to staging, can someone look on stage to tell if this is working
> > (or tell me what to look for)? I am pretty new to this part of Socorro :)
>
> Sure. For a set of crashes, I'd just pull the top twenty or so from the current
> 'hang | KiFastSystemCallRet' list for beta 10:
>
> https://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A4.0b10&platform=windows&query_search=signature&query_type=exact&query=&date=02%2F04%2F2011%2013%3A47%3A10&range_value=1&range_unit=weeks&hang_type=hang&process_type=plugin&plugin_field=filename&plugin_query_type=exact&plugin_query=&do_query=1&admin=&signature=hang%20|%20KiFastSystemCallRet
Thanks, this is done. Here's a couple notes on resubmitting crashes, in case this helps someone in the future:
# pull crashes from hbase
for type in json dump; do cat bug631036-ooids.txt | while read ooid; do python /data/socorro/application/socorro/storage/hbaseClient.py -h $hbaseHost get_${type} "$ooid" > bug631036/${ooid}.${type}; done; done
# submit to staging
python /data/socorro/application/scripts/submitter.py -u https://crash-reports.stage.mozilla.com/submit -s bug631036/ > bug631036.log
Since these were resubmitted to stage through collector they get new OOIDs; here they are. Please let me know if these look ok and I'll get IT to push the config change to prod:
https://crash-stats.stage.mozilla.com/report/index/de217081-34c4-41bd-8bcc-143802110204
https://crash-stats.stage.mozilla.com/report/index/209934af-a230-47b1-b7b3-64def2110204
https://crash-stats.stage.mozilla.com/report/index/0ca9c888-0d9f-422b-91ad-5e0c32110204
https://crash-stats.stage.mozilla.com/report/index/c1c493a8-14c9-4eb6-90b6-b117a2110204
https://crash-stats.stage.mozilla.com/report/index/0ac8679f-368d-4e11-95e2-00e2f2110204
https://crash-stats.stage.mozilla.com/report/index/cf31c181-0b57-42a0-b6f2-b75002110204
https://crash-stats.stage.mozilla.com/report/index/1c3efbd3-3e57-474b-b01c-7fb6f2110204
https://crash-stats.stage.mozilla.com/report/index/dc5b9a38-83cd-422e-b44d-a8f0e2110204
https://crash-stats.stage.mozilla.com/report/index/1480041b-ec67-40fe-b620-24aa62110204
https://crash-stats.stage.mozilla.com/report/index/2b640294-ae31-4f34-b914-1c0712110204
https://crash-stats.stage.mozilla.com/report/index/7939dfe4-6cff-4dac-adba-cd0002110204
https://crash-stats.stage.mozilla.com/report/index/137ede53-89cf-48e3-abf6-da54c2110204
https://crash-stats.stage.mozilla.com/report/index/931f6aa3-6d2a-4c0a-ba9d-909792110204
https://crash-stats.stage.mozilla.com/report/index/1c7aa7a3-d510-426b-8bdf-cda672110204
https://crash-stats.stage.mozilla.com/report/index/f922ed29-a828-4318-87eb-c23332110204
https://crash-stats.stage.mozilla.com/report/index/186416de-261c-4cb5-b8db-8751c2110204
https://crash-stats.stage.mozilla.com/report/index/9e87f4fd-e3e4-459d-aa62-b87f02110204
https://crash-stats.stage.mozilla.com/report/index/9b9e29e6-02ab-468f-be17-d5f072110204
![]() |
Reporter | |
Comment 27•14 years ago
|
||
Hmm, these signatures are all showing up as
Signature: hang | KiFastSystemCallRet
Something isn't quite right because we've added KiFastSystemCallRet
to the skip list.
Assignee | ||
Comment 28•14 years ago
|
||
(In reply to comment #27)
> Hmm, these signatures are all showing up as
>
> Signature: hang | KiFastSystemCallRet
>
> Something isn't quite right because we've added KiFastSystemCallRet
> to the skip list.
Ah, we were actually overriding this on staging, I am fixing the staging config to be like prod now. Sorry about that, will resubmit and we can look again once that's done.
Assignee | ||
Comment 29•14 years ago
|
||
Comment on attachment 509810 [details] [diff] [review]
Hide KiFastSystemCallRet and a few others, rev. 1
I noticed on the Socorro config dump that processor does at startup, this didn't look right:
|RaiseException|_CxxThrowException|mozilla::ipc::RPCChannel::Call\(IPC::Message\*, IPC::Message\*\)KiFastSystemCallRet|
>Index: scripts/config/processorconfig.py.dist
>===================================================================
>--- scripts/config/processorconfig.py.dist (revision 2912)
>+++ scripts/config/processorconfig.py.dist (working copy)
>@@ -105,6 +105,11 @@
> 'RaiseException',
> '_CxxThrowException',
> 'mozilla::ipc::RPCChannel::Call\(IPC::Message\*, IPC::Message\*\)'
missing comma ^
>+ 'KiFastSystemCallRet',
Attachment #509810 -
Flags: review-
Assignee | ||
Comment 30•14 years ago
|
||
Seems to work with that in place on stage:
https://crash-stats.stage.mozilla.com/report/index/63d2e411-34c0-44b1-9084-c6f2d2110204
I went ahead and landed this typo fix:
Committed revision 2914.
Please stop me if I'm going off the rails here :)
Assignee | ||
Comment 31•14 years ago
|
||
![]() |
Reporter | |
Comment 32•14 years ago
|
||
(In reply to comment #30)
> Seems to work with that in place on stage:
>
> https://crash-stats.stage.mozilla.com/report/index/63d2e411-34c0-44b1-9084-c6f2d2110204
>
> I went ahead and landed this typo fix:
> Committed revision 2914.
>
> Please stop me if I'm going off the rails here :)
That looks perfect!
Assignee | ||
Comment 33•14 years ago
|
||
(In reply to comment #32)
> (In reply to comment #30)
> > Seems to work with that in place on stage:
...
> That looks perfect!
Filed bug 631690 to push to prod.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
![]() |
Reporter | |
Comment 34•14 years ago
|
||
A search across the last few hours looks great.
https://crash-stats.mozilla.com/query/query?product=Firefox&version=ALL%3AALL&platform=windows&range_value=3&range_unit=hours&date=02%2F05%2F2011+03%3A45%3A52&query_search=signature&query_type=exact&query=&build_id=&process_type=plugin&hang_type=hang&plugin_field=filename&plugin_query_type=exact&plugin_query=&do_query=1
Thanks for all the help!
Updated•13 years ago
|
Component: Socorro → General
Product: Webtools → Socorro
You need to log in
before you can comment on or make changes to this bug.
Description
•