Improve granularity of shutdownhang signatures

RESOLVED FIXED

Status

task
P1
normal
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: jimm, Assigned: willkg)

Tracking

Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

()

Attachments

(3 attachments)

Calls to add:

0 	ntdll.dll 	KiFastSystemCallRet 	
1 	ntdll.dll 	ZwWaitForKeyedEvent 	
2 	ntdll.dll 	RtlSleepConditionVariableCS 	
3 	kernel32.dll 	SleepConditionVariableCS
Oh wow--that wiki page is super out of date.

To clarify, you want to add these four strings to the prefix list which causes signature generation to continue to the next frame. Is that right?

Is this a good example crash?: https://crash-stats.mozilla.com/report/index/1a796172-3898-4f5b-a699-4da0c0170514
Flags: needinfo?(jmathies)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #1)
> Oh wow--that wiki page is super out of date.
> 
> To clarify, you want to add these four strings to the prefix list which
> causes signature generation to continue to the next frame. Is that right?
> 
> Is this a good example crash?:
> https://crash-stats.mozilla.com/report/index/1a796172-3898-4f5b-a699-
> 4da0c0170514

I believe so yes. In addition to the first list - 

KiFastSystemCallRet 	
ZwWaitForKeyedEvent 	
RtlSleepConditionVariableCS 	
SleepConditionVariableCS

from bug 1389021, lets also add:

NtWaitForAlertByThreadId -
https://crash-stats.mozilla.com/report/index/866ff980-2871-4ed9-8b7f-3a5ec0170921

NtWaitForKeyedEvent -
https://crash-stats.mozilla.com/report/index/d7a8fc7d-06b5-4001-99cf-278190170921

ZwWaitForAlertByThreadId -
https://crash-stats.mozilla.com/report/index/9acb7c15-10b5-4a58-a439-bfbaa0170921

Ted, would you mind sanity checking these? Benjamin would normally looked these over. :/
Flags: needinfo?(jmathies) → needinfo?(ted)
We have a fancy-schmantzy new signature generation tester, so I can show you what happens to signatures for x crashes for each string you want to try if that helps. It's super easy.
Looking at the prefix list, we have a .*WaitFor.* rule which should cover:

ZwWaitForKeyedEvent
NtWaitForAlertByThreadId
NtWaitForKeyedEvent
ZwWaitForAlertByThreadId


We also have an Rtl.* rule which should cover:

RtlSleepConditionVariableCS


That leaves:

KiFastSystemCallRet
SleepConditionVariableCS

Crash d7a8fc7d-06b5-4001-99cf-278190170921 is stopped on NtQueryPerformanceCounter which isn't in the list of things to add above.


Adding SleepConditionVariableCS should fix these crashes that you mentioned:

https://crash-stats.mozilla.com/report/index/866ff980-2871-4ed9-8b7f-3a5ec0170921
https://crash-stats.mozilla.com/report/index/9acb7c15-10b5-4a58-a439-bfbaa0170921
https://crash-stats.mozilla.com/report/index/1a796172-3898-4f5b-a699-4da0c0170514


Here's a super search for the last 6 months for KiFastSystemCallRet:

https://crash-stats.mozilla.com/search/?signature=~KiFastSystemCallRet&product=Firefox&date=%3E%3D2017-03-21T23%3A39%3A15.000Z&date=%3C2017-09-21T23%3A39%3A15.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

One group of these has a single frame which won't be affected by these changes. Like this crash:

https://crash-stats.mozilla.com/report/index/fb2c2169-65b6-436a-9c9c-597c20170917

The other group are all shutdown hangs and the ones I looked at have three frames, so those will improve. Like this crash:

https://crash-stats.mozilla.com/report/index/d57ed082-7de4-4b5e-ac2d-4f9350170802


So I suggest only adding these two:

KiFastSystemCallRet
SleepConditionVariableCS
I would suggest putting KiFastSystemCallRet in irrelevant_signature_re.txt. It doesn't really tell you anything of value, it's just "this thread is in a system call".
Flags: needinfo?(ted)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #1)
> Oh wow--that wiki page is super out of date.

I replaced it with a link to the Socorro signature generation docs.
(In reply to Ted Mielczarek [:ted.mielczarek] (Away Sep 23rd-Oct 1st) from comment #5)
> I would suggest putting KiFastSystemCallRet in irrelevant_signature_re.txt.
> It doesn't really tell you anything of value, it's just "this thread is in a
> system call".

sound good to me, over to Will.
KiFastSystemCallRet was already in the irrelevant list, so I did a PR for adding SleepConditionVariableCS to the prefix list.

Jim, Ted: Does this output look like what you were hoping for?:

https://github.com/mozilla-services/socorro/pull/4013#issuecomment-331439138
Flags: needinfo?(ted)
Flags: needinfo?(jmathies)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #9)
> KiFastSystemCallRet was already in the irrelevant list, so I did a PR for
> adding SleepConditionVariableCS to the prefix list.
> 
> Jim, Ted: Does this output look like what you were hoping for?:
> 
> https://github.com/mozilla-services/socorro/pull/4013#issuecomment-331439138

What does this warning mean?

"SignatureTool: signature truncated due to length"

Are we losing granularity here due to the length of the resulting sagnatures?
Flags: needinfo?(willkg)
Flags: needinfo?(ted)
Flags: needinfo?(jmathies)
It means that the signature length exceeded 255 characters and the signature generation system truncated it.

Yes, we're definitely losing granularity by truncating, but I think beyond 255 characters, the granularity is too small for bucketing to be useful.

If you see other things we can do to make those signatures better, it's definitely worth opening up bugs for those changes.

Does that output look like what you were hoping for?
Flags: needinfo?(willkg) → needinfo?(jmathies)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #11)
> It means that the signature length exceeded 255 characters and the signature
> generation system truncated it.
> 
> Yes, we're definitely losing granularity by truncating, but I think beyond
> 255 characters, the granularity is too small for bucketing to be useful.
> 
> If you see other things we can do to make those signatures better, it's
> definitely worth opening up bugs for those changes.
> 
> Does that output look like what you were hoping for?

Don't think it'll do any good as is, we end up with generic signatures here due to the truncation. Can we force the removal of some of these frames to shorten things up? For example:

nsEventQueue::GetEvent
nsThread::nsChainedEventQueue::GetEvent

Currently we end up with:

(generic frames) | +

nsEventQueue::GetEvent | nsThread::nsChainedEventQueue::GetEvent | nsThread::GetEvent ...
nsEventQueue::GetEvent | nsThread::nsChainedEventQueue::GetEvent | nsThread::GetEvent ...
nsEventQueue::GetEvent | nsThread::nsChainedEventQueue::GetEvent | nsThread::GetE...
nsEventQueue::GetEvent | nsThread::nsChainedEventQueue::GetEvent | nsThread::GetEvent ...
nsEventQueue::GetEvent | nsThread::nsChainedEventQueue::GetEvent | nsThread::GetEvent ...
nsEventQueue::GetEvent | nsThread::nsChainedEventQueue::GetEvent | nsThread::GetE...
Flags: needinfo?(jmathies)
I think adding SleepConditionVariableCS to the prefix list is better, but the signatures still aren't helpful.

I'm going to land my PR and get that pushed out. I'm also going to change the scope of this bug to "make shutdownhang signatures better".

I'll keep working on this today. I want to look at shutdownhang bugs and get a feel for what they look like and patterns and a better handle on "what is the interesting part of the stack that makes these actionable".
Summary: [skiplist] Add SleepConditionVariableCS related frames to the skiplist → make shutdownhang signatures better
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
The change for SleepConditionVariableCS went to -prod an hour ago.

I'm almost out of time for today. I'll get to more shutdownhang work on Monday.
Reopening this because it's not done, yet. Sorry about that.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: nobody → willkg
Status: REOPENED → ASSIGNED
Duplicate of this bug: 1375511
We're still missing soem granularity here, maybe we can do something to address it. For example this signature:

shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS | mozilla::detail::ConditionVariableImpl::wait | mozilla::CondVar::Wait | mozilla::ThreadEventQueue<T>::GetEvent 

https://crash-stats.mozilla.com/signature/?product=Firefox&version=58.0a1&signature=shutdownhang%20%7C%20NtWaitForAlertByThreadId%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS%20%7C%20mozilla%3A%3Adetail%3A%3AConditionVariableImpl%3A%3Await%20%7C%20mozilla%3A%3ACondVar%3A%3AWait%20%7C%20mozilla%3A%3AThreadEventQueue%3CT%3E%3A%3AGetEvent&date=%3E%3D2017-09-16T20%3A20%3A11.000Z&date=%3C2017-09-23T20%3A20%3A11.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports

Is a bucket of various timeout, for example:

mozilla::layers::CompositorThreadHolder::Shutdown()
https://crash-stats.mozilla.com/report/index/6d5cf3e2-2ab6-4063-b8be-f49fc0170923

mozilla::dom::quota::QuotaManager::ShutdownObserver::Observe()
https://crash-stats.mozilla.com/report/index/e623401f-d13e-4ed5-89ee-f3aeb0170923

mozilla::SharedThreadPoolShutdownObserver::Observe()
https://crash-stats.mozilla.com/report/index/d2024669-7473-4300-a697-2ccbe0170923

We could add to the ignore list I think to help here:

1) conditional variable frames like:

mozilla::detail::ConditionVariableImpl
mozilla::CondVar::Wait

2) default thread event processing

mozilla::ThreadEventQueue
nsThread::ProcessNextEvent
NS_ProcessNextEvent
Or we could ignore the left hand side, frames like 

NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS
No longer depends on: 1364798, 1389021
Another possible improvement - strip out namespace information to shorten signatures.
Summary: make shutdownhang signatures better → Improve granularity of shutdownhang signatures
I wrote up comment #13, but over the weekend became more aware that this is one of a myriad of similar bugs all driving at a similar purpose and that a lot of people might be interested. If I had know that earlier, I would have written a less terse comment #13. Sorry about that!

I purposely didn't use the word "granularity" in the summary of this bug. Shutdown hangs aren't really crashes. The stack for a shutdown hang situation is generally a bunch of uninteresting stuff and then the set of interesting/relevant frames. We're not looking for the point at which the code died, but rather what was going on while it was hanging. I don't think it makes sense to frame this around granularity (increasing the number of things we include) but rather framing this around relevancy (making sure signatures have the interesting/relevant frame as left-most as possible).

The signatures we were talking about prior to comment #13 all suffered from the same problem--there was so much generic stuff that the interesting/relevant bits get truncated. I don't know if that's because this bug was originally about signatures with SleepConditionVariableCS in them or whether this is true of many/most shutdownhang bugs.

I talked to Ted briefly on Friday before writing up comment #13 and we threw around some ideas:


1. maintain a separate irrelevant list for shutdownhang

This would work like the other irrelevant list and frames that were in the list would get skipped and not be part of the signature.

That might work fine, but might be a lot of maintenance.


2. maintain a separate sentinel list for shutdownhang

This would work like the other sentinel list and frames in this list determine the first frame we should look at. Anything before that would get dropped.

Maybe it's possible for us to add an item to this list that fixes shutdownhangs but isn't shutdownhang-specific.


3. some shutdownhang-specific behavior encoded in Python

Jim's idea to strip out namespace information would be in this category.


I need to spend some time looking at all the recent shutdownhang signatures and summarize what we've got. Then I'll be in a better spot to talk about suggestions.
I spent the bulk of today looking at like 16,000 shutdownhang crashes and I think going with something outlined in comment #18 and comment #19 seems most helpful.

I was thinking maybe we had to do something shutdownhang specific, but I looked at these parts and we pretty much only see them in shutdownhang crashes.

I want to add the following to the irrelevant list:

mozilla::CondVar::Wait
mozilla::detail::ConditionVariableImpl.*
mozilla::ThreadEventQueue.*
nsEventQueue::GetEvent
nsThread::GetEvent
nsThread::nsChainedEventQueue::GetEvent
nsThread::ProcessNextEvent
NS_ProcessNextEvent
NtWaitForKeyedEvent
RtlSleepConditionVariableCS
RtlSleepConditionVariableSRW
SleepConditionVariableCS
SleepConditionVariableSRW
ZwWaitForKeyedEvent

I'll attach the change in signatures particularly ones that we want to target which are getting truncated because they're too long.

Does this make the shutdownhangs situation better? Should we make this change, see what happens, and then do another round later this week?

Also, should we reprocess shutdownhang crashes with signatures that got truncated over the last week?
Shows changes in signatures for 1000 crashes from 2017-09-24.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #22)
> I spent the bulk of today looking at like 16,000 shutdownhang crashes and I
> think going with something outlined in comment #18 and comment #19 seems
> most helpful.
> 
> I was thinking maybe we had to do something shutdownhang specific, but I
> looked at these parts and we pretty much only see them in shutdownhang
> crashes.
> 
> I want to add the following to the irrelevant list:
> 
> mozilla::CondVar::Wait
> mozilla::detail::ConditionVariableImpl.*
> mozilla::ThreadEventQueue.*
> nsEventQueue::GetEvent
> nsThread::GetEvent
> nsThread::nsChainedEventQueue::GetEvent
> nsThread::ProcessNextEvent
> NS_ProcessNextEvent
> NtWaitForKeyedEvent
> RtlSleepConditionVariableCS
> RtlSleepConditionVariableSRW
> SleepConditionVariableCS
> SleepConditionVariableSRW
> ZwWaitForKeyedEvent


SFTM.

> Does this make the shutdownhangs situation better? Should we make this
> change, see what happens, and then do another round later this week?

Yes.

> Also, should we reprocess shutdownhang crashes with signatures that got
> truncated over the last week?

Na, volume tends to refresh things pretty quickly.
bleh - s/SFTM/SGTM

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #23)
> Created attachment 8911968 [details]
> new_signatures_20170925.txt
> 
> Shows changes in signatures for 1000 crashes from 2017-09-24.

These look great!
That looks really good!

Could you also add mozilla::SpinEventLoopUntil<T> to the prefix list? There are a bunch of call sites that would be nice to distinguish. http://searchfox.org/mozilla-central/search?q=symbol:_ZN7mozilla18SpinEventLoopUntilEOT0_P9nsIThread&redirect=false
Andrew: Do you have an example crash for that?

Also, just to clarify, do you want to add mozilla::SpinEventLoopUntil<T> to the prefix list (signature continues past that frame) or the irrelevant list (frame is ignored and doesn't show up in the signature)?
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #27)
> Andrew: Do you have an example crash for that?

Here's one: bp-fd4655f3-d0f8-47dc-aff5-c9ecc0170926

> Also, just to clarify, do you want to add mozilla::SpinEventLoopUntil<T> to
> the prefix list (signature continues past that frame) or the irrelevant list
> (frame is ignored and doesn't show up in the signature)?

I think the prefix list makes sense. That way you can tell at a glance that the event loop is being spun. Though if you do that, you'd also want mozilla::SharedThreadPool::SpinUntilEmpty()  (which is rarer, but is seen for example here bp-182340c8-fbd3-4d94-b2e6-1207f0170924 )

Either that or they could be added to the irrelevant signature list, but nsThread::ProcessNextEvent or some other event thing could be on the prefix list instead of the irrelevant signature list.

Nathan, do you have any suggestions about which frame of the "spinning the event loop" boilerplate is kind of the most canonical? We'd want that on the skip list, and the rest of the boilerplate on the irrelevant list, so we could see at a glance that a hang was happening while spinning the event loop, and where it was happening, but without cluttering the signature too much.
Flags: needinfo?(nfroyd)
I meant to make this a P1 earlier.
Priority: -- → P1
Here's a crash that has a truncated signature that isn't fixed by the changes in comment #22:

10767d8d-e9c5-47c0-88cb-66e980170925

shutdownhang | NtWaitForAlertByThreadId | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | je_calloc | _PR_AttachThread | PR_NativeCreateThread | PR_CreateThrea...

Is that signature truncated before the interesting part?


Here's another one:

b0a9831f-1282-46f3-86b1-745810170925

shutdownhang | NtWaitForAlertByThreadId | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | PR_Lock | nssTrustDomain_RemoveTokenCertsFromCache | nssToken_NotifyC...

Is that signature truncated before the interesting part?
(In reply to Andrew McCreight [:mccr8] from comment #28)
> I think the prefix list makes sense. That way you can tell at a glance that
> the event loop is being spun. Though if you do that, you'd also want
> mozilla::SharedThreadPool::SpinUntilEmpty()  (which is rarer, but is seen
> for example here bp-182340c8-fbd3-4d94-b2e6-1207f0170924 )
>
> Nathan, do you have any suggestions about which frame of the "spinning the
> event loop" boilerplate is kind of the most canonical? We'd want that on the
> skip list, and the rest of the boilerplate on the irrelevant list, so we
> could see at a glance that a hang was happening while spinning the event
> loop, and where it was happening, but without cluttering the signature too
> much.

Did you mean "skip list" or "prefix list" for the canonical entry point(s)?

I think you want mozilla::SpinEventLoopUntil<T> and NS_ProcessNextEvent on a list that will show up in the signatures, so the event loop spinning (or at least waiting) is pretty obvious, and everything that would occur beneath those (nsThread::ProcessNextEvent, nsEventQueue::GetEvent, etc.) on the irrelevant list.

What would be *splendid* would be for these sorts of annotations to appear in the code themselves, and then processed out of the code for consumption by socorro, so we wouldn't have the problem of renaming things and having people panic because new crashes started appearing out of nowhere, and so people would have an inkling that these functions are important when they get renamed or potentially removed or what have you.  If we could produce some sort of build artifact with this sort of information, is it technically feasible for that to be imported into socorro and used for builds with particular build IDs?
Flags: needinfo?(nfroyd) → needinfo?(willkg)
> If we could produce some sort of build artifact with this sort of information, is it technically feasible for that to be imported into socorro and used for builds with particular build IDs?

I don't see why not, but maybe this is a case of "the devil is in the details"? If we want to explore this, we'd definitely want to do it in a separate bug.
Flags: needinfo?(willkg)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #30)
> shutdownhang | NtWaitForAlertByThreadId | RtlpWaitOnAddressWithTimeout |
> RtlpWaitOnAddress | RtlpWaitOnCriticalSection |
> RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | je_calloc |
> _PR_AttachThread | PR_NativeCreateThread | PR_CreateThrea...
> 
> Is that signature truncated before the interesting part?

I think a lot of that could be added to the irrelevant list. NtWaitForAlertByThreadId RtlpWaitOnAddressWithTimeout RtlpWaitOnAddress RtlpWaitOnCriticalSection RtlpEnterCriticalSectionContended RtlEnterCriticalSection to start with. Maybe some of the NSPR thread creation stuff.

> shutdownhang | NtWaitForAlertByThreadId | RtlpWaitOnAddressWithTimeout |
> RtlpWaitOnAddress | RtlpWaitOnCriticalSection |
> RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | PR_Lock |
> nssTrustDomain_RemoveTokenCertsFromCache | nssToken_NotifyC...
> 
> Is that signature truncated before the interesting part?

This looks like it would benefit from the same list of things being added to the irrelevant signature list.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #33)
> > If we could produce some sort of build artifact with this sort of information, is it technically feasible for that to be imported into socorro and used for builds with particular build IDs?
> 
> I don't see why not, but maybe this is a case of "the devil is in the
> details"? If we want to explore this, we'd definitely want to do it in a
> separate bug.

We are in violent agreement on this point.  I will file.
(In reply to Andrew McCreight [:mccr8] from comment #34)
> (In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #30)
> > shutdownhang | NtWaitForAlertByThreadId | RtlpWaitOnAddressWithTimeout |
> > RtlpWaitOnAddress | RtlpWaitOnCriticalSection |
> > RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | je_calloc |
> > _PR_AttachThread | PR_NativeCreateThread | PR_CreateThrea...
> > 
> > Is that signature truncated before the interesting part?
> 
> I think a lot of that could be added to the irrelevant list.
> NtWaitForAlertByThreadId RtlpWaitOnAddressWithTimeout RtlpWaitOnAddress
> RtlpWaitOnCriticalSection RtlpEnterCriticalSectionContended
> RtlEnterCriticalSection to start with. Maybe some of the NSPR thread
> creation stuff.

I'll do that in the next pass.
(In reply to Nathan Froyd [:froydnj] from comment #35)
> (In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #33)
> > > If we could produce some sort of build artifact with this sort of information, is it technically feasible for that to be imported into socorro and used for builds with particular build IDs?
> > 
> > I don't see why not, but maybe this is a case of "the devil is in the
> > details"? If we want to explore this, we'd definitely want to do it in a
> > separate bug.
> 
> We are in violent agreement on this point.  I will file.

Awesome--thank you!
So, to sum up where things are at here because I like summing things up:

1. PR 4017 (https://github.com/mozilla-services/socorro/pull/4017) covers the changes in comment #22. That'll get reviewed, land, and pushed to -prod soon.

2. In comment #26 and #28, Andrew suggested we add mozilla::SpinEventLoopUntil<T> to the prefix list. That needs to be worked on.

3. In comment #31, Nathan suggested annotating the code and generating a build artifact that Socorro could use which would reduce some of the manual maintenance we're doing to the lists. That'll be worked on in a separate bug.

4. In comment #34, Andrew suggested we add NtWaitForAlertByThreadId RtlpWaitOnAddressWithTimeout RtlpWaitOnAddress RtlpWaitOnCriticalSection RtlpEnterCriticalSectionContended RtlEnterCriticalSection to the irrelevant list. That needs to be worked on.

That comment also says "Maybe some of the NSPR thread creation stuff." That's cool, but I'd need someone to expand on that. We can do that in another pass, too.


Two things I want to mention. First, we're not going to solve all possible problems with this bug--we just want to make the situation better than it is now. I don't think anyone would argue with that, but it leads to my second issue. Second, I got a lot of other things I need to work on and I claim we're going to hit the 80/20 rule soon here. At that point, I think it's prudent to mark this FIXED (hopefully with raucous applause) and defer any outstanding work to new bugs.
Comment on attachment 8912298 [details] [review]
pr 4017 - add irrelevant strings

Review of attachment 8912298 [details] [review]:
-----------------------------------------------------------------

::: socorro/siglists/irrelevant_signature_re.txt
@@ +59,5 @@
>  RtlpAdjustHeapLookasideDepth
> +RtlSleepConditionVariableCS
> +RtlSleepConditionVariableSRW
> +SleepConditionVariableCS
> +SleepConditionVariableSRW

These look like thy can be combined assuming these filter rules work:

(Rtl|)SleepConditionVariable(CS|SRW)
Attachment #8912298 - Attachment is patch: true
Attachment #8912298 - Attachment mime type: text/x-github-pull-request → text/plain
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #38)
> So, to sum up where things are at here because I like summing things up:
> 
> 1. PR 4017 (https://github.com/mozilla-services/socorro/pull/4017) covers
> the changes in comment #22. That'll get reviewed, land, and pushed to -prod
> soon.

Yeah lets land what you have and resolve this bug out. We've done enough work here.

> 2. In comment #26 and #28, Andrew suggested we add
> mozilla::SpinEventLoopUntil<T> to the prefix list. That needs to be worked
> on.

-> follow up bug

> 3. In comment #31, Nathan suggested annotating the code and generating a
> build artifact that Socorro could use which would reduce some of the manual
> maintenance we're doing to the lists. That'll be worked on in a separate bug.

-> long term project, follow up bug

> 4. In comment #34, Andrew suggested we add NtWaitForAlertByThreadId
> RtlpWaitOnAddressWithTimeout RtlpWaitOnAddress RtlpWaitOnCriticalSection
> RtlpEnterCriticalSectionContended RtlEnterCriticalSection to the irrelevant
> list. That needs to be worked on.

-> low hanging fruit, follow up bug

> That comment also says "Maybe some of the NSPR thread creation stuff."
> That's cool, but I'd need someone to expand on that. We can do that in
> another pass, too.

-> lets see if we need it after we land current changes.

> Two things I want to mention. First, we're not going to solve all possible
> problems with this bug--we just want to make the situation better than it is
> now. I don't think anyone would argue with that, but it leads to my second
> issue. Second, I got a lot of other things I need to work on and I claim
> we're going to hit the 80/20 rule soon here. At that point, I think it's
> prudent to mark this FIXED (hopefully with raucous applause) and defer any
> outstanding work to new bugs.

Sounds good, thank Will!
Jim: We do reviews in github--not in Bugzilla comments. I can reply to it there.

Also, it looks like you broke the github link for that comment by changing the mime type. Can you change that back, please?
Attachment #8912298 - Attachment is patch: false
Attachment #8912298 - Attachment mime type: text/plain → text/x-github-pull-request
Commit pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/072704ba0079c47f63e909119f5ca53035e9662e
bug 1402037 - improve shutdownhang signatures (#4017)

* bug 1402037 - improve shutdownhang signatures

We have a number of shutdownhang signatures where the stack is rife with
"uninteresting" bits such that the signature gets truncated before getting to
something interesting or relevant or actionable.

This adds a bunch of things to the irrelevant list such that they don't show up
in the signature.

I did some regression testing and these strings only show up in shutdownhang
crashes. If that changes, we should rethink them on a string-by-string basis.

* Collapse two lines into one
(In reply to [:philipp] from comment #43)
> this is how the distribution of shutdownhang signatures is looking now :-)
> https://crash-stats.mozilla.com/search/
> ?signature=%5Eshutdownhang&date=%3E%3D2017-09-27T18%3A00%3A00.
> 000Z&_facets=signature&_facets=release_channel#facet-signature

Filtering out thunderbird - 

https://crash-stats.mozilla.com/search/?signature=%5Eshutdownhang&product=Firefox&date=%3E%3D2017-09-27T18%3A00%3A00.000Z&date=%3C2017-09-28T11%3A36%3A00.000Z&_sort=-date&_facets=signature&_facets=release_channel&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

These signatures look &*%@! awesome! Thanks Will!
Bug #1403255 covers Nathan's idea of producing a build artifact that Socorro can use to augment its signature generation related lists.

I think if signatures are much better now, then we should celebrate and close this bug out. The nature of this problem is that things change over time. When things get rough again, we can open up a new bug, see what's going on, and address those observations then.

I wrote up some other bugs to fix issues with the signature generation cli tooling to make analysis and iterating over changes faster/easier. I'll work on those over the next few months. That'll make future work like this easier.

Given that, I'm going to mark this FIXED. We can open up new bugs as needed. Thank you everyone!
Status: ASSIGNED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
See Also: → 1409425
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #38)
> 4. In comment #34, Andrew suggested we add NtWaitForAlertByThreadId
> RtlpWaitOnAddressWithTimeout RtlpWaitOnAddress RtlpWaitOnCriticalSection
> RtlpEnterCriticalSectionContended RtlEnterCriticalSection to the irrelevant
> list. That needs to be worked on.

Everything before RtlEnterCriticalSection is definitely uninteresting details of the Windows kernel in that stack you pointed out. I don't know if that wants to be a sentinel or what the best way to deal with that is.
You need to log in before you can comment on or make changes to this bug.