Closed Bug 1674462 Opened 4 years ago Closed 2 years ago

don't set a priority and severity for bugs filed for crashes in automation

Categories

(Tree Management :: Treeherder, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: aryx)


Discussed with RyanVM: Crashes encountered during the automated tests should be part of the triage.

Let's set no priority and severity for now. Failing tests get P5 as priority and S4 as severity which gets them skipped during triage.

cc Gijs to shout 'STOP' if this is wrong.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #0)

Discussed with RyanVM: Crashes encountered during the automated tests should be part of the triage.

Let's set no priority and severity for now. Failing tests get P5 as priority and S4 as severity which gets them skipped during triage.

cc Gijs to shout 'STOP' if this is wrong.

rs=me

If we can also be better about clearing priority + severity + needinfo'ing the triage owner for intermittents that increase in frequency in general, even better.

Sorry that this didn't get implemented. Crash bugs automatically get severity S2 nowadays. Is there still interest in having bugs about intermittent CI crashes filed without a priority set?

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #2)

Sorry that this didn't get implemented. Crash bugs automatically get severity S2 nowadays. Is there still interest in having bugs about intermittent CI crashes filed without a priority set?

I think so, then they show up in triage... I'm a bit confused because I recall seeing bugs in triage recently - what's the current state?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(aryx.bugmail)

Ah, because treeherder sets a severity (S4), that gets used. And the priority is not set for crashes.

Flags: needinfo?(aryx.bugmail)

So is this WFM at this point? Or do you think we should not set severity, either?

Flags: needinfo?(aryx.bugmail)

Your call - you likely have better insight into what untriaged CI crashes should look like.

Flags: needinfo?(aryx.bugmail)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #6)

Your call - you likely have better insight into what untriaged CI crashes should look like.

(link for context)

So in the last 3 months, 160-odd crash bugs got filed. 4 got resolved incomplete, 21 dupes, 9 fixed. 2 of the 128 open bugs are assigned.

From a random sample of 11 bugs, only 1 had more than 0 crashes in the last few weeks. So it seems like the crashes that we find rarely repeat frequently enough for them to be a problem on infra. In fact, the crash that I ran into that has non-0 crashes in the last week (bug 1713969) is just us killing the process because it hangs / times out running the test, on android. The stack there looks bogus to me.

17 of the 128 open bugs have a priority set; most of them are P5s.

This leaves around 110 bugs not prioritized at all. I know that in the frontend components I triage, we filter for bugs where either priority or severity is missing, but it looks like either other components don't necessarily do so, or that they are under-triaged.

So how to proceed here... I think not setting S4 would probably get some more eyes on bugs, but I'm not sure if that will really help.

Based on something Aryx said on matrix, I wonder if instead we should talk to the crash triage folks about whether we could correlate things better (e.g. being able to highlight infra crashes that correspond to user-experienced crashes). Gabriele, is that something we could do/automate?

I'm also curious if we could get some kind of automated alerting to happen - if an intermittent crasher exceeds 5 crashes a week, needinfo the triage owner. Aryx, do you have thoughts on that?

(In reply to :Gijs (he/him) from comment #1)

If we can also be better about clearing priority + severity + needinfo'ing the triage owner for intermittents that increase in frequency in general, even better.

FWIW, this is still something I'd like us to be better about, but I don't know how to help make that happen. I think this is probably higher value than what we do with the generally-rare/infrequent intermittent crash bugs in general.

Flags: needinfo?(gsvelto)
Flags: needinfo?(aryx.bugmail)

(In reply to :Gijs (he/him) from comment #7)

I'm also curious if we could get some kind of automated alerting to happen - if an intermittent crasher exceeds 5 crashes a week, needinfo the triage owner. Aryx, do you have thoughts on that?

Yes, that's possible with https://github.com/mozilla/relman-auto-nag and Treeherder API end points (the former has more functionality to interact with Bugzilla, hence it will likely live there).
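A minimal sketch of the threshold check discussed above, assuming the weekly crash counts per bug have already been fetched (for example from Treeherder) into a plain dictionary. The function name, the data shape and the bug numbers are hypothetical, and no real Treeherder or Bugzilla API call is shown; only the "exceeds 5 crashes a week" rule comes from this thread.

```python
from typing import Dict, List


def bugs_needing_needinfo(weekly_crashes: Dict[int, int], threshold: int = 5) -> List[int]:
    """Return the bug ids whose crash count in the last week exceeds the threshold."""
    return sorted(bug for bug, count in weekly_crashes.items() if count > threshold)


# Hypothetical weekly crash counts, keyed by bug number.
counts = {1000001: 0, 1000002: 6, 1000003: 5}

# Only bugs strictly above the threshold are flagged for a needinfo.
flagged = bugs_needing_needinfo(counts)
```

A bug with exactly 5 crashes is not flagged here, because the proposal says "exceeds 5"; whether the real rule should be inclusive is an open choice.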

Flags: needinfo?(aryx.bugmail)

(In reply to :Gijs (he/him) from comment #7)

Based on something Aryx said on matrix, I wonder if instead we should talk to the crash triage folks about whether we could correlate things better (e.g. being able to highlight infra crashes that correspond to user-experienced crashes). Gabriele, is that something we could do/automate?

Yes, that's possible. Looking through a few of the crashes, one thing that stands out is that the signature is awfully specific because it contains the offset from the start of the function where the crash happened, e.g. [@ PR_NativeRunThread + 0x16a]. This will prevent crashes from clumping under the same signature, as different builds will have different offsets (especially on try, where users push in-progress work). Additionally, these crashes will never correlate with crashes in Socorro because the signature generation is different.

We have a number of bugs on file for this topic (bug 828452, bug 867571 and bug 1280658 off the top of my head). Will has already written a stand-alone Python module to produce Socorro crash signatures outside of Socorro, so we've got that part already. Wiring it up into our automation code is a simple matter of coding™, but we haven't found anybody with the spare cycles to do it yet. IIUC fixing bug 867571 as discussed there would fix the issue of these crashes not aggregating correctly (and not correlating to the ones sent by our users).
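To illustrate why the offset keeps crashes from grouping, here is a small sketch that strips the trailing "+ 0x..." offset before counting signatures. This is not Socorro's signature generation and not the stand-alone module mentioned above; the regular expression, the function name and the second sample signature are assumptions made for illustration only.

```python
import re
from collections import Counter

# Matches an offset such as " + 0x16a" inside a signature (assumed format).
_OFFSET = re.compile(r"\s*\+\s*0x[0-9a-fA-F]+")


def strip_offset(signature: str) -> str:
    """Remove the function-relative offset so builds with different offsets group together."""
    return _OFFSET.sub("", signature)


signatures = [
    "[@ PR_NativeRunThread + 0x16a]",
    "[@ PR_NativeRunThread + 0x172]",  # hypothetical offset from a different build
]

# Without stripping, these are two distinct signatures; with it, they share one key.
grouped = Counter(strip_offset(s) for s in signatures)
```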

Flags: needinfo?(gsvelto)
Component: Treeherder: Log Parsing & Classification → TreeHerder

(In reply to :Gijs (he/him) from comment #1)

If we can also be better about clearing priority + severity + needinfo'ing the triage owner for intermittents that increase in frequency in general, even better.

This is also something that we could implement easily in autonag, but it would increase noise and would be more work for triage owners. Do you still think we should do it?

Regarding the initial goal of this bug, now that triage processes are more well defined, I think it would make sense to avoid setting S4 for crashes caught in CI and let triage run its course and decide the actual severity. Especially for things caught by sanitizers like ASAN and TSAN.

(In reply to Marco Castelluccio [:marco] from comment #10)

(In reply to :Gijs (he/him) from comment #1)

If we can also be better about clearing priority + severity + needinfo'ing the triage owner for intermittents that increase in frequency in general, even better.

This is also something that we could implement easily in autonag, but it would increase noise and would be more work for triage owners. Do you still think we should do it?

I think for automated tests that end up with "owner needed" or "disable recommended" whiteboard tags from the intermittent bot, we should clear sev/prio and needinfo the triage owner, yes. Basically I think the triage owner should have seen the issue at least once before the test gets disabled, though of course they may still decide it doesn't warrant taking time out of other work to address. Does that sound reasonable?

Flags: needinfo?(mcastelluccio)

Sounds good, I just remembered we already have a similar issue on file: https://github.com/mozilla/relman-auto-nag/issues/1126.

Flags: needinfo?(mcastelluccio)

dkl, is it possible to use the Bugzilla API to create bugs with the crash keyword and the severity --? We'd like to file bugs for CI failures without severity if it is a crash.

Flags: needinfo?(dkl)

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #13)

dkl, is it possible to use the Bugzilla API to create bugs with the crash keyword and the severity --? We'd like to file bugs for CI failures without severity if it is a crash.

Unfortunately not the way BMO is currently coded. The default severity is already set to '--' and if a bug has the 'crash' keyword it will automatically set the severity to 'S2'. If you were setting it to anything other than '--' then you could do it.

https://github.com/mozilla-bteam/bmo/blob/master/extensions/MozChangeField/lib/Post/CrashKeywordSetSeverity.pm
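The behaviour described in this comment can be modelled roughly as follows. This is a Python sketch of the rule as stated here, not the Perl code in the linked extension, and the function name is hypothetical.

```python
from typing import Iterable


def effective_severity(requested: str, keywords: Iterable[str]) -> str:
    """Model the rule: an unset severity ('--') on a 'crash' bug becomes 'S2'."""
    if requested == "--" and "crash" in set(keywords):
        return "S2"
    # Any explicitly chosen severity, or a bug without the keyword, is left alone.
    return requested


# Filing a crash bug with severity '--' therefore cannot stay at '--'.
result = effective_severity("--", ["crash", "intermittent-failure"])
```

Under this rule, the only way to keep a crash bug away from S2 at filing time is to request some other explicit severity, which is why filing CI crashes with '--' is not possible without a change on the BMO side.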

Flags: needinfo?(dkl)
Depends on: 1728615

https://github.com/mozilla/treeherder/pull/7527 fixed this, though until bug 1728615 is fixed we are wrongly filing intermittent crashes as S2.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED