Minimise the sheriff impact from having to reopen low-frequency intermittent failure bugs

NEW
Unassigned

Status

Tree Management
OrangeFactor
--
enhancement
2 months ago
2 months ago

People

(Reporter: aryx, Unassigned)

Tracking

Details

• Bugs set to ‘RESOLVED’ are shown struck through in the failure suggestions.
    • If a failure has a suggestion which is a bug resolved as ‘INCOMPLETE’ or ‘WORKSFORME’ (e.g. because no failure in the last 3 weeks has been classified against it), the bug has to be set to ‘REOPENED’.
    • Developers don’t have the link to OrangeFactor enabled by default in Bugzilla, so posting a link to the new failure will make the recurrence obvious.

Comment 1

2 months ago
Hi! I'm not sure Treeherder should do this, since people really need to be looking at the bug properly before reopening it. It otherwise encourages people to just classify a failure against something existing without checking.
Nor should they have been closed automatically without looking at things like whether they are being closed due to not being seen for two weeks when they have only been hit every four weeks through their entire lifespan. But they are; this is the world we now live in, where I have to reopen ten or twenty bugs every day. I don't think this suggestion goes far enough: I think OF should reopen when it comments in an INCO bug, so we can just star them without having to do more than hover to see that a strikethrough is INCO. And that is also the existing world we already live in: probably half the INCO bugs I reopen have OF comments after the closing.

Comment 3

2 months ago
Oh wow that is a lot of bugs a day, I can see how that's annoying.

Just looking at the overall picture for a second - it seems like there are the following categories of bugs for intermittent failures:
1) Those that are occurring right now and very frequently
2) Those that used to occur frequently but have been fixed either consciously or by fortunate accident
3) Those that occur just rarely enough that the OF bot regularly closes them and a human has to reopen them
4) Those that occur only once or twice and are never seen again

In the future, when the canonical location for tracking intermittent failures is a crash-stats-like page which links to bugs for only the most severe cases, cases 3 & 4 would never have bugs filed, avoiding this problem.

However for now, since the only way bug suggestions work is by having a bug, bugs have to be filed for 3/4 - but the intention is for those bugs to be ignored by developers. The current approach for achieving that is to close the bugs, which works fine for case 4 (since it will never be reopened), but not so great for case 3.

Short of the crash-stats-like solution, possible ways of handling this are:
a) Let people manually reopen bugs (either by a button in Treeherder or visiting the bug)
b) Encourage people to classify against closed bugs, and then have the OrangeFactor daily/weekly summary tool automatically open bugs if the failure rate exceeds the threshold
c) Wait longer before closing bugs (how long is the current threshold?)

Now A is a pain for the sheriffs, and B conditions people to classify against closed bugs, which seems like it's just going to increase the risk of mis-stars against unrelated ancient bugs.

As such, I think C is the best fix. 

Geoff/Joel, do you have any thoughts?
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
We typically wait 21+ days to close a bug with no comments or failures. I like the option of A as well; we could wait 28 days with no failures and then resolve as incomplete. Typically when something is resolved the problem has gone away, and when something shows up 25 days later, it could very likely have the same failure but be a different problem. There is no right answer here, but I am happy to adjust things as we see fit.
Flags: needinfo?(jmaher)
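
As a rough illustration of the closing rule under discussion (a minimal sketch, not the bot's actual code; the function name `should_close` and its parameters are hypothetical), the only thing option C changes is the number of quiet days required before a bug is resolved:

```python
from datetime import date, timedelta


def should_close(last_failure: date, today: date, quiet_days: int = 21) -> bool:
    """Return True when no failure has been classified for at least quiet_days.

    quiet_days is the threshold being debated; 21 reflects the
    "21+ days" mentioned above, and 28 is one proposed alternative.
    """
    return (today - last_failure) >= timedelta(days=quiet_days)


# A failure last seen 25 days ago is closed under a 21-day rule
# but stays open under a 28-day rule.
last_seen = date(2017, 11, 1)
today = date(2017, 11, 26)
print(should_close(last_seen, today, quiet_days=21))  # True
print(should_close(last_seen, today, quiet_days=28))  # False
```

This shows why a bug that recurs roughly every four weeks keeps getting closed and reopened under a three-week threshold.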

Comment 5

2 months ago
I think at both ends of the scale (too short a number of days, and too long) there is risk of mis-starring. At the too short end, people will become conditioned to reopen bugs without even looking at the logs, and at the too long end so many crufty bugs will be open that failures will be classified against things that long since stopped occurring.

However if Phil is saying that he's reopening 10-20 bugs a day just on his own, then I think perhaps the threshold needs to be raised?
I am cc'd on about 2000 intermittent bugs and only see 5 or 6 per week reopened, so the bugs that are being reopened are low-frequency failures. But :philor would know the full scope of this, and I would be happy to extend the duration. Let's agree on what to extend it to:
* 28 days
* 35 days
* 42 days

?

Comment 7

2 months ago
I think C is the first thing we should look at. Also, maybe check a few examples to verify that it has been implemented the way we intend?

There's also merit to the idea of OF robot opening a closed bug with significant failures. I don't want to encourage people to rely on that mode of operation, but if the robot encounters that condition, it seems reasonable for it to re-open the bug. I don't mind implementing that if we agree to it.
Flags: needinfo?(gbrown)
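
A minimal sketch of what such a robot check might look like, assuming the robot already knows each bug's status, resolution, and the number of failures classified since it was closed. The `BugState` type, the `should_reopen` function, and the `min_failures` cutoff are all hypothetical; what counts as "significant" is not settled in this bug.

```python
from dataclasses import dataclass

# Resolutions the bot uses when closing quiet intermittent bugs.
AUTO_CLOSED_RESOLUTIONS = {"INCOMPLETE", "WORKSFORME"}


@dataclass
class BugState:
    status: str                # e.g. "RESOLVED", "NEW", "REOPENED"
    resolution: str            # e.g. "INCOMPLETE"; empty string when open
    failures_since_close: int  # failures classified after the bug was closed


def should_reopen(bug: BugState, min_failures: int = 1) -> bool:
    """Return True if a closed intermittent bug has enough new failures.

    min_failures is an assumed placeholder for "significant failures".
    """
    return (
        bug.status == "RESOLVED"
        and bug.resolution in AUTO_CLOSED_RESOLUTIONS
        and bug.failures_since_close >= min_failures
    )


closed = BugState(status="RESOLVED", resolution="INCOMPLETE", failures_since_close=3)
fixed = BugState(status="RESOLVED", resolution="FIXED", failures_since_close=3)
print(should_reopen(closed, min_failures=2))  # True
print(should_reopen(fixed, min_failures=2))   # False
```

Bugs resolved as FIXED are deliberately left alone here, since a recurrence after a real fix probably deserves a human look rather than an automatic reopen.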

Comment 8

2 months ago
(In reply to Geoff Brown [:gbrown] from comment #7)
> Also, maybe check a few
> examples to verify that it has been implemented the way we intend?

https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure%2C%20&keywords_type=allwords&list_id=13915166&resolution=---&chfieldto=Now&chfield=bug_status&query_format=advanced&chfieldfrom=2017-11-06&bug_status=REOPENED

...that includes some bugs not closed by the Bug Husbandry bot, but it provides lots of applicable examples.

Bug 1299034 for example -- looks like the bot is working as intended, but creating extra work for sheriffs.
Could we model how long the wait would ideally need to be to reach 90% accuracy in the close rate?
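
One possible way to frame that model, as a sketch under stated assumptions: take the historical gap in days between each failure and the next failure on the same bug (or `None` if the failure never recurred), count a close as "accurate" when the bug was never hit again, and pick the smallest candidate threshold whose accuracy meets the target. The function names, the definition of accuracy, and the sample numbers below are all made up for illustration; real gaps would have to come from OrangeFactor data.

```python
from typing import Optional, Sequence


def close_accuracy(gaps: Sequence[Optional[int]], threshold_days: int) -> Optional[float]:
    """Fraction of closes at this threshold that never needed a reopen.

    A gap of None means the failure never recurred; a gap longer than
    the threshold means the bug would have been closed, then reopened.
    Assumes every None gap has been quiet for longer than the threshold.
    """
    closes = [g for g in gaps if g is None or g > threshold_days]
    if not closes:
        return None
    return sum(1 for g in closes if g is None) / len(closes)


def smallest_threshold(
    gaps: Sequence[Optional[int]],
    candidates: Sequence[int] = (21, 28, 35, 42),
    target: float = 0.9,
) -> Optional[int]:
    """Return the smallest candidate threshold meeting the target accuracy."""
    for days in sorted(candidates):
        accuracy = close_accuracy(gaps, days)
        if accuracy is not None and accuracy >= target:
            return days
    return None


# Made-up illustration data: nine failures that never recurred,
# plus recurrences after 5, 10, 25, 30 and 40 days.
sample_gaps = [None] * 9 + [5, 10, 25, 30, 40]
for days in (21, 28, 35, 42):
    print(days, close_accuracy(sample_gaps, days))
print(smallest_threshold(sample_gaps))
```

With this made-up sample the 21-day rule is right for 9 of 12 closes and the 35-day rule for 9 of 10, so 35 is the first candidate to reach the 90% target; real data could of course point elsewhere.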

Updated

2 months ago
Component: Treeherder: Log Parsing & Classification → OrangeFactor
Summary: add button to reopen bugs resolved as 'incomplete' or 'worksforme' → Minimise the sheriff impact from having to reopen low-frequency intermittent failure bugs