Open Bug 1299274 Opened 4 years ago Updated 6 months ago

Improve the classification of intermittent failures that aren't associated with just one test

Categories

(Tree Management :: Treeherder: Log Parsing & Classification, defect)

defect
Not set
normal

Tracking

(Not tracked)

REOPENED

People

(Reporter: RyanVM, Unassigned)

References

(Blocks 1 open bug)

Details

Since the days of TBPL, we've had recurring issues keeping track of "meta-failures" - i.e. widespread issues that have a large cumulative impact, but spread across a number of bugs, which understates the severity on OrangeFactor.

This happens because we rely on the bug summary to generate bug suggestions for intermittent failures, which quickly falls apart under some scenarios:
* Too many failures with a given signature, hitting Treeherder's suggestion limit (e.g. [@ mozalloc_abort]).
* Different failure modes with a common root cause (e.g. the Android mochitest-gl memory corruption issues we're currently hitting, the Windows plugin test hangs, or the current spate of Marionette e10s crashes), leaving insufficient room in the bug summary to encapsulate everything in a usable way.

Just in the past week, there have been multiple requests asking the sheriff team to manually star various failures under one main tracking bug, which is error-prone and highly likely to lead to inconsistent results. The path to consistent results is the path of least resistance - which means making Treeherder generate better bug suggestions in these scenarios.

My strawman proposal:
* Add a new meta-orange bug keyword (feel free to bikeshed the name to your heart's content).
* If intermittent-failure bug XXX is marked as blocking meta-orange bug YYY, have Treeherder suggest bug YYY in lieu of bug XXX.
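
A minimal sketch of the remapping step in the strawman above, assuming Treeherder already has each suggested bug's keywords and "blocks" list in hand. The `Bug` class, the `remap_*` helpers and the `meta-orange` keyword are illustrative names only, not existing Treeherder or Bugzilla code:

```python
from dataclasses import dataclass, field

# Placeholder keyword name from the proposal; it does not exist in Bugzilla today.
META_KEYWORD = "meta-orange"


@dataclass
class Bug:
    """Hypothetical, simplified view of a bug as the suggester would see it."""
    id: int
    keywords: set = field(default_factory=set)
    blocks: list = field(default_factory=list)  # ids of the bugs this bug blocks


def remap_suggestion(bug_id, bugs):
    """Return the meta bug to suggest in lieu of bug_id, or bug_id itself."""
    bug = bugs.get(bug_id)
    if bug is None:
        return bug_id
    for blocked_id in bug.blocks:
        blocked = bugs.get(blocked_id)
        if blocked is not None and META_KEYWORD in blocked.keywords:
            return blocked_id
    return bug_id


def remap_suggestions(suggested_ids, bugs):
    """Remap every suggestion, dropping duplicates while keeping order."""
    seen, result = set(), []
    for bug_id in suggested_ids:
        target = remap_suggestion(bug_id, bugs)
        if target not in seen:
            seen.add(target)
            result.append(target)
    return result
```

With bugs 1 and 2 both blocking a keyword-tagged bug 100, `remap_suggestions([1, 2, 3], bugs)` collapses them into a single suggestion for bug 100 and leaves bug 3 alone.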

Pros:
* This should Just Work with OrangeFactor. All the related issues will now be coming into OrangeFactor with one common bug number and they'll appear in the right place in the list as one aggregated "failure".
* This is easy to implement on the Bugzilla side.

Cons:
* It still relies on somebody noticing the common underlying problem and marking the dependency where applicable.
* It's Yet Another Layer of Complexity on top of the existing infra when the long-term goal is to abstract a lot of this off bug summaries and such (my understanding at least).
* Unclear how easily this could be implemented in Treeherder. How much of the current workflow is baked into it and how deeply?
* The basic proposal doesn't do anything for dependencies marked after the fact (i.e. failures previously annotated as bug XXX will forever show up that way in OrangeFactor even if newer ones go under bug YYY).

Other considerations:
* Ideally the Bug Filer will allow the setting of dependencies at the time of bug creation so even newly-filed bugs can be correctly annotated from the start. Though this means we'll be filing a bunch of "dummy" bugs which serve no purpose other than to create a link between a given failure line and the "real" bug we care about. Need to think about how the Treeherder Robot handles those.
* We'd probably want bug XXX to be available for starring *somewhere*, maybe under the "Show / Hide more" menu. More generally, we should probably put some thought into how to present this in the Treeherder UI.

In summary, I think the proposal outlined here gives a reasonable path forward Right Now that can help mitigate the current pain point while not requiring a ton of eventually-obsolete work to get stood up. I'd love to hear feedback from other sheriffs, TH developers, and other stakeholders!
(In reply to Ryan VanderMeulen [:RyanVM] from comment #0)
> Since the days of TBPL, we've had recurring issues keeping track of
> "meta-failures" - i.e. widespread issues that have a large cumulative
> impact, but spread across a number of bugs, which understates the severity
> on OrangeFactor.
> 
> This happens because we rely on the bug summary to generate bug suggestions
> for intermittent failures, which quickly falls apart under some scenarios:
> * Too many failures with a given signature, hitting Treeherder's suggestion
> limit (i.e. [@ mozalloc_abort]).
> * Different failure modes with a common root cause (i.e. the Android
> mochitest-gl memory corruption issues we're currently hitting or the Windows
> plugin tests hangs or the current spate of Marionette e10s crashes) leaving
> insufficient room in the bug summary to encapsulate everything in a usable
> way.
> 
> Just in the past week, there have been multiple requests asking the sheriff
> team to manually star various failures under one main tracking bug, which is
> failure-prone and highly-likely to lead to inconsistent results. The path to
> consistent results is taking the path of least resistance - which means
> making Treeherder generate better bug suggestions in these scenarios.
> 
> My strawman proposal:
> * Add a new meta-orange bug keyword (feel free to bikeshed the name to your
> heart's content).
> * If intermittent-failure bug XXX is marked as blocking meta-orange bug YYY,
> have Treeherder suggest bug YYY in lieu of bug XXX.

I like the proposal.



> 
> Pros:
> * This should Just Work with OrangeFactor. All the related issues will now
> be coming into OrangeFactor with one common bug number and they'll appear in
> the right place in the list as one aggregated "failure".
> * This is easy to implement on the Bugzilla side.
> 
> Cons:
> * It still relies on somebody noticing the common underlying problem and
> marking the dependency where applicable.
> * It's Yet Another Layer of Complexity on top of the existing infra when the
> long-term goal is to abstract a lot of this off bug summaries and such (my
> understanding at least).
> * Unclear how easily this could be implemented in Treeherder. How much of
> the current workflow is baked into it and how deeply?
> * The basic proposal doesn't do anything for dependencies marked after the
> fact (i.e. failures previously annotated as bug XXX will forever show up
> that way in OrangeFactor even if newer ones go under bug YYY).
> 
> Other considerations:
> * Ideally the Bug Filer will allow the setting of dependencies at the time
> of bug creation so even newly-filed bugs can be correctly annotated from the
> start. Though this means we'll be filing a bunch of "dummy" bugs which serve
> no purpose other than to create a link between a given failure line and the
> "real" bug we care about. Need to think about how the Treeherder Robot
> handles those.

:Kwierso, this seems easy enough to do for the bugfiler. Is this something we can add now?

> * We'd probably want bug XXX to be available for starring *somewhere*, maybe
> under the "Show / Hide more" menu. More generally, we should probably put
> some thought into how to present this in the Treeherder UI.


Jonathan,

Can you add this to the Treeherder project agenda so we can see how much work this is and where it fits in our priorities?


> 
> In summary, I think the proposal outlined here gives a reasonable path
> forward Right Now that can help mitigate the current pain point while not
> requiring a ton of eventually-obsolete work to get stood up. I'd love to
> hear feedback from other sheriffs, TH developers, and other stakeholders!
Flags: needinfo?(jgriffin)
Flags: needinfo?(wkocher)
> :Kwierso, this seems easy enough to do for the bugfiler. Is this something we can add now?

I have a WIP patch to set bug dependencies from the filer. The main issue I'm hitting is that the "create" bug REST API doesn't accept all of the fields that creating bugs from the UI does, so anything outside of the list in http://bugzilla.readthedocs.io/en/latest/api/core/v1/bug.html#create-bug gets more complicated.

I guess it wouldn't be too horrible if the server-side component of the bugfiler could create the bug with whatever fields it knows will work, and then immediately make another API call to update the bug with the rest of the fields before passing the bug ID back to the UI side?
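
Roughly, the create-then-update flow could look like the sketch below. It assumes the Bugzilla REST endpoints `POST /rest/bug` (create) and `PUT /rest/bug/<id>` (update); the `CREATE_FIELDS` set is only an illustrative subset (the real list is the one in the docs linked above), and `send` is a hypothetical transport callable, not an existing bugfiler function:

```python
# Illustrative subset of fields assumed to be accepted by the "create" call.
CREATE_FIELDS = {"product", "component", "summary", "version", "keywords", "description"}


def plan_requests(fields):
    """Split the requested fields into a create payload and an update payload."""
    create = {k: v for k, v in fields.items() if k in CREATE_FIELDS}
    rest = {k: v for k, v in fields.items() if k not in CREATE_FIELDS}
    return create, rest


def file_bug(fields, send):
    """Create the bug, then immediately update it with the remaining fields.

    `send(method, path, payload)` is a hypothetical helper that performs the
    HTTP request and returns the decoded JSON response.
    """
    create, rest = plan_requests(fields)
    bug_id = send("POST", "/rest/bug", create)["id"]
    if rest:
        # Only issue the second call when there is something left to set.
        send("PUT", f"/rest/bug/{bug_id}", rest)
    return bug_id
```

The UI side would only ever see the final bug ID, so the two-step dance stays an implementation detail of the server-side component.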
Flags: needinfo?(wkocher)
Just to check I've understood correctly / to summarise comment 0 a bit:
* For a variety of reasons, sometimes the bug summary isn't long enough to hold all of the strings that are required for that bug to be suggested in all the cases that are desired.
* The current workaround is to have multiple bugs, which between them effectively provide a larger bug summary.
* However this results in the failure stats being split across multiple bug numbers, so when looking at eg OrangeFactor it's harder to tell what should be prioritised.
* As such, the proposal is to use bug dependencies and/or custom keyword/whiteboard/... annotations, so that Treeherder can fake the bug number reported to OrangeFactor, causing all of the failures from those bugs to be instead reported as one meta-bug number.

Could you give some example bugs, so I can see some specific cases?

We discussed this issue in the Treeherder meeting yesterday.

Between that and thinking about this some more today, I've thought of a few different ways this issue could be addressed:

0) The proposal above:
  * Pros:
    - Doesn't require Bugzilla code changes.
  * Cons:
    - May result in confusing UX since the bug suggestions vs OrangeFactor vs what the bugs say will differ. (Given Treeherder will have to lie to OrangeFactor.)
    - Requires Treeherder code changes, which will be thrown away once we move away from bugs being the canonical source of information.

1) Switching to a field other than the bug summary, which allows for more characters (like the crash signature field). To avoid having to adjust existing bugs, if this field is empty, the behaviour would be to fall back to the bug summary.
  * Pros:
    - Avoids having to create multiple bugs and then join them together, which is a hack at best.
    - People have wanted to do this for cleaner summaries in general.
    - Parity with crash signature field.
  * Cons:
    - Requires both Treeherder and Bugzilla code changes (though the Treeherder changes are simpler than for #0), which will be thrown away once we move away from bugs being the canonical source of information.

2) Reduce the need for long bug summaries in the first place, by making improvements to the log output and/or Treeherder bug suggestion logic. ie: for the `[@ mozalloc_abort]` example in comment 0 - that failure line shouldn't even be the one used for the bug summary. There should be a MOZ_ABORT: (or whatever) line prior that should be unique enough.
  * Pros:
    - Improvements will benefit both the current system and the new autoclassify/bugs-not-canonical-source world.
    - This can always be done in addition to the other options.
  * Cons:
    - Requires someone to look at the logs for the chronic cases and file some bugs against test suites/harnesses, but the set of people who can do this is larger than those required to implement/review the other options.

3) Make changes to OrangeFactor to group bugs together if they have a certain whiteboard string.
  * Pros:
    - Doesn't require any Bugzilla/Treeherder code changes (aka hacks).
    - Clearer UX, since the user can still see the individual bugs rather than just the fake meta-bug.
    - Can likely be implemented client-side pretty easily, since the API already returns the bug whiteboard values.
  * Cons:
    - OrangeFactor is EOL.

4) Use the existing OrangeFactor bugzilla quicksearch functionality to show the totals of these bugs. The quicksearch syntax could use component name / blocking bug number / summary substring / keyword value etc.
  * Pros:
    - No need to wait for a feature to be added - it already exists!
  * Cons:
    - People will need to either enter the quicksearch syntax themselves, or use eg bookmarked links on a Wiki page. (The OrangeFactor UI could have a few examples/tips added to make this easier, or could link to a Wiki page.)
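
To make option 3 concrete, here is a rough sketch of grouping per-bug failure counts by a whiteboard tag. It is written in Python for illustration even though the real thing would likely live client-side in OrangeFactor; the `[meta-failure:N]` tag syntax and the dict shape are assumptions, not an existing convention or API response:

```python
import re
from collections import defaultdict

# Hypothetical whiteboard tag syntax, e.g. "[meta-failure:1300355]".
GROUP_TAG = re.compile(r"\[meta-failure:(\d+)\]")


def group_counts(bugs):
    """Sum failure counts per group; untagged bugs are keyed by their own id."""
    totals = defaultdict(int)
    for bug in bugs:
        match = GROUP_TAG.search(bug.get("whiteboard", ""))
        key = int(match.group(1)) if match else bug["id"]
        totals[key] += bug["count"]
    return dict(totals)
```

The individual bugs stay untouched, so the UI can still list them underneath the aggregated total.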

I think regardless of any other decision, time should be spent on #2, since it's really the root cause, and will benefit:
* humans reading the raw logs
* humans reading the failure line summaries in Treeherder
* autoclassify / the bugs-not-canonical-source future world

After that, my recommendation would be for #3/#4. Any of the others require developer (and/or reviewer) time from people working on Treeherder features that will be replacing any code written here.

Thoughts? :-)
Flags: needinfo?(jgriffin)
The things that precipitated the current round of bitching at us for doing things wrong (besides "the existence of bugfiler", which made it easier to file another bug than to cram a summary with filenames until it fills up and then clone it to another one and cram that) were:
* Windows plugin hangs (some unsearchable combination of "ALL 330 seconds :plug-ins" because a bunch have been marked as duplicates though we still star as them, and a bunch of browser-chrome failures that are mostly filed in Firefox::General);
* the dependents of bug 1300355;
* every web-platform-tests timeout bug I filed in August unless I happened to file a real one amidst all the e10s debug ones;
* the utterly unsearchable bug 1285531, which is not only jemalloc crashes but also GC crashes, and not only webgl tests but also any reftest which runs in a chunk that has previously run webgl tests;
* bug 1294009, where I've just given up on making it starrable for anyone else because there's another few hundred possible filenames.
3 and 4 won't stop the complaints, because one of the complaints is "as component owner, I do not want to have 200 open bugs in my component when I have said that all 200 are a single issue."
And bug 1223198, which now affects a large proportion of Linux-based reftests that have migrated to Taskcluster. And bug 1051567 (see deps) for a spate of different-but-really-the-same Mn-e10s issues. And bug 1295977.
(In reply to Ed Morley [:emorley] from comment #3)
> for the `[@ mozalloc_abort]` example in comment 0 - that failure line
> shouldn't even be the one used for the bug summary. There should be a
> MOZ_ABORT: (or whatever) line prior that should be unique enough.

This is bug 1282522.
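
As a rough illustration of the idea quoted above (not of what bug 1282522 actually implements), a summary generator could prefer an earlier abort message over an overly generic crash signature. The `MOZ_ABORT:` marker is the placeholder from comment 3, and the set of "generic" signatures is an assumption:

```python
# Signatures considered too generic to be useful on their own (illustrative only).
GENERIC_SIGNATURES = {"mozalloc_abort"}

# Placeholder marker, per "MOZ_ABORT: (or whatever)" in comment 3.
ABORT_MARKER = "MOZ_ABORT:"


def pick_search_term(log_lines, crash_signature):
    """Prefer the most recent abort message when the signature is too generic."""
    if crash_signature in GENERIC_SIGNATURES:
        for line in reversed(log_lines):
            if ABORT_MARKER in line:
                # Keep only the marker and the message that follows it.
                return line[line.index(ABORT_MARKER):].strip()
    return f"[@ {crash_signature}]"
```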
David asked today what the status of this bug was.

A few items from comment 3 are still outstanding. Could someone:

1) Confirm the "check I've understood correctly" bullets are correct
2) Provide some more bug examples, if this is ongoing
3) Give their thoughts on my assessment of the possible solutions 

That said, I think this part of comment 3 is still relevant:

(In reply to Ed Morley [:emorley] from comment #3)
> I think regardless of any other decision, time should be spent on #2, since
> it's really the root cause, and will benefit:
> * humans reading the raw logs
> * humans reading the failure line summaries in Treeherder
> * autoclassify / the bugs-not-canonical-source future world

...and this is something that anyone can do in the meantime, not just the Treeherder team (especially now we're just 2 people).
Flags: needinfo?(wkocher)
Flags: needinfo?(ryanvm)
Meant to add:

The longer term fix here is to not track individual bugs at all, but to have a crash-stats-like model, where there are lots of failure signatures - and the top N then have bugs filed and linked to those signatures (in a one-to-many mapping, giving you effectively meta bugs).

The prerequisites for that work have partly been completed as part of the auto-classify feature, though there is more still to do (and lots of complications, like not all jobs supporting structured logs etc).
(In reply to Ed Morley [:emorley] from comment #8)
> 1) Confirm the "check I've understood correctly" bullets are correct
>> * For a variety of reasons, sometimes the bug summary isn't long enough to hold all of the strings that are required for that bug to be suggested in all the cases that are desired.

Yeah, I guess you could summarize it that way since an infinitely-long summary field would work around this problem.

>> * The current workaround is to have multiple bugs, to between them effectively give a larger bug summary.

Correct

>> * However this results in the failure stats being split across multiple bug numbers, so when looking at eg OrangeFactor it's harder to tell what should be prioritised.

Exactly

>> * As such, the proposal is to use bug dependencies and/or custom keyword/whiteboard/... annotations, so that Treeherder can fake the bug number reported to OrangeFactor, causing all of the failures from those bugs to be instead reported as one meta-bug number.

Correct

> 2) Provide some more bug examples, if this is ongoing

Bug 1300355 is a high-frequency example of this at the moment. It's also an interesting case where the same underlying problem manifests in a few different failure modes (timeouts, crashes, or drawing issues) across a wide variety of tests.

> 3) Give their thoughts on my assessment of the possible solutions 
> 
> That said, I think this part of comment 3 is still relevant:
> 
> (In reply to Ed Morley [:emorley] from comment #3)
> > I think regardless of any other decision, time should be spent on #2, since
> > it's really the root cause, and will benefit:
> > * humans reading the raw logs
> > * humans reading the failure line summaries in Treeherder
> > * autoclassify / the bugs-not-canonical-source future world
> 
> ...and this is something that anyone can do in the meantime, not just the
> Treeherder team (especially now we're just 2 people).

How would that handle bug 1300355 for example? I'm worried this is going to go down a rat-hole of edge cases. My proposal makes it a lot easier to substitute human judgment where warranted. OTOH, I'd love to see *any* progress at this point, so I'll take what I can get.
Flags: needinfo?(ryanvm)
I also don't see how we could have adjusted the reftest output to handle bug 1223198, as another example. One had to actually look at the failure screenshot to see that it was related to widget drawing.
Talos Windows 8 crashes may be an interesting example. We have 15 or so bugs open for the same issue - one for each subtest. These crashes do not produce valid minidumps, so all we can star against is "<test-name> | Found crashes after test run". I can imagine times/cases where we would want to track tp5o and tps crashes in separate bugs, but currently, all the crashes are the same, and it would be nice to track them all in one "* | Found crashes after test run" bug.

https://bugzilla.mozilla.org/buglist.cgi?quicksearch=Found%20crashes%20after%20test%20run&list_id=13525661
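
A wildcard summary like the one described above could be matched with something as small as this sketch, which uses Python's standard `fnmatch` module. Treating bug summaries as glob patterns is an assumption here, not current Treeherder behaviour:

```python
from fnmatch import fnmatchcase


def matches_any(failure_line, patterns):
    """Return the first glob-style pattern that matches the failure line, if any."""
    for pattern in patterns:
        if fnmatchcase(failure_line, pattern):
            return pattern
    return None


patterns = ["* | Found crashes after test run"]
print(matches_any("tp5o | Found crashes after test run", patterns))
print(matches_any("tps | Found crashes after test run", patterns))
```

One caveat: `fnmatch` also treats `?` and `[...]` as special, so summaries containing crash signatures such as `[@ foo]` would need escaping before being used as patterns.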
(In reply to Ryan VanderMeulen [:RyanVM] from comment #10)
> Bug 1300355 is a high-frequency example of this at the moment. It's also an
> interesting case where the same underlying problem manifests in a few
> different failure modes (timeouts, crashes, or drawing issues) across a wide
> variety of tests.
> ... 
> How would that handle bug 1300355 for example? I'm worried this is going to
> go down a rat-hole of edge cases. My proposal makes it a lot easier to
> substitute human judgment where warranted. OTOH, I'd love to see *any*
> progress at this point, so I'll take what I can get.

Wow, that's a lot of bugs in that dependency tree! I can see why this is painful.

My concern with any "link X bugs together after the fact" approach is that it still spams people with numerous bugs (as philor mentioned in comment 5). If the failure summary doesn't suggest the overall meta bug in the first place (which it can't, with the current test filename based matching), then sheriffs/users are going to take the easy way out and file a new bug using the bug filer, which requires manual fixing up after.

What's really needed is the ability to suggest the bug in the first place, even if it's not an exact match for the test name - ie: fuzzy matching, which is bug 1268484.
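
To give a flavour of what fuzzy matching means here, the sketch below ranks bug summaries by similarity to a failure line using the standard library's `difflib`. This is only an illustration of the concept; it is not the approach taken in bug 1268484, and the 0.6 threshold is an arbitrary assumption:

```python
from difflib import SequenceMatcher


def rank_bugs(failure_line, summaries, threshold=0.6):
    """Return bug ids whose summary is similar enough, best match first.

    `summaries` maps bug id -> bug summary.
    """
    scored = []
    for bug_id, summary in summaries.items():
        score = SequenceMatcher(None, failure_line.lower(), summary.lower()).ratio()
        if score >= threshold:
            scored.append((score, bug_id))
    return [bug_id for _, bug_id in sorted(scored, reverse=True)]
```

A bug filed for `test_b.html` would then still be suggested for the same failure message in `test_a.html`, which exact test-name matching cannot do.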

In the meantime, perhaps (1) from comment 3 is the best solution?

James, in a world where bugs are not the canonical source of information for intermittent failures, and we only file bugs for a subset of them, how do you see us mapping from the bugs to the failure signatures in Treeherder? Would a new "intermittent failure signature" field in Bugzilla (like the crash signature field used by crash-stats) be useful, or would we store the bug number in Treeherder instead? It's just that if it would be useful, then doing (1) from comment 3 would be more viable, since it wouldn't be wasted effort that gets thrown away later.
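
For reference, the fallback behaviour described in (1) from comment 3 amounts to something like the following sketch. The `failure_signature` field name is hypothetical (no such Bugzilla field exists yet), as is the one-signature-per-line format:

```python
def search_terms(bug, field="failure_signature"):
    """Return the strings used to match failures against this bug.

    Uses the dedicated signature field when it is set (one signature per
    line), and falls back to the bug summary when it is empty or missing.
    """
    raw = (bug.get(field) or "").strip()
    if raw:
        return [line.strip() for line in raw.splitlines() if line.strip()]
    return [bug["summary"]]
```

Existing bugs would keep working unchanged, since an empty field simply means "use the summary as before".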
Flags: needinfo?(james)
There may still be additional classes of failures we can special-case during error summary generation to help short term (before we have fuzzy matching).

For example, we already extract the crash signature and search for that independent of the test filename:
https://github.com/mozilla/treeherder/blob/ae74d14c51554772a61d194457d9e098ec30541f/treeherder/model/error_summary.py#L63-L67

...if there are other common attributes that can be extracted, let us know :-)
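
For anyone not familiar with that code, the idea is roughly as follows. This is a simplified approximation written for this comment, not a copy of the linked `error_summary.py`; the regex and the ` | `-separated line layout are assumptions:

```python
import re

# Approximation of a crash line such as:
#   PROCESS-CRASH | dom/test_foo.html | application crashed [@ mozalloc_abort]
CRASH_RE = re.compile(r"application crashed \[@ (.+)\]$")


def crash_signature(failure_line):
    """Return the crash signature from a failure line, or None."""
    match = CRASH_RE.search(failure_line.strip())
    return match.group(1) if match else None


def search_terms(failure_line):
    """Return the independent search terms for a failure line."""
    terms = []
    parts = [part.strip() for part in failure_line.split(" | ")]
    if len(parts) >= 2:
        terms.append(parts[1])  # assumed to be the test name
    signature = crash_signature(failure_line)
    if signature:
        terms.append(signature)  # searched separately from the test name
    return terms
```

Any other attribute that can be pulled out of the line reliably could be appended to `terms` in the same way.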
Summary: Come up with a better way to classify "meta-failures" on Treeherder → Improve the classification of intermittent failures that aren't associated with just one test
The way things work with the failure_line / classified_failure data (i.e. the "autoclassification" model in treeherder) is that we have a many to one mapping of failure lines to classified failures (and currently a 1:1 mapping of classified failures to bugs). So in theory it's possible to have multiple distinct signatures mapping to the same classified failure because we aren't using the bug summary as the "signature" any more, but all previous matches.

Whether or not this is working out well in practice I'm not yet sure.
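
As a toy illustration of those relationships (many failure lines to one classified failure, and currently one bug per classified failure), with made-up class and function names rather than the actual Treeherder models:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClassifiedFailure:
    """Hypothetical stand-in for a classified failure."""
    id: int
    bug_number: Optional[int] = None  # currently 1:1 with a bug
    matched_lines: list = field(default_factory=list)  # many lines -> one failure


def record_match(classified, failure_line):
    """Remember a failure line as a previous match for this classified failure."""
    classified.matched_lines.append(failure_line)


def bug_for_line(failure_line, classified_failures):
    """Find the bug via previous matches rather than via the bug summary."""
    for classified in classified_failures:
        if failure_line in classified.matched_lines:
            return classified.bug_number
    return None
```

Because the lookup goes through `matched_lines`, lines that look nothing like each other can still resolve to the same bug.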
Flags: needinfo?(james)
(In reply to James Graham [:jgraham] from comment #15)
> So in theory it's possible to have multiple
> distinct signatures mapping to the same classified failure because we aren't
> using the bug summary as the "signature" any more, but all previous matches.

I wasn't meaning the mapping of failure lines to classified failures. Instead: In a future world, how do we see the mapping of classified failures to bugs taking place?
Component: Treeherder → Treeherder: Log Parsing & Classification
This bug came up again in bug 1392106.

The tl;dr here for now is:
* the future world of "bugs aren't the canonical source of truth" plus fuzzy matching is what we should be aiming for long term
* OrangeFactor being deprecated (bug 1367362 / bug 1367364) makes possible medium term solutions here more viable
* regardless of that, garbage in means garbage out -> the log output should be improved where possible (and yes it's a long tail, but pretty unavoidable)
Another idea that I think is worth considering here is a way to _annotate bugs_ as suggestions for certain categories of failures. For example, suppose bug 1392106 could be annotated so that Treeherder suggests it for all Windows 7 reftest failures within a certain range of pixel error (e.g., between 20 and 500 differing pixels), with a note shown in Treeherder saying that the suggestion may apply if the screenshot difference is a letter or small number of letters that go missing. I'd think that could lead to the correct bug being starred. Put another way, this suggestion really has two parts: (a) an ability to add information that appears as part of a suggestion other than the bug summary, i.e., instructions for when the bug is the right choice, and (b) a syntax for causing a bug to match failures based on something other than string matching on errors in the log, or for only a subset of the cases where the string matches.
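
A sketch of what parts (a) and (b) might look like together, assuming the annotation has already been parsed into a rule object. The `SuggestionRule` fields, the failure dict keys and the platform/suite strings are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class SuggestionRule:
    """Hypothetical parsed form of a bug annotation."""
    bug_id: int
    platform: str
    suite: str
    min_diff_pixels: int
    max_diff_pixels: int
    note: str  # part (a): guidance shown alongside the suggestion

    def applies_to(self, failure):
        # Part (b): match on job metadata instead of log strings.
        return (
            failure["platform"] == self.platform
            and failure["suite"] == self.suite
            and self.min_diff_pixels <= failure["diff_pixels"] <= self.max_diff_pixels
        )


def suggest(failure, rules):
    """Return (bug_id, note) pairs for every rule that applies to the failure."""
    return [(rule.bug_id, rule.note) for rule in rules if rule.applies_to(failure)]


rule = SuggestionRule(
    bug_id=1392106,
    platform="windows7",
    suite="reftest",
    min_diff_pixels=20,
    max_diff_pixels=500,
    note="May apply if the screenshot difference is a missing letter or a few letters.",
)
print(suggest({"platform": "windows7", "suite": "reftest", "diff_pixels": 120}, [rule]))
```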

Triaging: This bug is too old and too general. Much of what is mentioned is no longer in use.

Status: NEW → RESOLVED
Closed: 6 months ago
Resolution: --- → INVALID

It seems like the problems described here are largely still present, though. Or has this improved in ways that I'm not aware of?

Status: RESOLVED → REOPENED
Resolution: INVALID → ---