Open Bug 1683233 Opened 4 years ago Updated 3 years ago

failed tasks should go away or change color if they've been retriggered

Categories

(Tree Management :: Treeherder, enhancement)

Tracking

(Not tracked)

People

(Reporter: justdave, Unassigned)

Details

Example: https://treeherder.mozilla.org/jobs?repo=comm-beta&revision=a429d83e5e173f980ebb1ef2172b5793d1aa8666

There's a lot of red in there and it looks scary. But almost every one of those red tasks got retriggered and subsequently passed. While it's nice to know that it failed initially and had to be re-run, it's really hard to tell which ones passed without reading really closely "does this red one have a green one with the same task code right after it?"

If it changed color or something, then I'd know "this isn't the final run of this one, so if I only care about the end state right now, I can ignore it".
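The check described above (does a red task have a later green task with the same task code?) can be sketched roughly as follows. This is only an illustration of the idea, not Treeherder's actual data model: the `Job` class, its fields, and the result strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    symbol: str     # hypothetical task code shown in the UI, e.g. "X1"
    result: str     # hypothetical result string: "success" or "failed"
    run_order: int  # hypothetical position of this run in time


def superseded_failures(jobs):
    """Return failed jobs that have a later successful run with the same symbol."""
    latest_success = {}
    for job in jobs:
        if job.result == "success":
            previous = latest_success.get(job.symbol, -1)
            latest_success[job.symbol] = max(previous, job.run_order)
    return [
        job
        for job in jobs
        if job.result != "success"
        and latest_success.get(job.symbol, -1) > job.run_order
    ]


example_jobs = [
    Job("X1", "failed", 0),
    Job("X1", "success", 1),   # retrigger passed, so the first X1 is superseded
    Job("X2", "failed", 0),
    Job("X2", "failed", 1),    # retrigger failed too, so both stay red
]
ignorable = superseded_failures(example_jobs)
```

A UI could render the jobs returned by such a function in a different color while still showing that they failed initially.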

Thanks for filing the bug. I wonder if part of this confusion has to do with how retriggers are determined at the Taskcluster level (should the same task/job be retried, or should it be rerun as an entirely new task with a new task ID?). I know this UI is confusing, but we've been focusing primarily on making UI improvements to Push Health, with the goal of making it the developer view and leaving Treeherder as primarily the sheriffs' view. The way jobs are displayed in Push Health is a bit different, so I'd be curious to know whether you still feel the same level of confusion regarding retriggered jobs and color usage with Push Health. If you have the time to take a look, this would be useful feedback (it's still in progress though, with an upcoming sprint to continue work on it).

I can't see anything at all on Push Health, I get an error "There was a problem retrieving push metrics:" in a red box, and the rest of the page is blank.

I'm primarily using this in a sheriff role (I'm a release engineer), so being able to see which tests failed (everyone's, not just my own) is important.

I found a dropdown for switching from "try" to "all" and with it on "all" I can see a few of my pushes to comm-esr78. It looks really pretty. As I said above, I still want my initial request for Treeherder because I'm a release engineer and need to see it all... :-) But if I were looking at this as a developer this indeed makes it quick and easy to understand. Except that I click on the Tests to expand it and just get an error "There was a problem retrieving the data. Please try again in a minute." I assume that's part of the still in progress bit. :-)

Thanks for the feedback. Yes, I still have some work to do there. :) Regarding the error with the expanded tests, did that happen with just one push or all of the pushes?

Regarding your original request for the jobs view, I'm wondering if some basic formatting changes would help - more obviously grouping the same tasks together rather than them grouped but all lumped together? I'd be hesitant to change the colors because I can see where that might be an issue with regard to tracking intermittent failures. I think getting input from the code sheriffs here would also be helpful.

Sheriffs, do you have any opinions about changing how tasks are displayed (colors or formatting)?

Flags: needinfo?(sheriffs)
Flags: needinfo?(aryx.bugmail)

To check whether every task reran successfully, use Taskcluster's task group view. The task ID to enter is that of the cron task (for releases) or the decision task (for tasks scheduled on push). Be warned that the page can slow down your browser. The Firefox release managers usually open it, paste the task ID, wait for the page to load the info about all tasks, and then close it.

For tasks not part of the release graphs, Push Health or the following workflow from sheriffs could help:

  1. Classify the tasks which are known intermittents.
  2. If in doubt if an issue is caused by the current push, retrigger the task and leave it unclassified.
  3. When all tasks for the push are complete, check whether any unclassified failure also failed its retrigger.
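
Step 3 of the workflow above could be expressed roughly like this. It is a sketch under assumptions: the job dictionaries, their keys (`label`, `result`, `classified`), and the result strings are hypothetical and do not reflect Treeherder's API, and the jobs are assumed to be listed in the order they ran.

```python
from collections import defaultdict


def failures_confirmed_by_retrigger(jobs):
    """Return labels whose first unclassified failure was retriggered
    and never followed by a successful run."""
    runs_by_label = defaultdict(list)
    for job in jobs:  # jobs are assumed to be in chronological order
        runs_by_label[job["label"]].append(job)

    confirmed = []
    for label, runs in runs_by_label.items():
        for index, run in enumerate(runs):
            if run["result"] == "success" or run["classified"]:
                continue  # passed, or already classified as a known intermittent
            later_runs = runs[index + 1:]
            # Only flag the label if a retrigger exists and none of them passed.
            if later_runs and all(r["result"] != "success" for r in later_runs):
                confirmed.append(label)
            break  # only the first unclassified failure per label is examined
    return confirmed


push_jobs = [
    {"label": "mochitest-1", "result": "failed", "classified": True},
    {"label": "xpcshell-2", "result": "failed", "classified": False},
    {"label": "xpcshell-2", "result": "failed", "classified": False},
    {"label": "reftest-3", "result": "failed", "classified": False},
    {"label": "reftest-3", "result": "success", "classified": False},
    {"label": "build", "result": "failed", "classified": False},
]
suspects = failures_confirmed_by_retrigger(push_jobs)
```

In this example only `xpcshell-2` is reported: `mochitest-1` was classified, `reftest-3` passed on retrigger, and `build` has not been retriggered yet.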

Regarding a view which hides failed tasks if there is a successful retrigger/rerun: that implies every failed task should be run again. The two procedures above should help with that, and the colors could remain as they are.

> more obviously grouping the same tasks together rather than them grouped but all lumped together?

Sounds like bug 1421365.

Flags: needinfo?(sheriffs)
Flags: needinfo?(aryx.bugmail)