Closed Bug 1197258 Opened 10 years ago Closed 10 years ago

Job collapsing makes it difficult to tell at a glance whether a failure is intermittent or not

Categories

(Tree Management :: Treeherder, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ahal, Assigned: camd)

Attachments

(1 file)

47 bytes, text/x-github-pull-request (emorley: review+)
On try there is an auto-retrigger service that retriggers any failures. This makes it easy for developers to see whether failures are intermittent or permanent. But with collapsing, green retriggers get hidden, so it's no longer obvious at a glance. This isn't a huge deal, because you can still expand the job to get this information; it's more of a nice-to-have. Also, when we have auto-starring, this won't matter as much either, so maybe it isn't worth worrying about. If it's easy, one way to fix this is to also not collapse jobs of the same type when there is a failure, e.g. uncollapse all mochitest-9 jobs even when only one of them failed.
Blocks: 1163064
Priority: -- → P2
FWIW, this issue recently confused me into thinking my retriggers weren't running. It'd be much nicer if retriggers weren't collapsed.
Developers often check whether their orange jobs on try are intermittent by re-triggering them. If the retriggered job comes back green, the developer cannot see that it was indeed an intermittent issue, since the green job gets grouped under the "+Number" symbol. Can we please always show re-triggered jobs?

<seth> KWierso: i can't seem to retrigger jobs anymore
* tessarakt (jens@moz-u3gij6.dyn.telefonica.de) has joined #developers
<KWierso> seth: hrm
<seth> it says "retrigger request sent" but it never confirms it
<seth> and the retriggered job never runs
<KWierso> adusca/armenzg: is ^ you?
<pulsebot> Check-in: https://hg.mozilla.org/integration/fx-team/rev/375a22f7bc4e - David Critchley - Bug 1214590 - Remove Loop default Room name, r=dmose
* paolo_ has quit (Quit: )
<seth> KWierso: can you retrigger the linux x64 debug bc7 failure on this try job? https://treeherder.mozilla.org/#/jobs?repo=try&revision=ed1d9e36a2e7
<armenzg> seth, are you retriggering individual jobs on a try push?
<seth> armenzg: yeah
* aselagea|buildduty is now known as aselagea|afk
<seth> that one ^
<armenzg> KWierso, pulse_actions does *not* handle individual retriggers
<armenzg> only sheriffs' backfill and so requests
<armenzg> seth, is this the job you retriggered? https://secure.pub.build.mozilla.org/buildapi/self-serve/try/rev/a31c6f53f11e
<armenzg> (FYI you can get to this page from the Treeherder page)
* Honza has quit (Connection closed)
<armenzg> 3 minutes ago
<seth> armenzg: hmm, yes.
* Honza (chatzilla@moz-i5l.7ku.62.176.IP) has joined #developers
<seth> armenzg: those runs aren't showing on treeherder
<armenzg> seth, remember that there is a lag
<seth> i retriggered them yesterday!
* bdahl has quit (Ping timeout: 121 seconds)
* vporof is now known as victorporof
<seth> if the lag's that bad, there's a problem
<KWierso> seth: all of the retrigger requests seem to be going through :\
<armenzg> yes, you're right
* fracting (fracting@moz-etr.siu.9.119.IP) has joined #developers
<seth> oh jeez
<seth> i'm sorry guys
<KWierso> seth: I'm seeing a bunch https://treeherder.mozilla.org/#/jobs?repo=try&revision=ed1d9e36a2e7&group_state=expanded&filter-searchStr=bc7
* stephend|mtg is now known as stephend|lunch
<seth> actually i just realized that because all of the retriggers succeeded, they are hidden behind the "+30" link
<armenzg> KWierso, he's saying it's been pending since yesterday
<gcp> cpeterson: ping
<armenzg> seth, yes, that is a UX problem
<seth> no, they actually ran, i just do not know how to use treeherder in this brave new world
Cameron, would you mind taking a look at this at some point, since you're more familiar with this code? Andrew's suggestion of ~"if there are a mixture of results for the same job type, show them all separately" seems sensible I think? (ie: if there are some green some orange, don't collapse them. Similarly, if there are some pending or running and some completed, don't collapse them etc)
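The "mixture of results" rule suggested here could look roughly like the sketch below. It is only an illustration of the idea, not Treeherder's actual code: the job dictionaries and their `type`, `state` and `result` keys are assumptions, as is the simplification that only completed green jobs are collapsed today.

```python
from collections import defaultdict


def visible_jobs(jobs):
    """Return the jobs that would be drawn individually (not collapsed).

    Hypothetical model: each job is a dict with "type", "state" and
    "result" keys. These names are assumptions made for illustration.
    """
    by_type = defaultdict(list)
    for job in jobs:
        by_type[job["type"]].append(job)

    visible = []
    for same_type in by_type.values():
        # More than one distinct (state, result) pair means a mixture,
        # e.g. green + orange, or pending + completed.
        outcomes = {(job["state"], job["result"]) for job in same_type}
        mixed = len(outcomes) > 1
        for job in same_type:
            is_green = job["state"] == "completed" and job["result"] == "success"
            # Non-green jobs are always shown; green ones only when mixed.
            if mixed or not is_green:
                visible.append(job)
    return visible
```

Note that under this rule a job type whose runs are all green would still be collapsed, which is the gap pointed out in the following comments.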
Flags: needinfo?(cdawson)
(In reply to Ed Morley [:emorley] from comment #4)
> Cameron, would you mind taking a look at this at some point, since you're
> more familiar with this code?
>
> Andrew's suggestion of ~"if there are a mixture of results for the same job
> type, show them all separately" seems sensible I think? (ie: if there are
> some green some orange, don't collapse them. Similarly, if there are some
> pending or running and some completed, don't collapse them etc)

This wouldn't address comments 2 and 3. We need to have a way to show that multiple jobs with the same symbol are present (even if they have the same status).
I don't think it's important to show re-triggers unless one of the re-triggers (or the original job) was a failure. My understanding is that this is what comment 2 and 3 are asking for.
(In reply to Andrew Halberstadt [:ahal] from comment #6)
> I don't think it's important to show re-triggers unless one of the
> re-triggers (or the original job) was a failure. My understanding is that
> this is what comment 2 and 3 are asking for.

The conversation in comment 3 indicates that all the retriggers were successful, but it was not obvious from the UI they had run at all.
I'll take another look at this tomorrow, too. But keep in mind that you can globally disable the collapsing by clicking the "(+)" button. This adds a query param to the URL, so it will stay that way even after refresh. Not sure that's sufficient.
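For sharing a link that already has collapsing disabled, the query parameter can be added by hand. The sketch below assumes the parameter in question is `group_state=expanded`, which is what appears in the expanded URL pasted in comment 3; that the "(+)" button sets exactly this parameter is an assumption. The helper name `with_expanded_groups` is made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_expanded_groups(url):
    """Return `url` with group_state=expanded set in its parameters.

    Assumption: Treeherder keeps its parameters inside the URL fragment
    ("#/jobs?repo=..."), as in the links quoted in this bug.
    """
    parts = urlsplit(url)
    path, _, query = parts.fragment.partition("?")
    params = dict(parse_qsl(query))
    params["group_state"] = "expanded"
    fragment = path + "?" + urlencode(params)
    return urlunsplit(parts._replace(fragment=fragment))
```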
Personally I use the + a lot, but that doesn't mean most people know about it. Currently we have trigger-bot, which retriggers failures on try automatically, so a developer can come back a few hours later and determine whether there is a real failure. Ideally they will see 2 orange/red identical jobs and know it is a real failure, but if they can't tell whether it ran green a second time, they have to do more work. We want to make this a single glance. Of course, dealing with many retriggers starts to make it look messy. I am not sure of the best approach here. Probably, for every non-green job (which we display by default), display all other matching jobs outside of the collapse.
(In reply to Chris Manchester [:chmanchester] from comment #7)
> The conversation in comment 3 indicates that all the retriggers were
> successful, but it was not obvious from the UI they had run at all.

But I think one of the reasons for that was that they didn't show up whilst pending/running. If people could at least see that their retrigger had had an effect, then they'd be more inclined to go looking for the completed job, were it still succeeding. Alternatively, perhaps the "30" needs to be "show more" or ellipses or similar - maybe it's just not the right choice of label.
We had some discussion about this here at the work-week. I think the approach we'll take is that we won't collapse any jobs that are "the same" other than result/state. The exception will be "retries" (as opposed to "retriggers"). Retries will still collapse like they do now.
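One possible reading of this rule, as a rough sketch rather than the planned implementation: the `platform`, `symbol` and `result` keys are hypothetical, and treating a job as a retry when its result is `"retry"` is an assumption made for the example.

```python
def identity(job):
    # What makes two jobs "the same" apart from result/state.
    # The choice of fields here is an assumption.
    return (job["platform"], job["symbol"])


def may_collapse(job, group_jobs):
    """Return True if `job` is still eligible for collapsing."""
    # Retries keep collapsing exactly as before.
    if job["result"] == "retry":
        return True
    # Any other job with a twin (same identity, not a retry) is treated
    # as part of a retrigger set and is never collapsed.
    twins = [
        other
        for other in group_jobs
        if other is not job
        and other["result"] != "retry"
        and identity(other) == identity(job)
    ]
    return not twins
```

A job that returns True here would still go through whatever collapsing rules already exist; the function only decides whether it is exempt.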
Flags: needinfo?(cdawson)
Assignee: nobody → cdawson
Attached file PR
Sorry to keep using you for all my reviews. At least this one is short and easy, I think.
Attachment #8685685 - Flags: review?(emorley)
This solution will prevent duplicate job types within a group from being collapsed into counts.
Status: NEW → ASSIGNED
Comment on attachment 8685685 [details] [review]
PR

r+; have left a comment :-)
Attachment #8685685 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/ab765405bc88a0d83cb5d0eb829c7b8576787348
Bug 1197258 - Don't collapse retriggered jobs

If the same job symbol is detected within the same group, don’t collapse them to counts. They are almost surely retriggers and will be shown to make intermittent failures more obvious.
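The behaviour described in the commit message can be pictured with the sketch below. It is not the code from the pull request: the `symbol` and `result` keys, the `render_group` name, and the simplification that only successful jobs are folded into the "+N" count are all assumptions for illustration.

```python
from collections import Counter


def render_group(jobs):
    """Return (symbols shown individually, size of the "+N" count)."""
    symbol_counts = Counter(job["symbol"] for job in jobs)
    shown, collapsed = [], 0
    for job in jobs:
        # The same symbol seen more than once in a group is almost
        # surely a retrigger, so it is kept out of the count.
        duplicated = symbol_counts[job["symbol"]] > 1
        failed = job["result"] != "success"
        if duplicated or failed:
            shown.append(job["symbol"])
        else:
            collapsed += 1
    return shown, collapsed
```

With this rule, the all-green bc7 retriggers from comment 3 would each be drawn instead of disappearing into the "+30" count.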
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED