Open Bug 1700970 Opened 3 years ago Updated 2 years ago

Perfherder compare Retrigger Jobs button no longer retriggers jobs

Categories

(Tree Management :: Perfherder, defect, P3)

Tracking

(Not tracked)

People

(Reporter: acreskey, Unassigned)

Details

From a perfherder compare view, e.g. here, the "Retrigger jobs" button no longer retriggers jobs.

When inspecting the rt task in Treeherder, I see this error:
[task 2021-03-25T14:08:57.914Z] No need to rerun perftest-windows-perfstats: state 'completed' not in ('exception', 'failed')!

Similarly, from the Treeherder view, selecting 'r' for retrigger on a job fails, as does "Custom Action" retrigger.

This was discussed in #perftest on Matrix:

sparky I've seen this optimization before, but it seems to be affecting more of our tools now
ahal maybe it's coming from taskcluster?
sparky ahal: I think that's coming from the m-c side: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py#225
ahal weird, I've never run into that before
ahal but it's apparently been there since 2019
sparky huh that's odd, I don't remember seeing it before 2020
ahal I think we can probably remove that check
sparky ahal: That would be great, want me to make a patch for that?
ahal actually it's not that straightforward
ahal "rerun" is different from "retrigger"
ahal (I think rerun is usually when a task has an exception (turns blue) and we run it again)
ahal so in that context the guard makes sense
ahal but the "retrigger-multiple" action calls it for some subset of tasks:
ahal https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py#312
ahal the regular "retrigger" action does not call it (so I guess it only happens when that retrigger-multiple action gets used)
ahal (I'm unsure what buttons in the treeherder/perfherder UI call what actions)
sparky Oh I see, the treeherder UI uses the retrigger-multiple one from what I saw
ahal maybe we just need to disable that check when calling it from retrigger-multiple
ahal also looks like adding a retrigger: true value to your tasks would fix it: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py#31
ahal I don't know the pros / cons of using retrigger vs rerun though
ahal ¯_(ツ)_/¯
sparky We generally only need the retrigger one, I haven't seen a use case for the rerun one. Maybe I'll add some code in the action then to prevent reruns on performance tests so we don't have this issue
ahal I'd prefer to keep logic for specific types of tasks out of the actions if at all possible
ahal in this case I think we could just ignore that check wholesale if it's coming from retrigger-multiple
ahal (if someone retriggers any type of task manually, they likely don't want to hit that error)
ahal i.e., you can add a strict=True flag to that _rerun method and set it to False from within retrigger-multiple
sclements There hasn't been a very recent change to retriggers specifically, but we did recently upgrade the taskcluster package. It didn't sound like there'd be any breaking changes.
sclements I have some meetings but I can look into this more after.
sparky sclements: I don't think this is a treeherder issue, it seems to be related to some taskcluster code that we'd need to fix
sclements ok
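
For illustration, here is a minimal sketch of the `strict` flag idea from the chat above. It is not the code in retrigger.py: the function name `rerun_task`, its parameters, and the `do_rerun` callback are hypothetical stand-ins, and the real `_rerun` method may look quite different. The default stays strict, so only a caller that opts out (such as retrigger-multiple, under this proposal) would skip the state check.

```python
import logging

logger = logging.getLogger(__name__)

# States for which a rerun is allowed when the check is strict
# (taken from the error message quoted in this bug).
RERUNNABLE_STATES = ("exception", "failed")


def rerun_task(task_id, label, state, strict=True, do_rerun=None):
    """Hypothetical helper: request a rerun unless the strict guard blocks it.

    Returns True if a rerun was requested, False if it was skipped.
    `do_rerun` is an assumed callback standing in for the real queue call.
    """
    if strict and state not in RERUNNABLE_STATES:
        # Mirrors the message seen in the task log.
        logger.warning(
            "No need to rerun %s: state %r not in %r!",
            label,
            state,
            RERUNNABLE_STATES,
        )
        return False
    if do_rerun is not None:
        do_rerun(task_id)
    return True


if __name__ == "__main__":
    requested = []
    # Default (strict) call: a completed task is skipped.
    rerun_task("t1", "perftest-windows-perfstats", "completed",
               do_rerun=requested.append)
    # Non-strict call, as a manual retrigger path might use.
    rerun_task("t1", "perftest-windows-perfstats", "completed",
               strict=False, do_rerun=requested.append)
    print(requested)
```

Keeping `strict=True` as the default means existing callers keep the guard, and only the call site that opts out changes behaviour.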

:acreskey is this only affecting mozperftest or are other frameworks impacted?

Flags: needinfo?(acreskey)

Rerun is explicitly for release tasks, and increments the run # in a given taskId (task t run 0 -> rerun -> task t run 1). Because rerunning green tasks can completely break release graphs and cause multiple days of manual cleanup to get nightlies and releases green again, we have put reruns for non-failed and non-exception tasks behind a force option.

Retrigger is generally for non-release tasks. It will copy the task definition, bump timestamps, and run (task t1 run 0 -> retrigger -> task t2 run 0).

We probably want retriggers here. Please be careful about changing or removing checks around rerun, though, as this can reenable footguns around release tasks.
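
To make the rerun/retrigger distinction above concrete, here is a toy model of the identifiers only. The `TaskRun` class and both functions are hypothetical and do not call Taskcluster; the new task ID is passed in by hand, whereas a real retrigger would get a freshly generated one.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TaskRun:
    """Toy identifier pair: which task, and which run of that task."""
    task_id: str
    run_id: int


def rerun(task):
    """Rerun: same taskId, run number incremented (t run 0 -> t run 1)."""
    return replace(task, run_id=task.run_id + 1)


def retrigger(task, new_task_id):
    """Retrigger: a copy of the task under a new taskId, starting at run 0."""
    return TaskRun(task_id=new_task_id, run_id=0)


if __name__ == "__main__":
    original = TaskRun("t1", 0)
    print(rerun(original))            # same task, next run
    print(retrigger(original, "t2"))  # new task, first run
```

In this model a rerun changes the state of the existing task, while a retrigger leaves the original task untouched.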

(In reply to Dave Hunt [:davehunt] [he/him] ⌚GMT from comment #1)

:acreskey is this only affecting mozperftest or are other frameworks impacted?

As far as I can tell, this is only affecting mozperftest.
I did a test of raptor-browsertime and the retrigger appears to work there:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=ccefc8c0d22b7f1f308147f5ca1df525a908ce82&selectedTaskRun=CGXvq2AATzqK5iKfFvLEJA.0

Flags: needinfo?(acreskey)
No longer blocks: 1754831
Priority: -- → P3