Prevent retriggering release tasks from treeherder

RESOLVED FIXED in Firefox 68

(Reporter: tomprince, Assigned: tomprince)

Firefox Tracking Flags: firefox-esr60 fixed, firefox67 fixed, firefox68 fixed
Version: 3 Branch
Target Milestone: mozilla68
(5 attachments)

Description

No description provided.

Summary: Prevent retriggering release tasks. → Prevent retriggering release tasks from treeherder

Comment 1

Last year
During 59.0.2, both ryan and aki reran the same repackage task, then ryan cancelled his (the latest run). aki had to cancel the repackage-signing task, then rerun both in order. This could have forced us to do a build2 if it hadn't been caught in time. We should consider moving off treeherder and into a releaseduty dashboard, and/or going with stricter handoffs.

Given that treeherder doesn't show re-runs clearly, it was suggested either removing the release tasks from treeherder, or at least removing the ability to retrigger them via treeherder. This can probably be done by specifying the action context (https://firefox-source-docs.mozilla.org/taskcluster/taskcluster/actions.html#setting-the-action-context)
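For context, the action-context mechanism referenced above works roughly like this: each action in actions.json carries a context, a list of tag-sets, and the action is offered for a task only if the task's tags match at least one set. A minimal sketch in Python (the tag values and the release kind name are illustrative, not the real gecko kinds):

```python
# Sketch of how a UI decides whether an action (e.g. retrigger) applies
# to a task, per the Taskcluster actions spec: an action's "context" is a
# list of tag-sets, and the action is offered for a task only if the
# task's tags are a superset of at least one tag-set.

def action_applies(context, task_tags):
    """Return True if the action's context matches the task's tags."""
    return any(
        all(task_tags.get(key) == value for key, value in tag_set.items())
        for tag_set in context
    )

# A hypothetical retrigger action restricted to ordinary CI kinds only,
# so release-promotion tasks never match.
retrigger_context = [{"kind": "build"}, {"kind": "test"}]

ci_task = {"kind": "build", "createdForUser": "someone@example.com"}
release_task = {"kind": "release-repackage-signing"}

print(action_applies(retrigger_context, ci_task))       # True
print(action_applies(retrigger_context, release_task))  # False
```

An empty context would hide the action from every task in the graph, which is one way to drop retrigger entirely from a release graph's actions.json.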

Comment 2

@ryanvm, @dburns - any concerns with removing mutable actions from release-specific tasks within Treeherder? Or even removing the tasks themselves from view? Just removing mutable actions (retrigger) would likely be the preference for us, so that we keep some release transparency on TH.
Flags: needinfo?(ryanvm)
Flags: needinfo?(dburns)
Is anybody going to notice when these jobs fail on Nightlies if they aren't visible on TH?
Flags: needinfo?(ryanvm)

Comment 4

I think removing them from visibility on Treeherder would be a mistake. It also feels to me like comment 1 is throwing the baby out with the bath water. Ultimately, I've seen two ways the current situation can bite us:
1) Retriggering a job when a rerun was intended. This obviously makes for a lot of downstream CoT issues and should be much more difficult to do than it currently is.
2) Rerunning a job that had already been rerun and had since completed successfully (the situation I hit with Aki in comment 1).

I would suggest making rerun a no-op if an existing task ID has already completed successfully. I can't see any good reason to rerun a task with the same ID except under failure conditions. That would have avoided situation #2. Just finishing with a status of "completed" would suffice?

Situation #1 could be avoided by changing the default action from retrigger to rerun on jobs where there is sensitivity to CoT. Note that this would be a beneficial change in general since there are CI jobs which depend on signed builds (xpcshell) which currently aren't run on pushes where a failed build was retriggered because the subsequent signing step failed.

In other words, I think a lot of these issues can be resolved by having some safer/saner defaults rather than making things more restrictive and less transparent. Not to mention it'd be signing up the releaseduty folks to monitor more than they currently are since as I mentioned previously, I'm pretty sure than the sheriffs are currently the only people who'll notice when a Nightly release promotion job fails.

Comment 5

There are a couple of issues here:

Retriggers:
Preventing retriggers by disabling the retrigger action on [certain?] relpro action tasks would possibly help avoid the sheriffs-retriggering-jobs scenario. It also wouldn't affect nightlies until we move to shippable-builds with nightly promotion, and we could potentially send a flag to keep retriggers available if wanted (probably not; I think we want reruns over retriggers here as well). If retriggers are disabled, that removes one issue with release tasks on treeherder.

Reruns:
This could be similar to the above: nightly promotion could show on treeherder, and release promotion could be on its own dashboard.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #4)
> I would suggest making rerun a no-op if an existing task ID has already
> completed successfully. I can't see any good reason to rerun a task with the
> same ID except under failure conditions. That would have avoided situation
> #2. Just finishing with a status of "completed" would suffice?

Sometimes we want to rerun when it's completed. Possibly a --force flag? Or maybe we need to determine when a rerun-on-completion is useful or needed, and weigh those cases against the danger of rerunning a completed task when that's not the desired behavior.

> Situation #1 could be avoided by changing the default action from retrigger
> to rerun on jobs where there is sensitivity to CoT. Note that this would be
> a beneficial change in general since there are CI jobs which depend on
> signed builds (xpcshell) which currently aren't run on pushes where a failed
> build was retriggered because the subsequent signing step failed.

I think the above disable-retriggers-on-[certain?]-relpro-graphs solves most of this. There may be an issue if we accidentally retrigger the action task itself.

> In other words, I think a lot of these issues can be resolved by having some
> safer/saner defaults rather than making things more restrictive and less
> transparent. Not to mention it'd be signing up the releaseduty folks to
> monitor more than they currently are since as I mentioned previously, I'm
> pretty sure than the sheriffs are currently the only people who'll notice
> when a Nightly release promotion job fails.

I think this makes sense. We could also address the issue of sheriffs being the only ones who notice when a nightly release promotion task fails.
(In reply to Aki Sasaki [:aki] from comment #5)
> Retriggers:
> Preventing retriggers by disabling the retrigger action on [certain?] relpro
> action tasks would possibly help avoid the sheriffs-retriggering-jobs
> scenario. It also wouldn't affect nightlies until we move to
> shippable-builds with nightly promotion, and we could potentially send a
> flag to keep retriggers available if wanted (probably not; I think we want
> reruns over retriggers here as well). If retriggers are disabled, that
> removes one issue with release tasks on treeherder.

(thanks to Tom for the idea; it's a good one.)

Comment 7

(In reply to Aki Sasaki [:aki] from comment #5)
> Sometimes we want to rerun when it's completed. Possibly a --force ? Or
> maybe we need to determine when a rerun-on-completion is useful or needed,
> and weigh those against the danger of rerunning a completed task when that's
> not the desired behavior.

Having a --force option makes sense for those situations. From my POV, there's no issue with someone making an informed decision to do the non-standard thing, but that would at least require them to think about it first. And in my case, seeing that it had already run successfully would have stopped me from going any further with things and avoided scenario #1.

> I think the above disable-retriggers-on-[certain?]-relpro-graphs solves most
> of this. There may be an issue if we accidentally retrigger the action task
> itself.

I'm a bit confused about how this proposal works with regular builds in CI. Does disabling retriggers mean rerun will happen by default instead? Basically, I'm worried about a regular Windows build hitting an infra failure (say bug 1444168, as has frequently been the case lately), being retriggered, then having the Bs job fail because it was a retrigger, and therefore causing downstream test jobs to not run (when a rerun of the failed build in the first place would have allowed everything to proceed without issue).

One more data point: after my NI in this bug, I observed a couple of random l10n jobs fail during the 60.0b8 release promotion process, noticing both within a minute of the failures. Normally I would have rerun them immediately without saying anything. End of story, everyone's happy.

Instead I waited to see when they would be noticed and rerun by someone on release duty without any prompting. They were rerun 23 and 16 min later respectively. I'm not saying this to cast aspersions on our release duty people, just pointing out that everyone gets busy and having more capable eyes on these things allows for a better level of service than we will otherwise get.

To summarize, these proposals largely mean less work for sheriffs and more for releaseduty, so ultimately, I'd say to do whatever you feel makes sense and just make sure it's well communicated so that procedures can be updated accordingly. I just tend to err on the side of sharing the load where possible and this feels like something where that can be reasonably accomplished.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #7)
> Having a --force option makes sense for those situations.

Let's talk to the tc team to see if this is doable? It may require an API and cli tweak, so it may not necessarily be quick.

> > I think the above disable-retriggers-on-[certain?]-relpro-graphs solves most
> > of this. There may be an issue if we accidentally retrigger the action task
> > itself.
> 
> I'm a bit confused with how this proposal works with regular builds in CI.

This disable-retrigger-action would be graph-specific. So the on-push, "regular builds in CI", would have a retrigger action enabled. Nothing changes there. For [certain?] release graphs, the actions.json would eliminate the retrigger action, and potentially the "add-tasks" action and others.

However, now that I think about it, treeherder may still go to the on-push decision task for actions.json, even if the task in question is not in the on-push graph. We'd have to investigate to see if this would actually solve the problem.

> Does disabling retriggers mean rerun will happen by default instead?
> Basically, I'm worried about a regular Windows build hitting an infra
> failure (say bug 1444168 as has frequently been the case lately), being
> retriggered, than having the Bs job fail because it was a retrigger and
> therefore cause downstream test jobs to not run (when a rerun on the failed
> build in the first place would have allowed everything to proceed without
> issue).

I think this is bug 1420482.

> One more data point: After my NI in this bug, I observed a couple random
> l10n jobs fail during the 60.0b8 release promotion process, both within a
> minute of having done so. Normally I would have rerun them immediately
> afterwards without saying anything. End of story, everyone's happy.
> 
> Instead I waited to see when they would be noticed and rerun by someone on
> release duty without any prompting. They were rerun 23 and 16 min later
> respectively. I'm not saying this to cast aspersions on our release duty
> people, just pointing out that everyone gets busy and having more capable
> eyes on these things allows for a better level of service than we will
> otherwise get.

I believe the releaseduty dashboard proposal (https://github.com/mozilla-releng/releasewarrior-2.0/issues/114) would help solve this. It'll ping anyone with the dashboard loaded. It would be a shared view, so we don't have to worry about the treeherder failed task still showing. I'm certainly open to you having access to that dashboard if you want to watch for failed release tasks - this is more about the confusion of the failed task still showing as needing action on treeherder, even if the other dashboard(s) show it as having been acted upon.

> To summarize, these proposals largely mean less work for sheriffs and more
> for releaseduty, so ultimately, I'd say to do whatever you feel makes sense
> and just make sure it's well communicated so that procedures can be updated
> accordingly. I just tend to err on the side of sharing the load where
> possible and this feels like something where that can be reasonably
> accomplished.
See Also: → 1420482
@emorley might be more appropriate to loop in than dburns. Just want to keep TH looped in.
Flags: needinfo?(dburns) → needinfo?(emorley)
Depends on: 1450094

Comment 10

> Let's talk to the tc team to see if this is doable? It may require an API and cli tweak, so it may not necessarily be quick.

As long as this change is entirely within actions, this is pretty easy: just add a `force` property to the action schema, that defaults to false.  Then set `"force": true` when you want to force.
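What dustin describes might look like the following sketch. The schema fragment follows JSON Schema conventions; the surrounding helper and property description are illustrative, not the actual in-tree action definition:

```python
# Sketch of the schema change described above: the rerun action's input
# schema gains a boolean "force" property defaulting to false. The schema
# shape follows JSON Schema; the action definition around it is a guess.

RERUN_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "force": {
            "type": "boolean",
            "default": False,
            "description": "Rerun even if the task already completed successfully.",
        },
    },
    "additionalProperties": False,
}

def normalize_input(payload):
    """Apply the schema's defaults to an action's input payload."""
    defaults = {
        name: prop["default"]
        for name, prop in RERUN_INPUT_SCHEMA["properties"].items()
        if "default" in prop
    }
    return {**defaults, **payload}

print(normalize_input({}))               # {'force': False}
print(normalize_input({"force": True}))  # {'force': True}
```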

> However, now that I think about it, treeherder may still go to the on-push decision task for actions.json, even if the task in question is not in the on-push graph. We'd have to investigate to see if this would actually solve the problem.

The actions spec says to use the decision task's actions.json.  Within Taskcluster, the decision task for task X is the task with ID x.taskGroupId.
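Putting those two sentences together, the lookup can be sketched as below. The queue URL shape matches the Taskcluster queue artifact endpoint of that era and should be treated as an assumption:

```python
# Sketch of resolving the right actions.json for a task, following the
# rule stated above: the decision task's ID is the task's taskGroupId,
# and actions.json is a public artifact of the decision task.

QUEUE = "https://queue.taskcluster.net/v1"  # assumed endpoint root

def actions_json_url(task):
    """Given a task definition, build the URL of its graph's actions.json."""
    decision_task_id = task["taskGroupId"]
    return f"{QUEUE}/task/{decision_task_id}/artifacts/public/actions.json"

task = {"taskId": "YSyAwCJ1TFuuAN3z9a5LHA", "taskGroupId": "UWwSJe4kQMOn9Kd0gx8i-Q"}
print(actions_json_url(task))
```

This is why a UI that derives the decision task from the pushlog ID rather than from taskGroupId can pick up the wrong actions.json for tasks outside the on-push graph.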

Comment 11

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #10)
> > Let's talk to the tc team to see if this is doable? It may require an API and cli tweak, so it may not necessarily be quick.
> 
> As long as this change is entirely within actions, this is pretty easy: just
> add a `force` property to the action schema, that defaults to false.  Then
> set `"force": true` when you want to force.

We were talking about taskcluster-cli, which I think would require an API change - bug 1450094.

However, we could add an action to rerun, check the status, and check the `force` input bool if it's completed (or only allow failure/exception reruns in the action, and fall back to tc-cli to "force"). RyanVM, how would you feel about that? An improvement? Prefer the cli?

If we add a rerun custom action, I believe the decision and action tasks would need rerun scopes, and each user who uses the rerun action as well.
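A sketch of the action logic proposed here, assuming illustrative status strings and a stand-in callable for the queue's rerun call:

```python
# Sketch of the custom rerun action floated above: check the task's
# status first, allow a rerun for failed/exception runs, and require an
# explicit force flag to rerun a completed task. The state strings and
# the `rerun` callable are stand-ins, not the real queue client.

RERUNNABLE_STATES = {"failed", "exception"}

def rerun_action(task_status, action_input, rerun):
    """Decide whether to invoke the queue's rerun for this task."""
    state = task_status["state"]
    if state in RERUNNABLE_STATES:
        return rerun()
    if state == "completed" and action_input.get("force"):
        return rerun()  # informed override, per the --force discussion
    raise RuntimeError(
        f"refusing to rerun task in state {state!r} without force=true"
    )

calls = []
rerun_action({"state": "failed"}, {}, lambda: calls.append("rerun"))
rerun_action({"state": "completed"}, {"force": True}, lambda: calls.append("rerun"))
try:
    rerun_action({"state": "completed"}, {}, lambda: calls.append("rerun"))
except RuntimeError:
    pass  # completed without force: refused, as intended
print(calls)  # ['rerun', 'rerun']
```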

> > However, now that I think about it, treeherder may still go to the on-push decision task for actions.json, even if the task in question is not in the on-push graph. We'd have to investigate to see if this would actually solve the problem.
> 
> The actions spec says to use the decision task's actions.json.  Within
> Taskcluster, the decision task for task X is the task with ID x.taskGroupId.

Cool. Sounds like if Treeherder uses the pushlog ID to find the taskGroupId, that would be a Treeherder bug.
(In reply to Aki Sasaki [:aki] from comment #11)
> However, we could add an action to rerun, check the status, and check the
> `force` input bool if it's completed (or only allow failure/exception reruns
> in the action, and fall back to tc-cli to "force"). RyanVM how would you
> feel about that? Improvement? Prefer the cli?

This is for retriggers from TH? That sounds fine to me (only allowing rerun on failure/exception runs and forcing via cli).
(In reply to Jordan Lund (:jlund) from comment #2)
> any concerns with removing mutable actions from release
> specific tasks within Treeherder? Or even removing the tasks themselves from
> view? Just removing mutable actions (retrigger) would likely be the
> preference for us so that we keep some release transparency on TH

Hi! I'm happy to defer to the Taskcluster team/sheriffs/release engineering on this. The Treeherder team doesn't have a huge amount of context here (e.g. the task scheduling feature in Treeherder was written by the Taskcluster team) and the choice of approach doesn't really affect us :-)
Flags: needinfo?(emorley)

Comment 15

Posted patch: rerun action (Splinter Review)
This is my current rerun action WIP.
It fails on trying to rerun completed tasks - yay! It also fails on rerunning failed/exception tasks - boo!

[task 2018-04-03T19:12:57.533465Z] HTTPError: 403 Client Error: Forbidden for url: http://taskcluster/queue/v1/task/dBpfbtV5Qfq42UenI34HYQ/rerun

https://tools.taskcluster.net/groups/UWwSJe4kQMOn9Kd0gx8i-Q/tasks/YSyAwCJ1TFuuAN3z9a5LHA/details

Is this because we only have scopes for queue:rerun-task:gecko-level-1/* ? We may need to add level-2 and level-3, for the appropriate roles.
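For illustration, the scope check behind that 403 might look like the sketch below. Only the `queue:rerun-task:gecko-level-N/*` prefix comes from the comment; the scope suffix and the satisfaction helper are made-up examples of Taskcluster-style wildcard matching:

```python
# Sketch of the scope check implied by the 403 above: rerunning a task
# requires a matching queue:rerun-task scope, granted per gecko level.
# The suffix after the level prefix is illustrative.

def required_rerun_scope(level):
    """Scope pattern needed to rerun tasks at a given gecko level."""
    return f"queue:rerun-task:gecko-level-{level}/*"

def satisfies(granted, needed):
    """Taskcluster-style satisfaction: a trailing '*' is a prefix wildcard."""
    for scope in granted:
        if scope == needed:
            return True
        if scope.endswith("*") and needed.startswith(scope[:-1]):
            return True
    return False

granted = [required_rerun_scope(1)]  # only level 1, as in the comment
print(satisfies(granted, "queue:rerun-task:gecko-level-1/some-task"))  # True
print(satisfies(granted, "queue:rerun-task:gecko-level-3/some-task"))  # False
```

With only the level-1 scope granted, rerunning a level-3 task fails exactly as the log shows, hence adding the level-2 and level-3 scopes to the appropriate roles.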
(In reply to Aki Sasaki [:aki] from comment #15)
> Is this because we only have scopes for queue:rerun-task:gecko-level-1/* ?
> We may need to add level-2 and level-3, for the appropriate roles.

Did so; green!
Attachment #8964660 - Attachment description: wip - rerun action → rerun action
Attachment #8964660 - Flags: review?(bstack)
Comment on attachment 8964660 [details] [diff] [review]
rerun action

I don't think that status_task needs that testing mode path. We allow things to get from apis in testing mode but just mock out the bits where we would be creating tasks and such things. Checking the status of a task should be just fine. Otherwise looks great!
Attachment #8964660 - Flags: review?(bstack) → review+
Keywords: leave-open

Comment 19

We still need to address the retriggers, but a) we have reruns, and b) we can possibly default to reruns for non-tests in bug 1420482.
The leave-open keyword is there and there is no activity for 6 months.
:dustin, maybe it's time to close this bug?
Flags: needinfo?(dustin)
Per Aki's comment 19, looks like no?
Flags: needinfo?(dustin) → needinfo?(aki)

Comment 25

7 months ago
If the keyword is problematic, we could remove it; otherwise I'd lean towards keeping this open until we resolve comment 19 in some manner.
Flags: needinfo?(aki)
Keywords: leave-open
Version: Version 3 → 3 Branch
Updated

2 months ago
Duplicate of this bug: 1533940
Updated

2 months ago
Blocks: 1535193
Updated

2 months ago
Assignee: nobody → mozilla
Comment 27

2 months ago

In order to prevent retriggers for release tasks, we will make the retrigger action rerun instead, so move the code together.

Comment 28

2 months ago

Many tasks (release tasks and cached tasks, in particular) should be re-run rather than retriggered. Make the retrigger action re-run them instead.

Comment 29

2 months ago

In Bug 1519599, treeherder switched to using add-new-jobs to retrigger jobs, since
there wasn't an action to retrigger multiple jobs. This prevents us from adding logic
to rerun some jobs instead of retriggering them.

This adds a new action that takes input like add-new-jobs, but that we can add logic
to handle rerun vs. retrigger in. Additionally, the input it takes is designed
to make Bug 1521032 easier to implement.

Comment 30

2 months ago

Many tasks (release tasks and cached tasks, in particular) should be re-run rather than retriggered. Make the retrigger-multiple action re-run them instead.
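The end state these patches describe can be sketched as follows; the kind names and the per-kind decision table are illustrative, not the actual taskgraph logic:

```python
# Sketch of a "selectively re-run" retrigger-multiple action: tasks whose
# kind should never be retriggered (release and cached tasks) get a
# rerun, and everything else gets a normal retrigger. Kind names are
# hypothetical examples.

RERUN_KINDS = {"release-repackage", "release-signing", "cached-toolchain"}

def plan_actions(tasks):
    """Map each task label to the safe action for its kind."""
    return {
        label: ("rerun" if kind in RERUN_KINDS else "retrigger")
        for label, kind in tasks.items()
    }

tasks = {
    "repackage-linux64": "release-repackage",
    "build-win64": "build",
}
print(plan_actions(tasks))
# {'repackage-linux64': 'rerun', 'build-win64': 'retrigger'}
```

Routing the decision through one action also means a UI only needs to offer a single "retrigger" button while the graph logic picks the CoT-safe operation per task.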

Attachment #9057752 - Attachment description: Bug 1450012: [taskgraph] Make retrigger action re-run many tasks; r?dustin → Bug 1450012: [taskgraph] Disable retrigger action for many tasks; r?dustin

Comment 31

2 months ago
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/acfc18211ade
[taskgraph] Move rerun action to retrigger.py; r=dustin
https://hg.mozilla.org/integration/autoland/rev/fe831ef6691f
[taskgraph] Disable retrigger action for many tasks; r=dustin
https://hg.mozilla.org/integration/autoland/rev/43956eb12043
[taskgraph] Add an action to retirgger multiple tasks at once; r=dustin
https://hg.mozilla.org/integration/autoland/rev/c33e7bb2397c
[taskgraph] Make `retrigger-multiple` action selectively re-run tasks; r=dustin
Updated

Last month
Status: NEW → RESOLVED
Closed: Last month
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Regressions: 1559071