Prevent retriggering release tasks from treeherder

RESOLVED FIXED in Firefox 68

(Reporter: tomprince, Assigned: tomprince)

Firefox Tracking Flags: firefox-esr60 fixed, firefox67 fixed, firefox68 fixed
Version: 3 Branch
Target Milestone: mozilla68
(5 attachments)

Description

No description provided.

Summary: Prevent retriggering release tasks. → Prevent retriggering release tasks from treeherder

Comment 1

Last year
During 59.0.2, both ryan and aki reran the same repackage task, then ryan cancelled his (the latest run). aki had to cancel the repackage-signing task, then rerun both in order. This could have forced us to do a build2 if it hadn't been caught in time. We should consider moving off treeherder and into a releaseduty dashboard, and/or going with stricter handoffs.

Given that treeherder doesn't show re-runs clearly, it was suggested either removing the release tasks from treeherder, or at least removing the ability to retrigger them via treeherder. This can probably be done by specifying the action context (https://firefox-source-docs.mozilla.org/taskcluster/taskcluster/actions.html#setting-the-action-context)
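For context, the action-context mechanism referenced above works roughly like this: each action in actions.json carries a context, a list of tag-sets, and the action is offered for a task only if the task's tags match at least one set. A minimal sketch in Python (the tag values and the release kind name are illustrative, not the real gecko kinds):

```python
# Sketch of how a UI decides whether an action (e.g. retrigger) applies
# to a task, per the Taskcluster actions spec: an action's "context" is a
# list of tag-sets, and the action is offered for a task only if the
# task's tags are a superset of at least one tag-set.

def action_applies(context, task_tags):
    """Return True if the action's context matches the task's tags."""
    return any(
        all(task_tags.get(key) == value for key, value in tag_set.items())
        for tag_set in context
    )

# A hypothetical retrigger action restricted to ordinary CI kinds only,
# so release-promotion tasks never match.
retrigger_context = [{"kind": "build"}, {"kind": "test"}]

ci_task = {"kind": "build", "createdForUser": "someone@example.com"}
release_task = {"kind": "release-repackage-signing"}

print(action_applies(retrigger_context, ci_task))       # True
print(action_applies(retrigger_context, release_task))  # False
```

An empty context would hide the action from every task in the graph, which is one way to drop retrigger entirely from a release graph's actions.json.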

Comment 2

@ryanvm, @dburns - any concerns with removing mutable actions from release-specific tasks within Treeherder? Or even removing the tasks themselves from view? Just removing mutable actions (retrigger) would likely be the preference for us, so that we keep some release transparency on TH.
Flags: needinfo?(ryanvm)
Flags: needinfo?(dburns)
Is anybody going to notice when these jobs fail on Nightlies if they aren't visible on TH?
Flags: needinfo?(ryanvm)

Comment 4

I think removing them from visibility on Treeherder would be a mistake. It also feels to me like comment 1 is throwing the baby out with the bath water. Ultimately, I've seen two ways the current situation can bite us:
1) Retriggering a job when a rerun was intended. This obviously makes for a lot of downstream CoT issues and should be much more difficult to do than it currently is.
2) Rerunning a job that had already been rerun and had since completed successfully (the situation I hit with Aki in comment 1).

I would suggest making rerun a no-op if an existing task ID has already completed successfully. I can't see any good reason to rerun a task with the same ID except under failure conditions. That would have avoided situation #2. Just finishing with a status of "completed" would suffice?

Situation #1 could be avoided by changing the default action from retrigger to rerun on jobs where there is sensitivity to CoT. Note that this would be a beneficial change in general since there are CI jobs which depend on signed builds (xpcshell) which currently aren't run on pushes where a failed build was retriggered because the subsequent signing step failed.

In other words, I think a lot of these issues can be resolved by having some safer/saner defaults rather than making things more restrictive and less transparent. Not to mention it'd be signing up the releaseduty folks to monitor more than they currently are since as I mentioned previously, I'm pretty sure than the sheriffs are currently the only people who'll notice when a Nightly release promotion job fails.

Comment 5

There are a couple of issues here:

Retriggers:
Preventing retriggers by disabling the retrigger action on [certain?] relpro action tasks would possibly help avoid the sheriffs-retriggering-jobs scenario. It also wouldn't affect nightlies until we move to shippable-builds with nightly promotion, and we could potentially send a flag to keep retriggers available if wanted (probably not; I think we want reruns over retriggers here as well). If retriggers are disabled, that removes one issue with release tasks on treeherder.

Reruns:
This could be similar to the above: nightly promotion could show on treeherder, and release promotion could be on its own dashboard.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #4)
> I would suggest making rerun a no-op if an existing task ID has already
> completed successfully. I can't see any good reason to rerun a task with the
> same ID except under failure conditions. That would have avoided situation
> #2. Just finishing with a status of "completed" would suffice?

Sometimes we want to rerun when it's completed. Possibly a --force flag? Or maybe we need to determine when a rerun-on-completion is useful or needed, and weigh those cases against the danger of rerunning a completed task when that's not the desired behavior.

> Situation #1 could be avoided by changing the default action from retrigger
> to rerun on jobs where there is sensitivity to CoT. Note that this would be
> a beneficial change in general since there are CI jobs which depend on
> signed builds (xpcshell) which currently aren't run on pushes where a failed
> build was retriggered because the subsequent signing step failed.

I think the above disable-retriggers-on-[certain?]-relpro-graphs solves most of this. There may be an issue if we accidentally retrigger the action task itself.

> In other words, I think a lot of these issues can be resolved by having some
> safer/saner defaults rather than making things more restrictive and less
> transparent. Not to mention it'd be signing up the releaseduty folks to
> monitor more than they currently are since as I mentioned previously, I'm
> pretty sure than the sheriffs are currently the only people who'll notice
> when a Nightly release promotion job fails.

I think this makes sense. We could also address the issue of sheriffs being the only ones who notice when a nightly release promotion task fails.
(In reply to Aki Sasaki [:aki] from comment #5)
> Retriggers:
> Preventing retriggers by disabling the retrigger action on [certain?] relpro
> action tasks would possibly help avoid the sheriffs-retriggering-jobs
> scenario. It also wouldn't affect nightlies until we move to
> shippable-builds with nightly promotion, and we could potentially send a
> flag to keep retriggers available if wanted (probably not; I think we want
> reruns over retriggers here as well). If retriggers are disabled, that
> removes one issue with release tasks on treeherder.

(thanks to Tom for the idea; it's a good one.)

Comment 7

(In reply to Aki Sasaki [:aki] from comment #5)
> Sometimes we want to rerun when it's completed. Possibly a --force ? Or
> maybe we need to determine when a rerun-on-completion is useful or needed,
> and weigh those against the danger of rerunning a completed task when that's
> not the desired behavior.

Having a --force option makes sense for those situations. From my POV, there's no issue with someone making an informed decision to do the non-standard thing, but that would at least require them to think about it first. And in my case, seeing that it had already run successfully would have stopped me from going any further with things and avoided scenario #1.

> I think the above disable-retriggers-on-[certain?]-relpro-graphs solves most
> of this. There may be an issue if we accidentally retrigger the action task
> itself.

I'm a bit confused about how this proposal works with regular builds in CI. Does disabling retriggers mean rerun will happen by default instead? Basically, I'm worried about a regular Windows build hitting an infra failure (say bug 1444168, as has frequently been the case lately), being retriggered, then having the Bs job fail because it was a retrigger, and therefore causing downstream test jobs to not run (when a rerun of the failed build in the first place would have allowed everything to proceed without issue).

One more data point: after my NI in this bug, I observed a couple of random l10n jobs fail during the 60.0b8 release promotion process, noticing both within a minute of the failures. Normally I would have rerun them immediately without saying anything. End of story, everyone's happy.

Instead I waited to see when they would be noticed and rerun by someone on release duty without any prompting. They were rerun 23 and 16 min later respectively. I'm not saying this to cast aspersions on our release duty people, just pointing out that everyone gets busy and having more capable eyes on these things allows for a better level of service than we will otherwise get.

To summarize, these proposals largely mean less work for sheriffs and more for releaseduty, so ultimately, I'd say to do whatever you feel makes sense and just make sure it's well communicated so that procedures can be updated accordingly. I just tend to err on the side of sharing the load where possible and this feels like something where that can be reasonably accomplished.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #7)
> Having a --force option makes sense for those situations.

Let's talk to the tc team to see if this is doable? It may require an API and cli tweak, so it may not necessarily be quick.

> > I think the above disable-retriggers-on-[certain?]-relpro-graphs solves most
> > of this. There may be an issue if we accidentally retrigger the action task
> > itself.
> 
> I'm a bit confused with how this proposal works with regular builds in CI.

This disable-retrigger-action would be graph-specific. So the on-push, "regular builds in CI", would have a retrigger action enabled. Nothing changes there. For [certain?] release graphs, the actions.json would eliminate the retrigger action, and potentially the "add-tasks" action and others.

However, now that I think about it, treeherder may still go to the on-push decision task for actions.json, even if the task in question is not in the on-push graph. We'd have to investigate to see if this would actually solve the problem.

> Does disabling retriggers mean rerun will happen by default instead?
> Basically, I'm worried about a regular Windows build hitting an infra
> failure (say bug 1444168 as has frequently been the case lately), being
> retriggered, than having the Bs job fail because it was a retrigger and
> therefore cause downstream test jobs to not run (when a rerun on the failed
> build in the first place would have allowed everything to proceed without
> issue).

I think this is bug 1420482.

> One more data point: After my NI in this bug, I observed a couple random
> l10n jobs fail during the 60.0b8 release promotion process, both within a
> minute of having done so. Normally I would have rerun them immediately
> afterwards without saying anything. End of story, everyone's happy.
> 
> Instead I waited to see when they would be noticed and rerun by someone on
> release duty without any prompting. They were rerun 23 and 16 min later
> respectively. I'm not saying this to cast aspersions on our release duty
> people, just pointing out that everyone gets busy and having more capable
> eyes on these things allows for a better level of service than we will
> otherwise get.

I believe the releaseduty dashboard proposal (https://github.com/mozilla-releng/releasewarrior-2.0/issues/114) would help solve this. It'll ping anyone with the dashboard loaded. It would be a shared view, so we don't have to worry about the treeherder failed task still showing. I'm certainly open to you having access to that dashboard if you want to watch for failed release tasks - this is more about the confusion of the failed task still showing as needing action on treeherder, even if the other dashboard(s) show it as having been acted upon.

> To summarize, these proposals largely mean less work for sheriffs and more
> for releaseduty, so ultimately, I'd say to do whatever you feel makes sense
> and just make sure it's well communicated so that procedures can be updated
> accordingly. I just tend to err on the side of sharing the load where
> possible and this feels like something where that can be reasonably
> accomplished.
See Also: → 1420482
@emorley might be more appropriate to loop in than dburns. Just want to keep TH looped in.
Flags: needinfo?(dburns) → needinfo?(emorley)
Depends on: 1450094

Comment 10

> Let's talk to the tc team to see if this is doable? It may require an API and cli tweak, so it may not necessarily be quick.

As long as this change is entirely within actions, this is pretty easy: just add a `force` property to the action schema, that defaults to false.  Then set `"force": true` when you want to force.
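What dustin describes might look like the following sketch. The schema fragment follows JSON Schema conventions; the surrounding helper and property description are illustrative, not the actual in-tree action definition:

```python
# Sketch of the schema change described above: the rerun action's input
# schema gains a boolean "force" property defaulting to false. The schema
# shape follows JSON Schema; the action definition around it is a guess.

RERUN_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "force": {
            "type": "boolean",
            "default": False,
            "description": "Rerun even if the task already completed successfully.",
        },
    },
    "additionalProperties": False,
}

def normalize_input(payload):
    """Apply the schema's defaults to an action's input payload."""
    defaults = {
        name: prop["default"]
        for name, prop in RERUN_INPUT_SCHEMA["properties"].items()
        if "default" in prop
    }
    return {**defaults, **payload}

print(normalize_input({}))               # {'force': False}
print(normalize_input({"force": True}))  # {'force': True}
```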

> However, now that I think about it, treeherder may still go to the on-push decision task for actions.json, even if the task in question is not in the on-push graph. We'd have to investigate to see if this would actually solve the problem.

The actions spec says to use the decision task's actions.json.  Within Taskcluster, the decision task for task X is the task with ID x.taskGroupId.
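Putting those two sentences together, the lookup can be sketched as below. The queue URL shape matches the Taskcluster queue artifact endpoint of that era and should be treated as an assumption:

```python
# Sketch of resolving the right actions.json for a task, following the
# rule stated above: the decision task's ID is the task's taskGroupId,
# and actions.json is a public artifact of the decision task.

QUEUE = "https://queue.taskcluster.net/v1"  # assumed endpoint root

def actions_json_url(task):
    """Given a task definition, build the URL of its graph's actions.json."""
    decision_task_id = task["taskGroupId"]
    return f"{QUEUE}/task/{decision_task_id}/artifacts/public/actions.json"

task = {"taskId": "YSyAwCJ1TFuuAN3z9a5LHA", "taskGroupId": "UWwSJe4kQMOn9Kd0gx8i-Q"}
print(actions_json_url(task))
```

This is why a UI that derives the decision task from the pushlog ID rather than from taskGroupId can pick up the wrong actions.json for tasks outside the on-push graph.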

Comment 11

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #10)
> > Let's talk to the tc team to see if this is doable? It may require an API and cli tweak, so it may not necessarily be quick.
> 
> As long as this change is entirely within actions, this is pretty easy: just
> add a `force` property to the action schema, that defaults to false.  Then
> set `"force": true` when you want to force.

We were talking about taskcluster-cli, which I think would require an API change - bug 1450094.

However, we could add an action to rerun, check the status, and check the `force` input bool if it's completed (or only allow failure/exception reruns in the action, and fall back to tc-cli to "force"). RyanVM, how would you feel about that? An improvement? Prefer the cli?

If we add a rerun custom action, I believe the decision and action tasks would need rerun scopes, and each user who uses the rerun action as well.
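A sketch of the action logic proposed here, assuming illustrative status strings and a stand-in callable for the queue's rerun call:

```python
# Sketch of the custom rerun action floated above: check the task's
# status first, allow a rerun for failed/exception runs, and require an
# explicit force flag to rerun a completed task. The state strings and
# the `rerun` callable are stand-ins, not the real queue client.

RERUNNABLE_STATES = {"failed", "exception"}

def rerun_action(task_status, action_input, rerun):
    """Decide whether to invoke the queue's rerun for this task."""
    state = task_status["state"]
    if state in RERUNNABLE_STATES:
        return rerun()
    if state == "completed" and action_input.get("force"):
        return rerun()  # informed override, per the --force discussion
    raise RuntimeError(
        f"refusing to rerun task in state {state!r} without force=true"
    )

calls = []
rerun_action({"state": "failed"}, {}, lambda: calls.append("rerun"))
rerun_action({"state": "completed"}, {"force": True}, lambda: calls.append("rerun"))
try:
    rerun_action({"state": "completed"}, {}, lambda: calls.append("rerun"))
except RuntimeError:
    pass  # completed without force: refused, as intended
print(calls)  # ['rerun', 'rerun']
```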

> > However, now that I think about it, treeherder may still go to the on-push decision task for actions.json, even if the task in question is not in the on-push graph. We'd have to investigate to see if this would actually solve the problem.
> 
> The actions spec says to use the decision task's actions.json.  Within
> Taskcluster, the decision task for task X is the task with ID x.taskGroupId.

Cool. Sounds like if Treeherder uses the pushlog ID to find the taskGroupId, that would be a Treeherder bug.
(In reply to Aki Sasaki [:aki] from comment #11)
> However, we could add an action to rerun, check the status, and check the
> `force` input bool if it's completed (or only allow failure/exception reruns
> in the action, and fall back to tc-cli to "force"). RyanVM how would you
> feel about that? Improvement? Prefer the cli?

This is for retriggers from TH? That sounds fine to me (only allowing rerun on failure/exception runs and forcing via cli).
(In reply to Jordan Lund (:jlund) from comment #2)
> any concerns with removing mutable actions from release
> specific tasks within Treeherder? Or even removing the tasks themselves from
> view? Just removing mutable actions (retrigger) would likely be the
> preference for us so that we keep some release transparency on TH

Hi! I'm happy to defer to the Taskcluster team/sheriffs/release engineering on this. The Treeherder team doesn't have a huge amount of context here (e.g. the task scheduling feature in Treeherder was written by the Taskcluster team) and the choice of approach doesn't really affect us :-)
Flags: needinfo?(emorley)

Comment 15

Posted patch: rerun action (Splinter Review)
This is my current rerun action WIP.
It fails on trying to rerun completed tasks - yay! It also fails on rerunning failed/exception tasks - boo!

[task 2018-04-03T19:12:57.533465Z] HTTPError: 403 Client Error: Forbidden for url: http://taskcluster/queue/v1/task/dBpfbtV5Qfq42UenI34HYQ/rerun

https://tools.taskcluster.net/groups/UWwSJe4kQMOn9Kd0gx8i-Q/tasks/YSyAwCJ1TFuuAN3z9a5LHA/details

Is this because we only have scopes for queue:rerun-task:gecko-level-1/* ? We may need to add level-2 and level-3, for the appropriate roles.
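For illustration, the scope check behind that 403 might look like the sketch below. Only the `queue:rerun-task:gecko-level-N/*` prefix comes from the comment; the scope suffix and the satisfaction helper are made-up examples of Taskcluster-style wildcard matching:

```python
# Sketch of the scope check implied by the 403 above: rerunning a task
# requires a matching queue:rerun-task scope, granted per gecko level.
# The suffix after the level prefix is illustrative.

def required_rerun_scope(level):
    """Scope pattern needed to rerun tasks at a given gecko level."""
    return f"queue:rerun-task:gecko-level-{level}/*"

def satisfies(granted, needed):
    """Taskcluster-style satisfaction: a trailing '*' is a prefix wildcard."""
    for scope in granted:
        if scope == needed:
            return True
        if scope.endswith("*") and needed.startswith(scope[:-1]):
            return True
    return False

granted = [required_rerun_scope(1)]  # only level 1, as in the comment
print(satisfies(granted, "queue:rerun-task:gecko-level-1/some-task"))  # True
print(satisfies(granted, "queue:rerun-task:gecko-level-3/some-task"))  # False
```

With only the level-1 scope granted, rerunning a level-3 task fails exactly as the log shows, hence adding the level-2 and level-3 scopes to the appropriate roles.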
(In reply to Aki Sasaki [:aki] from comment #15)
> Is this because we only have scopes for queue:rerun-task:gecko-level-1/* ?
> We may need to add level-2 and level-3, for the appropriate roles.

Did so; green!
Attachment #8964660 - Attachment description: wip - rerun action → rerun action
Attachment #8964660 - Flags: review?(bstack)
Comment on attachment 8964660 [details] [diff] [review]
rerun action

I don't think that status_task needs that testing mode path. We allow things to get from apis in testing mode but just mock out the bits where we would be creating tasks and such things. Checking the status of a task should be just fine. Otherwise looks great!
Attachment #8964660 - Flags: review?(bstack) → review+
Keywords: leave-open

Comment 19

We still need to address the retriggers, but a) we have reruns, and b) we can possibly default to reruns for non-tests in bug 1420482.
The leave-open keyword is there and there is no activity for 6 months.
:dustin, maybe it's time to close this bug?
Flags: needinfo?(dustin)
Per Aki's comment 19, looks like no?
Flags: needinfo?(dustin) → needinfo?(aki)

Comment 25

7 months ago
If the keyword is problematic, we could remove it; otherwise I'd lean towards keeping this open until we resolve comment 19 in some manner.
Flags: needinfo?(aki)
Keywords: leave-open
Version: Version 3 → 3 Branch
Updated

2 months ago
Duplicate of this bug: 1533940
Updated

2 months ago
Blocks: 1535193
Updated

2 months ago
Assignee: nobody → mozilla
Comment 27

2 months ago

In order to prevent retriggers for release tasks, we will make the retrigger action rerun instead, so move the code together.

Comment 28

2 months ago

Many tasks (release tasks and cached tasks, in particular) should be re-run rather than retriggered. Make the retrigger action re-run them instead.

Comment 29

2 months ago

In Bug 1519599, treeherder switched to using add-new-jobs to retrigger jobs, since
there wasn't an action to retrigger multiple jobs. This prevents us from adding logic
to rerun some jobs instead of retriggering them.

This adds a new action that takes input like add-new-jobs, but that we can add logic
to handle rerun vs. retrigger in. Additionally, the input it takes is designed
to make Bug 1521032 easier to implement.

Comment 30

2 months ago

Many tasks (release tasks and cached tasks, in particular) should be re-run rather than retriggered. Make the retrigger-multiple action re-run them instead.
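The end state these patches describe can be sketched as follows; the kind names and the per-kind decision table are illustrative, not the actual taskgraph logic:

```python
# Sketch of a "selectively re-run" retrigger-multiple action: tasks whose
# kind should never be retriggered (release and cached tasks) get a
# rerun, and everything else gets a normal retrigger. Kind names are
# hypothetical examples.

RERUN_KINDS = {"release-repackage", "release-signing", "cached-toolchain"}

def plan_actions(tasks):
    """Map each task label to the safe action for its kind."""
    return {
        label: ("rerun" if kind in RERUN_KINDS else "retrigger")
        for label, kind in tasks.items()
    }

tasks = {
    "repackage-linux64": "release-repackage",
    "build-win64": "build",
}
print(plan_actions(tasks))
# {'repackage-linux64': 'rerun', 'build-win64': 'retrigger'}
```

Routing the decision through one action also means a UI only needs to offer a single "retrigger" button while the graph logic picks the CoT-safe operation per task.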

Attachment #9057752 - Attachment description: Bug 1450012: [taskgraph] Make retrigger action re-run many tasks; r?dustin → Bug 1450012: [taskgraph] Disable retrigger action for many tasks; r?dustin

Comment 31

2 months ago
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/acfc18211ade
[taskgraph] Move rerun action to retrigger.py; r=dustin
https://hg.mozilla.org/integration/autoland/rev/fe831ef6691f
[taskgraph] Disable retrigger action for many tasks; r=dustin
https://hg.mozilla.org/integration/autoland/rev/43956eb12043
[taskgraph] Add an action to retirgger multiple tasks at once; r=dustin
https://hg.mozilla.org/integration/autoland/rev/c33e7bb2397c
[taskgraph] Make `retrigger-multiple` action selectively re-run tasks; r=dustin
Updated

Last month
Status: NEW → RESOLVED
Closed: Last month
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Regressions: 1559071