Closed Bug 1393277 Opened 3 years ago Closed 3 years ago

cot-verifiable action tasks

Categories

(Firefox Build System :: Task Configuration, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED
mozilla58

People

(Reporter: aki, Assigned: aki)

References

Details

Attachments

(4 files)

What:

The ability to generate a subgraph of the target task graph, and use a previous graph's tasks as kind-dependencies.

E.g., generate a subgraph containing build-signing-win64-nightly/opt and all children. All of build-signing-win64-nightly/opt's dependencies' taskIds are pulled from another decision task's graph, by label.


Why:

- We regularly need to resubmit partial graphs because the breakpoint task hasn't been resolved before the deadline.
- we need to be able to retrigger failed tasks without breaking CoT.
- this would allow us to port release graphs in-tree gradually over time (i.e., one segment is in-tree; another segment is still in releasetasks templates) instead of forcing us to port everything at once.
- in the future, this may allow for splitting 3000+ task nightly graphs into discrete graphs that depend on each other. (A status graph could potentially depend on each of the child graphs for meta-graph status.) Splitting by platform makes a lot of sense.

I'm currently looking at how feasible this is, since this hypothetical feature has come up quite a bit as a potential solution to our problems.
Dustin: I think in the current design, this would go in the optimize step.

a) do you foresee any issues with this proposal?
b) do you think we should split optimize into two steps to allow for this?
Flags: needinfo?(dustin)
This sounds a lot like a retrigger, for which we already have an action.  I suspect it could be implemented as an action.

I don't see how this makes sense as an optimization..
Flags: needinfo?(dustin)
Hm, I just looked at a retrigger action task, and

a) it has task-graph.json!
b) it's run in a decision image!

which is very promising for CoT verification.
In fact, it should be CoT verifiable already. The problem is probably that the retriggered tasks are generated with the original taskGroupId [1], so they look at the original decision task for verification. We either need a breadcrumb back to the action task, or change the taskGroupId to the action task's taskId (preferred).

I also need to figure out how we're creating the retrigger action task. If retrigger/backfill work the same way, we could use backfill to generate the second partial graph. If both try to add to the original taskGroup, we could change their behavior or add a new retrigger-with-new-taskgroup or backfill-with-new-taskgroup action for cot-verifiable graphs. [2]

[1] I need to verify this
[2] this too
(In reply to Aki Sasaki [:aki] from comment #3)

> b) it's run in a decision image!

er, decision workerType. I haven't verified the image.
See Also: → 1384134
It's a decision image too.

I'll leave Brian to comment on distinct taskGroupIds for action tasks.

I know we eventually want to have action tasks find any previous action tasks for a graph so that something like run-missing-tests can find out about tests that were run via actions and not re-run them.  Being in the same taskgraph is useful for this, and maybe whatever mechanism we choose could support CoT verifying (decision + later actions) as a unit, rather than treating actions as distinct decision tasks.

It's definitely easy to add a new action if this isn't quite the same as retrigger.  Most actions run a Python function in the decision environment, so you can do arbitrarily complex stuff there.
Flags: needinfo?(bstack)
If we did create a new taskGroupId, I'd download the task-graph.json from the action task and verify it as its own graph. If we didn't, I'd either need to switch to downloading/verifying the full graph json, or support downloading multiple task-graph.json files. Those are both possible, as long as we can somehow find the decision task + all appropriate action tasks.

I was thinking leaving behind breadcrumbs in task.extra might be a solution... Task `a-prime` was triggered by action task, but originally was taskId `a`.
Worth noting, it's not necessarily the case that action tasks or the tasks they create were defined in the original task-graph.json.  They may be *derived* from that file, but maybe not.  Probably some of those "maybe not" possibilities wouldn't pass CoT of course!
Yeah, I also don't know if the full graph has accurate info, since later steps in `mach taskgraph` can modify task definitions. We may also want to add, say, more partial update mars than were specified at checkout, or create a pushapk task with a different rollout percentage than is in the full graph.

That suggests we'll want to verify the action task's task-graph.json, and use the various methods (json-e, ideally, and maybe include the action task's allowlisted inputs) to make sure the action task is CoT-valid.
Summary: allow `mach taskgraph` to populate kind deps from a previous graph → cot-verifiable action tasks
After thinking about it for a while, the only technical reason we keep action tasks and their descendants in the same group is to make them show up in the same inspector in the tools site.

On a less-technical level, I think this is another case where we aren't really sure what the push-to-taskgroup mapping is supposed to be. My ideal world is one where all tasks involved in a push are in the same taskgroup, but I suppose that won't always be possible.

What if we kept them in the same task group and made them discoverable by having them always write themselves into the index next to the decision task? I believe we planned on doing that at some point anyway.
Flags: needinfo?(bstack)
(In reply to Brian Stack [:bstack] from comment #9)
> After thinking about it for a while, the only technical reason we keep
> action tasks and their descendants in the same group is to make them show up
> in the same inspector in the tools site.

I think that's a non-goal for me. It's nice as long as you don't mind the nightly graph being too slow to easily or quickly view currently. Adding in all retriggers and release graphs would add a ton of noise. I often don't care if all on-push tasks are green, but I care very much if the release graph is green. Having them all in the same taskgroup would make it difficult to tell.

> On a less-technical level, I think this is another case where we aren't
> really sure what the push-to-taskgroup mapping is supposed to be. My ideal
> world is one where all tasks involved in a push are in the same taskgroup,
> but I suppose that won't always be possible.

That may be achievable for on-push tasks, but I don't think that makes sense for nightly or release graphs on that same push.
We're already pushing 3000+ tasks in a single desktop nightly graph, which makes it unwieldy. I think if we want that, having a push-graph-viewer that can view multiple taskgroups makes sense.

> What if we kept them in the same task group and made them discoverable by
> having them always write themselves into the index next to the decision
> task? I believe we planned on doing that at some point anyway.

Possible, but beyond the tools.tc.net perf issues, we also have the issue of having to check all action tasks for CoT information for all CoT verified tasks in that group. Maybe if we had a breadcrumb in task.extra so we don't have to figure out which action/decision task to look at?
Enable chain of trust in action tasks: https://hg.mozilla.org/projects/date/rev/3234b744da40

We may want to publish the input json as an action task artifact, so when we have json-e verification, we can potentially rebuild the action task definition from the tree.

An "add new jobs" action task created a new taskGroupId! In theory this should be chain of trust verifiable now that we have chain of trust enabled for the action task.

The retrigger tasks action task reuses the original decision task's taskGroupId, so we'd have to have some breadcrumb that points at the action task. Right now I'm thinking task.extra.something.

Some possibilities:
task.extra.decisionTaskId
task.extra.actionTaskId
task.extra.originalTaskId -- if we want to point at the retriggered task and specify. Otherwise we could look at the label-to-taskid files from both the decision task and action task and find it.
task.extra.graphHistory.* -- if we want to kind of cluster all of this stuff in one location, rather than littered all over task.extra . Also taskHistory, taskGenealogy, taskBreadcrumbs, taskBreadCrumbs, bikeShedBlue, ...?
To me, it would be easier to have a pointer in the action task's artifacts, and change the taskGroupId to the action task's taskId. I'll see how it looks for both approaches.
(In reply to Aki Sasaki [:aki] from comment #11)
> We may want to publish the input json as an action task artifact, so when we
> have json-e verification, we can potentially rebuild the action task
> definition from the tree.

Hm, this may be problematic, because the action task doesn't have the json; it was built *from* the json, aiui.
I'm not sure if the thing that submits the json can put it somewhere.
CoT verification failed.
1) I need to allowlist the various mach commands. In addition to `./mach taskgraph decision`, we need to support at least `./mach taskgraph add-tasks`, `./mach taskgraph backfill`, and probably `./mach taskgraph action-callback`.
2) CoT verification for decision tasks expects `--option=arg` rather than `--option arg`. Opened https://github.com/mozilla/treeherder/pull/2740
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/8c50b4fff295f5831b04df5f788e042712b0cfa4
Bug 1393277 - Action tasks: Use --option=arg rather than --option arg (#2740)

We currently use `--option=arg` in the decision task [1], but
`--option arg` in treeherder actions. These are essentially equivalent.
In scriptworkers' chain of trust verification, we verify the
action or decision task's command is allowlisted. To do so, we ignore
options [2]. By specifying them as `--option=arg`, we're able to skip
verifying args as well as options.

We have plans to make this better, but until then, let's use
`--option=arg`.

[1] https://hg.mozilla.org/mozilla-central/file/ab2d700fda2b/.taskcluster.yml#l111
[2] https://github.com/mozilla-releng/scriptworker/blob/8437d4cd38abe30bcfd4f8ff4dc40615481fd490/scriptworker/cot/verify.py#L903
(In reply to Aki Sasaki [:aki] from comment #14)
> CoT verification failed.
> 1) I need to allowlist the various mach commands. In addition to `./mach
> taskgraph decision`, we need to support at least `./mach taskgraph
> add-tasks`, `./mach taskgraph backfill`, and probably `./mach taskgraph
> action-callback`.

All but the last of those are old-style actions.yml actions, and are going away shortly.

> 2) CoT verification for decision tasks expects `--option=arg` rather than
> `--option arg`. Opened https://github.com/mozilla/treeherder/pull/2740
As far as I can tell, this is the path retriggers take (I may be wrong):

The treeherder retrigger request [1] calls buildapi to perform the retrigger [2]. This may or may not actually be buildapi, but probably relates to pulse [3].

I searched buildapi for traces of retrigger, but now I'm guessing it's mozilla-taskcluster [4]. If I want to either create a new taskGroupId for the retriggered graph, or call an action task with a task.extra.breadcrumb, I would have to modify this logic.

Aiui, action tasks are under serious change atm. Brian, Dustin: are you planning on changing this current treeherder retrigger behavior any time soon? If so, I should probably hold off. If not, I can try to make changes so we can make retriggers chain of trust verifiable.

[1] https://github.com/mozilla/treeherder/blob/dfa920cbd8819ae72af4884e319081a914ae7477/ui/plugins/controller.js#L335
[2] https://github.com/mozilla/treeherder/blob/dfa920cbd8819ae72af4884e319081a914ae7477/ui/plugins/controller.js#L360
[3] https://github.com/mozilla/treeherder/blob/dfa920cbd8819ae72af4884e319081a914ae7477/ui/plugins/controller.js#L346-L350
[4] https://github.com/taskcluster/mozilla-taskcluster/blob/099675c703b272a0043d92df73feeea08dca8d6c/src/jobs/retrigger.js
Flags: needinfo?(dustin)
Flags: needinfo?(bstack)
Yeah, this is all changing now. It's a bit delayed due to all of the fuss with paths in the tree being messed up (bug 1394952 and friends). However my plan is for almost all of that function to go away and be replaced by an action task calling the retrigger action[0].

[0] https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py
Flags: needinfo?(bstack)
Ok! I'll hold off making changes for retriggers for now.
Flags: needinfo?(dustin)
Blocks: 1397762
Depends on: 1370343
Blocks: 1384134
(In reply to Brian Stack [:bstack] from comment #20)
> Yeah, this is all changing now. It's a bit delayed due to all of the fuss
> with paths in the tree being messed up (bug 1394952 and friends). However my
> plan is for almost all of that function to go away and be replaced by an
> action task calling the retrigger action[0].
> 
> [0]
> https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/
> retrigger.py

Per Greg, this action task work is almost all done, except retriggers. Is that accurate? Do we know when retriggers are going to happen through the retrigger action task?

(I see https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=e4261f5b96ebfd63e7cb8af3035ff9fea90c74a5&selectedJob=131969772 was retriggered, but I don't see an action task in treeherder.)
Flags: needinfo?(bstack)
The only work that needs to happen for retriggers is to integrate it into the treeherder button. I can work on that once https://github.com/mozilla/treeherder/pull/2761 lands (which is happening soon I hope).
Flags: needinfo?(bstack)
Priority: -- → P1
So far the framework is in place in-tree to support action tasks.  Even retrigger is implemented in -tree but requires to be hooked up to TH before being used.

If new actions need to be added, or you want to take a look at the framework, it should be good enough to start now (unless bstack disagrees).

Documentation can be found here: http://firefox-source-docs.mozilla.org/taskcluster/taskcluster/actions.html
Oh yeah, adding new tasks and poking around is totally good now and very encouraged. There are some minor changes happening to support bug 1400223, but for the most part things are stable!
https://github.com/mozilla-releng/scriptworker/pull/158 should fix cot verification of retriggers and other callback action tasks, as soon as :bstack's work (including task.extra.parent) lands. Johan is probably the best reviewer, and out until Oct 2; Brian thinks the task.extra.parent patches will land next week at some point.
Hm, what broke?
Attachment #8912121 - Flags: review?(aki)
We can't land this until bug 1400223 lands, which I can't land until we get the new version of treeherder deployed to prod. I will get that to happen asap. Sorry this took until this evening and I broke it during the day today! Hopefully this change makes sense. I'll do some more testing tomorrow.
Comment on attachment 8912121 [details]
Bug 1393277 - Record information about action tasks to support cot

https://reviewboard.mozilla.org/r/183498/#review188828

Thank you!
Attachment #8912121 - Flags: review?(aki) → review+
Comment on attachment 8913487 [details] [diff] [review]
enable chain of trust artifact generation in action tasks

Note that this is for the old-style actions.yml actions. Is that the change we want to make here?
(In reply to Brian Stack [:bstack] from comment #36)
> Comment on attachment 8913487 [details] [diff] [review]
> enable chain of trust artifact generation in action tasks
> 
> Note that this is for the old-style actions.yml actions. Is that the change
> we want to make here?

Ah. I think I'll just focus on the actions.json actions.
Attachment #8913487 - Flags: review?(bstack)
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/03e6ddd50880
Record information about action tasks to support cot r=aki
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/03e6ddd50880
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
We'll need the scriptworker PR + release.
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Comment on attachment 8914842 [details]
bug 1393277 - fix raw-file link in action tasks.

https://reviewboard.mozilla.org/r/186114/#review191130
Attachment #8914842 - Flags: review?(bstack) → review+
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e87d535aebd7
fix raw-file link in action tasks. r=bstack
Pushing out scriptworker 5.2.0.
This bug can close when autoland merges :D
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/e87d535aebd7
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.