Closed Bug 1328719 (cot-v2) Opened 3 years ago Closed 2 years ago

[tracking] rebuild the decision task definition for chain of trust verification

Categories

(Taskcluster :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aki, Assigned: aki)

References

(Blocks 1 open bug)

Details

Attachments

(10 files)

857 bytes, patch
dustin
: review+
Details | Diff | Splinter Review
1.63 KB, patch
bstack
: review+
dustin
: review+
Details | Diff | Splinter Review
55 bytes, text/x-github-pull-request
jlorenzo
: review+
Details | Review
61 bytes, text/plain
jlorenzo
: review+
Details
3.03 KB, patch
mtabara
: review+
Details | Diff | Splinter Review
2.81 KB, patch
mtabara
: review+
Details | Diff | Splinter Review
2.88 KB, patch
mtabara
: review+
Details | Diff | Splinter Review
3.74 KB, patch
jlorenzo
: review+
Details | Diff | Splinter Review
2.94 KB, patch
nthomas
: review+
Details | Diff | Splinter Review
1017 bytes, text/plain
Details
In the chain of trust, the decision task is key.  It spawns all of the other tasks in the graph, andtheir task definitions can be verified against the decision task's task-graph.json.  However, the decision task itself could be any arbitrary task; it takes a considerable amount of code to try to determine whether the decision task appears legitimate and tied to a revision in-tree.

During the chain of trust tech talk, :jonasfj and I think :dustin suggested we rebuild the decision task's task definition from the in-tree info (most likely .taskcluster.yml).  If the decision task's task definition matches, and the other tasks match the decision task's task-graph.json, we can definitively say that the graph is generated from that revision in-tree.  We could then remove, or mark as optional, a lot of the other checks in chain of trust verification.

When I looked into this, I noted that this isn't a quick fix.  We will likely need to change how mozilla-taskcluster works for on-push decision task generation, and we definitely need to change how ourhooks-triggered nightlies work.

Here's my current list, which may grow.  These will turn into blocking bugs.

- decision task must be tied to a revision in-tree
 - pre-decision task for hooks? since the hook doesn't know which revision to use at trigger-time
 - we need to be able to see that revision at verify-time, either in the chain of trust artifact, orthe task definition.

- mozilla-taskcluster seems to hold a lot of information that isn't in .taskcluster.yml; ideally that information moves to .taskcluster.yml
 - this can be read as, ".taskcluster.yml doesn't have enough info to rebuild the decision task taskdefinition"
 - we could also clone mozilla-taskcluster and run against .taskcluster.yml at chain of trust verification time, but if mozilla-taskcluster changes independent of mozilla-central, previously good decision task definitions may suddenly become invalid.  probably best to have all information in-tree.

- the hooks decision task must also be verifiable; it should have a similar or same definition as anon-push decision task
 - decision task verification must then be able to handle nightly + on-push cases, and most likely more in the future.

- we have an allowlist for decision task docker image shas.  If we were somehow able to pin that shain-tree, we could get rid of the allowlist.  If the docker-image sha were similarly pinned in-tree, we could remove both allowlists.  This is likely a fragile part of chain of trust verification, so this would be a big win.
Depends on: 1328722
Depends on: 1328727
(In reply to Aki Sasaki [:aki] from comment #0)
> - we have an allowlist for decision task docker image shas.  If we were
> somehow able to pin that shain-tree, we could get rid of the allowlist.  If
> the docker-image sha were similarly pinned in-tree, we could remove both
> allowlists.  This is likely a fragile part of chain of trust verification,
> so this would be a big win.

It looks like we have this here: https://hg.mozilla.org/mozilla-central/file/a2741dd43eea/.taskcluster.yml#l82
I'm not sure we do the same thing with the docker-image sha.  Adding that, and fixing bug 1326436, would allow me to remove the docker image allowlists for decision and docker-image tasks, once this bug is resolved.
Depends on: 1326436
> the hooks decision task must also be verifiable; it should have a similar or same definition as anon-push decision task

I'm not sure this is necessary -- if the decision task it produces is legitimate, then it doesn't matter what produced it.

BTW, this is called a "cron task"
Assignee: nobody → aki
(In reply to Dustin J. Mitchell [:dustin] from comment #2)
> > the hooks decision task must also be verifiable; it should have a similar or same definition as anon-push decision task
> 
> I'm not sure this is necessary -- if the decision task it produces is
> legitimate, then it doesn't matter what produced it.
> 
> BTW, this is called a "cron task"

In my rough wording, the "pre-decision task" maps to the cron task, and the "hooks decision task" maps to the decision task that the cron task generates.

Agreed that the cron task doesn't need to match anything or need to be verified.
Also agreed that it's not necessary that the cron-generated decision task look exactly like the mozilla-taskcluster-generated decision task, as long as we can verify both.  However, the more similar they are, the more we can reuse the verification logic.
Depends on: 1252948
Depends on: 1324767
Duplicate of this bug: 1326436
Alias: cot-v2
Dustin: I think the dep is backwards. We need json-e to proceed with this bug, so bug 1372600 should block this bug. Agree?
Flags: needinfo?(dustin)
There's already a dep on the JSON-e implementation.  Deploying JSON-e depends on this bug being complete (this being one of the places to deploy it).  I'm using that tracker as a way of organizing all of the things required to call JSON-e "done".
Flags: needinfo?(dustin)
Ok. In https://github.com/escapewindow/scriptworker/commits/cotv2 , I have enough checks + hacked hardcodes to get

    verify_cot --task-type decision --cleanup -- A3klZsqPTieZ6q8cUOH2Iw

to only break on datestrings. I have plans for the hardcodes -- mainly I'm missing the hg pushlog and production-branches.json references.

The main question is, how do I override $fromNow to be based upon the task.created datestring? I tried https://github.com/escapewindow/scriptworker/commit/9352bf8f0cc7f25b657b6592d445b009a46a91da , but it appears the builtin takes precedence over the context.
Flags: needinfo?(dustin)
I suppose I can mock datetime.datetime.utcnow() during the call, but that seems pretty hacky.
I tried that in https://github.com/escapewindow/scriptworker/commit/53a95c1df4e6c6a5bf55c5cffb053ba8dae5d107 .
It looks like that won't work, because we calculate utcnow() for every datestring, so things like the expires datestring is off by a couple milliseconds:

AssertionError: [('change',
  'payload.artifacts.public.expires',
  ('2018-09-04T17:23:33.110Z', '2018-09-04T17:23:33.108Z'))]

We can proceed in a few ways:

- CoT can completely ignore datestrings
- CoT can fuzzy match datestrings
- JSON-e can allow for $fromNow to have an overridden `now`. The decision task should set this when it specifies `created` time, and build the others from that date, rather than keep calling utcnow(). Then CoT can rebuild the datestrings from the `created` time.
(In reply to Aki Sasaki [:aki] from comment #9)
> - JSON-e can allow for $fromNow to have an overridden `now`. The decision
> task should set this when it specifies `created` time, and build the others
> from that date, rather than keep calling utcnow(). Then CoT can rebuild the
> datestrings from the `created` time.

This should already be the case, for exactly this reason -- Brian, can you take a look?
Flags: needinfo?(dustin) → needinfo?(bstack)
We've added that to json-e as of a few days ago, but tools/th need to be updated to make use of it afaict. I'll add this to my to-do for my second round of updates to those sites.
Flags: needinfo?(bstack)
See Also: → 1396969
Ah, so if I understand, Aki's regeneration of the action task is correct (basing the times on the same "now") but the original generation of the action task is not?
Oh, I misunderstood in that case. I think we need a new `now` for action tasks since the `now` will be when they were triggered? I admit I don't totally understand what's going on here.
I'm trying to regenerate an on-push decision task atm.
So, I'm guessing mozilla-taskcluster isn't pinning 'now', so `task.created` and `task.payload.artifacts.public.expires` are built on different datetime.datetime.utcnow() calls.

What I would like to have happen is for

a) mozilla-taskcluster and all other things that create these json-e template tasks to pin `now`, then base all $fromNow calculations from `now`. Then CoT can use `task.created` to regenerate all datestrings deterministically.

b) for json-e to allow for overriding `now` in general. I couldn't figure out how, so all my $fromNow calls called `datetime.datetime.utcnow()`, rather than letting me specify what `now` means. I suppose I could

    tmpl = yaml.safe_load(path_to_taskcluster_yml)
    # recursively search tmpl for $fromNow, and change that to $hackedFromNow, so I can override the default `now`, or
    # use mock or something ugly,

or I could

    jsone.render(tmpl, context, now=previous_datestring_now)

Or override `now` in env or context. The latter seems much nicer than the mock / $hackedFromNow solution.
Now that we [will soon] have the ability to rebuild action tasks from their task.extra.action + the decision task's actions.json, we should add that work to the json-e work for cotv2.
Attachment #8939707 - Flags: review?(dustin)
Attachment #8939707 - Flags: review?(dustin) → review+
Keywords: leave-open
No longer depends on: 1324767
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5ce40a4ffe44
add task.extra.tasks_for to help rebuild the decision task. r=dustin
Dustin, Brian, I'm able to now verify an hg-push decision task via json-e.
For actions, I'm hitting a couple issues. I'm hoping this is clear, but I'm happy to clarify.

First, `task.payload.env.ACTION_TASK_GROUP_ID` doesn't match `context['action']['taskGroupId']`. We use `action.taskGroupId` in 4 places in `.taskcluster.yml` [1], and this the env var is the only place that differs. I'm guessing this is because the action task's taskGroupId is the decision task's taskId, but the graph that the action task spawns has a taskGroupId of the action task's taskId.

Here's an example action task that shows the different taskGroupIds: [2]. ACTION_TASK_ID is null, ACTION_TASK_GROUP_ID is TVwCEBYSQ3Wzsdpjht2_Zw, but extra.action.context.taskGroupId is B2WpjpnRSiqomo3VgChfLg .

Do you have a recommendation on how to fix? Maybe set this to `ownTaskId`? My current fix [3] is pretty ugly: render the json-e, then if it's an action task, munge `task.payload.env.ACTION_TASK_GROUP_ID`.

[1] https://hg.mozilla.org/mozilla-central/file/9099a6ed993f/.taskcluster.yml#l104
[2] https://queue.taskcluster.net/v1/task/TVwCEBYSQ3Wzsdpjht2_Zw
[3] https://github.com/escapewindow/scriptworker/blob/a3a9bb9eba2706579c3457926d13b5b3513531c7/scriptworker/cot/verify.py#L1200-L1205

Second, `{$json: {$eval: 'DICTIONARY_NAME'}}` results in an unsorted string. We do this twice that I've noticed so far [4][5]. Is there some way we can get sorted output so it's reproducible? Otherwise I may have to special case these env vars, `json.loads()` them, compare the loaded dicts, and ignore them in the comparison. The less I need to special case the better; this is currently looking fairly fragile against future changes. [6]

[4] https://hg.mozilla.org/mozilla-central/file/9099a6ed993f/.taskcluster.yml#l106
[5] https://hg.mozilla.org/mozilla-central/file/9099a6ed993f/.taskcluster.yml#l104
[6] https://github.com/escapewindow/scriptworker/blob/a3a9bb9eba2706579c3457926d13b5b3513531c7/scriptworker/cot/verify.py#L1167-L1196
Flags: needinfo?(dustin)
Flags: needinfo?(bstack)
(In reply to Aki Sasaki [:aki] from comment #19)
> Second, `{$json: {$eval: 'DICTIONARY_NAME'}}` results in an unsorted string.

Attempt at a fix: https://github.com/taskcluster/json-e/pull/221
Looks like I answered my own questions. Thank you rubber ducky! :)

As of e81a019eb8b5d0218e352ef191e12d2ed35694c0 in https://github.com/escapewindow/scriptworker/commits/cotv2 , I'm able to verify decision, action, and cron tasks. I removed the ACTION_TASK_GROUP_ID hack and added a temporary workaround until https://github.com/taskcluster/json-e/pull/221 lands and rolls out.

I did hit another issue: when rebuilding a cron task definition, task['tags'] is empty, so the rebuilt definition doesn't contain a 'tags' key. The actual task has a task['tags'] of {}. Any recommendations on how to deal with this? [1]

[1] https://github.com/escapewindow/scriptworker/blob/e81a019eb8b5d0218e352ef191e12d2ed35694c0/scriptworker/cot/verify.py#L1217-L1220
Flags: needinfo?(dustin)
Flags: needinfo?(bstack)
Working, with 100% test coverage:
https://github.com/mozilla-releng/scriptworker/compare/master...escapewindow:cotv2?expand=1

CoT v1 still works:

    verify_cot --task-type signing --cleanup -- ah37hplCR1GcwuxUPPl7VQ

(m-b nightly signing task)

TODO:
- land https://bugzilla.mozilla.org/attachment.cgi?id=8940495&action=edit
- wait for treeherder to update json-e https://whatsdeployed.io/?owner=mozilla&repo=treeherder&name[]=Stage&url[]=https://treeherder.allizom.org/revision.txt&name[]=Prod&url[]=https://treeherder.mozilla.org/revision.txt
- rebase branch to be more reviewable
- get review
Comment on attachment 8940495 [details] [diff] [review]
task.extra.cron + task.payload.env.ACTION_TASK_GROUP_ID

Sorry I didn't reply to the NI till now. I was confused about what you were asking until I saw this patch.

It's been a while since I looked at actions, but won't this change the semantics of what task_group_id is in an action task[0]? In [1] it is defined as the task group the action was triggered for. Mostly I'm trying to remember why we have them as possibly different values but I _think_ there was a reason.

[0] https://dxr.mozilla.org/mozilla-central/source/taskcluster/mach_commands.py#217
[1] https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/registry.py#118
Comment on attachment 8940495 [details] [diff] [review]
task.extra.cron + task.payload.env.ACTION_TASK_GROUP_ID

After talking in irc, I think this makes sense!
Attachment #8940495 - Flags: review?(bstack) → review+
Interesting.

I think the confusion here is because:
- release promotion actions create a new taskGroupId based on the action taskId
- other actions use the existing taskGroupId based on the action taskGroupId (the decision task's taskId).

https://tools.taskcluster.net/groups/e5cO0K3sSoSMl0bH3JfHsQ/tasks/L-d1CNGUQGuOs_-wckujtw/details is an 'Add new jobs' custom action task. It sets "ACTION_TASK_GROUP_ID": "L-d1CNGUQGuOs_-wckujtw", or `ownTaskId`, per the above patch. But https://tools.taskcluster.net/groups/L-d1CNGUQGuOs_-wckujtw doesn't exist; it created https://tools.taskcluster.net/groups/e5cO0K3sSoSMl0bH3JfHsQ/tasks/KL6JMqr8Rc2TuBENqi-RvQ/details in the same taskGroupId as the other tasks. (This is the desired behavior).

My reading:
- setting `ACTION_TASK_GROUP_ID` to `ownTaskId` doesn't seem to affect what taskGroupId the spawned tasks use... the release promotion tasks use the action task's taskId; other action tasks use the decision task's taskId.
- setting `ACTION_TASK_GROUP_ID` to `ownTaskId` does seem to help with cotv2 verification.

Which means the patch is either harmless or beneficial.

Testing other actions did surface a couple of cotv2 verification holes. Working on those.
Attachment #8940495 - Flags: review?(dustin) → review+
Attachment #8942416 - Flags: review?(jlorenzo)
Pushed by asasaki@mozilla.com
https://hg.mozilla.org/integration/mozilla-inbound/rev/4936907a3a91da2ca0908c31c66d6297bcde472a
task.extra.cron + task.payload.env.ACTION_TASK_GROUP_ID. r=bstack r=dustin
Flags: needinfo?(aki)
Flags: needinfo?(aki)
Attachment #8942416 - Flags: review?(jlorenzo) → review+
Attachment #8944125 - Flags: review?(mtabara) → review+
Attachment #8944126 - Flags: review?(mtabara) → review+
Attachment #8944127 - Flags: review?(mtabara) → review+
Attachment #8944129 - Flags: review?(nthomas) → review+
Attachment #8943856 - Flags: review?(jlorenzo) → review+
Comment on attachment 8944128 [details] [diff] [review]
bump pushapk scriptworker to scriptworker 8.0.0

Review of attachment 8944128 [details] [diff] [review]:
-----------------------------------------------------------------

r+ modulo the missing pushapkscript. Feel free to update the deps pointed out, if you're sure mozapkpublisher can make a TLS connection to Google Play. Otherwise, let's change them in a follow up.

::: modules/pushapk_scriptworker/manifests/init.pp
@@ +32,2 @@
>                  'chardet==3.0.4',
> +                'cryptography==2.1.4',

Nit: I haven't tested if such a version bump would break mozapkpublisher. On the other had, I would love to update the deps. They are about a year old now. If you tested that change, let's do it now. Otherwise, I'd prefer a follow up.

@@ -41,5 @@
> -                'multidict==2.1.6',
> -                'oauth2client==4.1.1',
> -                'pexpect==4.2.1',
> -                'ptyprocess==0.5.1',
> -                'pushapkscript==0.4.1',

Error: pushapkscript disappeared from the list :)

@@ +45,5 @@
> +                'oauth2client==4.1.2',
> +                'pexpect==4.3.1',
> +                'ptyprocess==0.5.2',
> +                'pyOpenSSL==17.5.0',
> +                'pyasn1==0.4.2',

Same comment as cryptography. Please see pyasn1-module too.
Attachment #8944128 - Flags: review?(jlorenzo) → review+
(In reply to Johan Lorenzo [:jlorenzo] from comment #37)
> Comment on attachment 8944128 [details] [diff] [review]
> bump pushapk scriptworker to scriptworker 8.0.0
> 
> Review of attachment 8944128 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r+ modulo the missing pushapkscript. Feel free to update the deps pointed
> out, if you're sure mozapkpublisher can make a TLS connection to Google
> Play. Otherwise, let's change them in a follow up.

Ok. And oops good catch re: pushapkscript =\
Rolled out.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
It looks like the fuzzy task matching isn't quite smart enough to handle the packages (and presumably toolchains) that get passed via an environment variable. If a package or toolchain is retriggered via treeherder, the downstream tasks will have an environment the differs in MOZ_TOOLCHAINS or DOCKER_IMAGE_PACKAGES environment variables.
(In reply to Tom Prince [:tomprince] from comment #41)
> It looks like the fuzzy task matching isn't quite smart enough to handle the
> packages (and presumably toolchains) that get passed via an environment
> variable. If a package or toolchain is retriggered via treeherder, the
> downstream tasks will have an environment the differs in MOZ_TOOLCHAINS or
> DOCKER_IMAGE_PACKAGES environment variables.

I think that means we need to retrigger via the custom action rather than treeherder's retrigger (related: bug 1420482). Fuzzy task matching is pretty limited and noisy, and I will be happy to see it go away in bug 1432364.
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/comm-central/rev/57678a6eaade
Add extra details to help Chain-of-Trust rebuild the decision task.
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
See Also: → 1555988
You need to log in before you can comment on or make changes to this bug.