Open Bug 1492622 Opened 6 years ago Updated 5 years ago

add chain of trust verification support to generic-worker

Categories

(Taskcluster :: Workers, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

People

(Reporter: mozilla, Unassigned)

References

(Blocks 1 open bug)

Details

This will allow us to download-and-verify upstream artifacts on generic-worker, mount them for the task, and potentially shut off network access during the task to allow for hermetic Firefox CI. This is likely not in the cards until docker-worker is retired; filing for tracking.
This will end up replacing `fetch-content` and `mach artifact toolchain`. In the meantime, we could teach those to verify artifacts as a stop-gap.
+1. That will still not be foolproof unless it also checks cot signatures; that will likely be a non-starter until we switch to ecdsa signatures.
I was thinking about this some more, and I was wondering if generic-worker needs to *verify* signatures of the artifacts it downloads, or if it can simply record cryptographic hashes of the artifacts it downloads, and publish them. Then later scriptworker tasks in the graph can verify that checksum against the cot-signed checksum from the upstream tasks.
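To make the "record and publish hashes" idea concrete, here is a minimal Go sketch. It is not generic-worker code: the `artifactDigest` record, its field names, and the `recordDigest` helper are all hypothetical, and the sketch assumes SHA-256 as the hash.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
)

// artifactDigest is a hypothetical record of one downloaded upstream artifact.
type artifactDigest struct {
	TaskID string `json:"taskId"`
	Path   string `json:"path"`
	SHA256 string `json:"sha256"`
}

// digestOf streams r through SHA-256 and returns the hex digest, so a large
// artifact never has to be held in memory.
func digestOf(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// recordDigest hashes in-memory content and returns the record that a worker
// could publish alongside the task's other artifacts.
func recordDigest(taskID, path string, content []byte) (artifactDigest, error) {
	sum, err := digestOf(bytes.NewReader(content))
	if err != nil {
		return artifactDigest{}, err
	}
	return artifactDigest{TaskID: taskID, Path: path, SHA256: sum}, nil
}

func main() {
	// Stand-in for the bytes of a downloaded upstream artifact.
	rec, err := recordDigest("hypothetical-task-id", "public/build/target.zip", []byte("abc"))
	if err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent([]artifactDigest{rec}, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A downstream verifier would then compare each published `sha256` with the value signed by the upstream task. The sketch does not address how to guarantee that every download is recorded.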
Agreed, that was a stop-gap solution we brainstormed a while back. We didn’t have a way to enforce that all upstream artifacts are downloaded and their SHAs recorded in a trustworthy way. Also, verifying these downloads from scriptworkers downstream complicates scriptworker cot verification and can make us download and verify an artifact hundreds or thousands of times instead of once on the generic-worker task side. I think this could be a stop-gap solution, though it may have non-trivial costs in both human implementation time and worker compute time. I believe the best long-term solution is to have end-to-end cot verification.
Component: Generic-Worker → Workers

Hi Aki,

I like the idea of generic-worker verifying signed artifact signatures when mounting artifacts. Is there an authoritative source for the public key of a given worker type? I'm wondering how this would look for redeployable taskcluster environments. Perhaps worker manager should have an endpoint to return the public key of a given worker type?

Flags: needinfo?(aki)

(In reply to Pete Moore [:pmoore][:pete] from comment #5)

> I like the idea of generic-worker verifying signed artifact signatures when mounting artifacts.

For completeness, we should verify the entire chain back to the tree as well. Verifying the artifact sha and signature means that the artifact was created in a trusted workerType. Verifying the chain means that the request (task definition) came from a trusted tree.
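As a rough illustration of the signature half of that check, the Go sketch below verifies a detached signature over the bytes of a chain-of-trust document against a set of trusted public keys. The details are assumptions rather than the scriptworker format: it uses ECDSA on P-256 over a SHA-256 digest with ASN.1-encoded signatures, and `trustedKeys`, `verifySignedCoT`, and `demoKeyAndSignature` are hypothetical names. It does not verify the chain back to the tree.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// trustedKeys is a hypothetical set of public keys accepted for a workerType.
type trustedKeys []*ecdsa.PublicKey

// verifySignedCoT reports whether sig is a valid ASN.1 ECDSA signature over
// the SHA-256 digest of cotJSON for at least one trusted key.
func verifySignedCoT(cotJSON, sig []byte, trusted trustedKeys) bool {
	digest := sha256.Sum256(cotJSON)
	for _, pub := range trusted {
		if ecdsa.VerifyASN1(pub, digest[:], sig) {
			return true
		}
	}
	return false
}

// demoKeyAndSignature generates a throwaway key pair and signs cotJSON with it.
// It exists only so the sketch runs; a real worker would hold a long-lived key.
func demoKeyAndSignature(cotJSON []byte) (*ecdsa.PublicKey, []byte, error) {
	priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	digest := sha256.Sum256(cotJSON)
	sig, err := ecdsa.SignASN1(rand.Reader, priv, digest[:])
	if err != nil {
		return nil, nil, err
	}
	return &priv.PublicKey, sig, nil
}

func main() {
	cot := []byte(`{"taskId":"hypothetical-task-id"}`)
	pub, sig, err := demoKeyAndSignature(cot)
	if err != nil {
		panic(err)
	}
	tampered := []byte(string(cot) + " ")
	fmt.Println("original accepted:", verifySignedCoT(cot, sig, trustedKeys{pub}))
	fmt.Println("tampered accepted:", verifySignedCoT(tampered, sig, trustedKeys{pub}))
}
```

Accepting any key in the set is a deliberate simplification. Per-workerType keys would narrow the set that is consulted for a given upstream task.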

I'm currently thinking we should port scriptworker.cot.verify to a standalone module/tool that can both verify the chain and download artifacts with verification. Golang is a likely candidate so we can import it in generic-worker and distribute standalone binaries for others to use, rather than requiring they install a python virtualenv.

> Is there an authoritative source for the public key of a given worker type? I'm wondering how this would look for redeployable taskcluster environments. Perhaps worker manager should have an endpoint to return the public key of a given worker type?

Currently that's here. We should pull that out of scriptworker as well.

I really like the idea of exposing the cot public keys in worker manager. At some point we may have the capability of having public keys per workerType, not just per worker implementation, reducing the fallout if a key is compromised. However, I have misgivings about using it as the source of truth: if we can update our CoT trust through scopes, then CoT is no longer a second factor to scopes.

I gave this some thought and now I'm thinking that worker manager could have information about runtime cot public keys per worker implementation or workerType (possibly even per-worker?). But something like ci-configuration would have the definitive trusted set of public keys: by using vcs, we have an audit log, history, and reviews. (If we have standalone cot verification, we probably have the entire cot verification config landed there.)

Flags: needinfo?(aki)

A move of CoT into the platform should take the form of a proposal that leads to an RFC, and be considered on its own merits (that is, without much weight given to "that's how we do it now"). I think that would be a real strength for Taskcluster!

That's also a pretty substantial project, so we should think about the schedule for such a thing -- I would guess this could happen after docker-worker is deprecated and after we have migrated Firefox CI production, at least.

I've chatted with :aki about this some, and I'm not sure that I agree that full chain-of-trust verification should move into tree.

I do think it would be valuable to verify that the hashes of mounted task artifacts match the hashes recorded in chain-of-trust.json. I'm not sure if the workers should do any more work around verifying chain-of-trust.
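A hash-only check of that kind could look roughly like the Go sketch below. The layout it assumes for chain-of-trust.json (an `artifacts` object mapping artifact path to an object with a `sha256` field) is an assumption made for illustration, and `verifyArtifact` is a hypothetical helper, not an existing generic-worker function.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"strings"
)

// chainOfTrust models only the part of chain-of-trust.json this sketch reads.
// The artifacts -> path -> sha256 layout is an assumption.
type chainOfTrust struct {
	Artifacts map[string]struct {
		SHA256 string `json:"sha256"`
	} `json:"artifacts"`
}

// verifyArtifact returns nil only if path is listed in cotJSON and the
// SHA-256 of content equals the recorded digest.
func verifyArtifact(cotJSON []byte, path string, content []byte) error {
	var cot chainOfTrust
	if err := json.Unmarshal(cotJSON, &cot); err != nil {
		return fmt.Errorf("parsing chain of trust: %w", err)
	}
	entry, ok := cot.Artifacts[path]
	if !ok {
		return fmt.Errorf("artifact %q is not listed in the chain of trust", path)
	}
	sum := sha256.Sum256(content)
	got := hex.EncodeToString(sum[:])
	// Hex digests are compared case-insensitively.
	if !strings.EqualFold(got, entry.SHA256) {
		return fmt.Errorf("artifact %q: sha256 %s does not match recorded %s", path, got, entry.SHA256)
	}
	return nil
}

func main() {
	// The recorded digest is the SHA-256 of the three bytes "abc".
	cot := []byte(`{"artifacts":{"public/build/target.zip":{"sha256":"ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"}}}`)
	fmt.Println(verifyArtifact(cot, "public/build/target.zip", []byte("abc")))
	fmt.Println(verifyArtifact(cot, "public/build/target.zip", []byte("tampered")))
}
```

On its own this only shows that the mounted bytes match what the document says. It says nothing about whether the document itself came from a trusted worker.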

(In reply to Tom Prince [:tomprince] from comment #8)

> I've chatted with :aki about this some, and I'm not sure that I agree that full chain-of-trust verification should move into tree.
>
> I do think it would be valuable to verify that the hashes of mounted task artifacts match the hashes recorded in chain-of-trust.json. I'm not sure if the workers should do any more work around verifying chain-of-trust.

I think this is analogous to verifying that an SSL connection is signed, but not verifying that it's chained to a trusted CA, and then verifying all previous connections when we do something sensitive like see a password field or a credit card field. It's less secure overall: it opens up potential backdoor ways to add malicious traffic when we don't do the full check. It complicates the full check further, because we have to verify previous traffic as well as the current traffic. If the check happens in a library, you're essentially telling it to run a partial check most of the time, and then a greater-than-normal check at the most sensitive times; overall this will result in a more complex library. I believe that running the full check of the current task's inputs and request is the most sane way forward, and I agree that means we'll need to craft and RFC a proposal to make CoT part of the platform.

QA Whiteboard: [lang=go]

(In reply to Aki Sasaki [:aki] (he/him) (UTC-7) from comment #9)

> we'll need to craft and RFC a proposal to make CoT part of the platform.

Aki: given recent discussions around this, is this bug still valid?

Flags: needinfo?(aki)

This is still a want from my side to enable end-to-end CoT verification in firefoxci taskgraphs. I could see potentially adding generic-worker hooks to allow for pre-task calls, and then adding CoT checks to firefoxci pools.

As long as we're tracking it somehow, I'm open to whatever we do with this bug.

Flags: needinfo?(aki)