Closed Bug 1287112 Opened 8 years ago Closed 8 years ago

enable chain-of-trust artifact generation in generic-worker

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(2 files)

+++ This bug was initially created as a clone of Bug #1284991 +++

Bug 1284991 is for docker-worker, this bug is for generic-worker. Everything else is the same.

========

For TaskCluster nightly security, we need chain of trust artifacts generated+signed by the workers.  This will be enabled by a boolean in the task definition (task.payload.features.generateCertificate = true).  It will be signed by an embedded GPG key, which will be unique per AMI.

* generate hashes for all task artifacts
* generate CoT json
* plaintext-sign the json with embedded private GPG key
* upload signed json

The chain-of-trust artifact is as follows: (public/certificate.json.gpg)

{
  "artifacts": [
    {
      "name": "public/live_backing.log",
      "sha256": "..."
    },
    ...
  ],
  "task": {
    // taskdefn
  },
  "taskId":      '...', // taskId of current task
  "runId":       ...,   // RunId of current run
  "workerGroup": '...', // workerGroup
  "workerId":    '...', // workerId
  "extra": {
    imageHash: '<hash(dockerImage)>', // as given by docker hub (maybe nice to have)

    imageArtifactSha256: '...', // sha256 of the image artifact, if any
    // link to docker image artifact builder if applicable

    region: 'us-west-2',

    instanceId: '...',

    instanceType: '...',

    publicIpAddress: '...',

    privateIpAddress: '...',
  }
}

then plaintext-gpg signed, so it's both human-readable and machine-verifiable.

Then add it to the list of artifacts to upload, and upload.

Ideally the private key is never in a human's hands, only the public key, which will be added to a git repo or gpg keyserver.  (Discussion on this in the previous bug 1284968 )
:jonasfj was thinking we'd implement in docker-worker first, and then once the rate of change settles, add it to the generic- and taskcluster- workers.  However, we do know we need CoT support for Windows and Mac tier 1 support, so adding it to the various workers in parallel may be worth the churn.

For generic-worker, the 'extra' section will probably look a bit different than docker-worker's.
Blocks: 1298553
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #8788424 - Flags: review?(garndt)
Attachment #8788426 - Flags: review?(garndt)
Aki, Greg,

Some comments/thoughts/questions:



1) Not embedding hash type in hash value

At the moment we have e.g.

  "artifacts": [
    {
      "name": "SampleArtifacts/_/X.txt",
      "hash": "sha256:8308d593eb56527137532595a60255a3fcfbe4b6b068e29b22d99742bad80f6f"
    },
    {
      "name": "public/logs/certified.log",
      "hash": "sha256:9e5898528bfedd5484283e7e7656f9bd940f35679c91bdd08832f77ea33a2c64"
    }
  ],

I think it might be better to have instead:

  "artifacts": [
    {
      "name": "SampleArtifacts/_/X.txt",
      "sha256": "8308d593eb56527137532595a60255a3fcfbe4b6b068e29b22d99742bad80f6f"
    },
    {
      "name": "public/logs/certified.log",
      "sha256": "9e5898528bfedd5484283e7e7656f9bd940f35679c91bdd08832f77ea33a2c64"
    }
  ],

One advantage is that it is easy to add hashes (e.g. MD5) and also possible to support more than one, give them different regex pattern for validation, etc.



2) Having a set of artifacts rather than a list

A set has the advantage that lookup is easier on a given artifact, and also that the entries are a set (so same artifact name cannot have two different values), e.g.:

  "artifacts": {
    "SampleArtifacts/_/X.txt": {
      "sha256": "8308d593eb56527137532595a60255a3fcfbe4b6b068e29b22d99742bad80f6f"
    },
    "public/logs/certified.log": {
      "sha256": "9e5898528bfedd5484283e7e7656f9bd940f35679c91bdd08832f77ea33a2c64"
    }
  },



3) Call "extra" something like "instance" or "environment" instead

The "extra" section stores instance metadata, e.g.:

  "extra": {
    "publicIpAddress": "12.34.56.78",
    "privateIpAddress": "87.65.43.21",
    "instanceId": "i123456789",
    "instanceType": "c3.2xlarge",
    "region": "us-west-2a"
  }

Should we name this "instance" or "environment" instead? In the case of docker-worker, it also has imageHash. I think "extra" is a bit vague, like "misc" or "other" it doesn't really capture what it represents, so could become a bit of a dumping ground.



4) No imageHash in generic worker certificate

Just want to highlight for the record that imageHash doesn't exist in generic worker certificate.json (although Aki I see you'd already thought about this in comment 1).



5) JSON Schema for certificate

Will we publish a json schema to validate the certificate.json against? This may be useful for stemming problems at source. We could publish it somewhere under https://schemas.taskcluster.net/ where we have all the other ones.



6) Generating keypairs on workers

I've exposed a generic-worker subcommand to generate a public/private opengpg keypair. I anticipate that this will be called during AMI creation for a given worker type, and this will be managed by Release Engineering. I did not make it part of the generic-worker install subcommand since the certificate creation might be outside of the installation workflow. In the gw codebase I've included an example of calling it. By detaching it from installation, it also means a keypair can be created outside of generic worker entirely, if required. Currently the command saves the private key, and writes the public key to standard out.



7) Signing OpenPGP keys

Does the generic worker need to be concerned with any parties signing the key it creates, or will that all be external to the generic worker?



8) .sig or .signed extension instead of .gpg for certificate?

I feel that the .gpg extension is misleading for certificate.json.gpg since normally this would represent a binary signed file, afaik. Would the extension .sig or .signed or something else be more idiomatic, and avoid confusion that people expect to be able to run `gpg -d` against the file?



9) Alternative name to "certificate" (to avoid confusion with OpenPGP certificates)

I feel that the term "certificate" for this json blob is also misleading, especially as in openpgp there is already the concept of a certificate, which has a different meaning. Since it stores that frozen inputs (e.g. imageHash) and frozen outputs (artifact hashes) together with the instance metadata, maybe something like executionMetadata.json.signed ?



I realise this is a lot of feedback that I gathered while working on the implementation, so I look forward to your responses! Let me know if you prefer me to set up a meeting instead...
Flags: needinfo?(garndt)
Flags: needinfo?(aki)
Note, the PRs in this bug are ready now for review (I've finished my work on them).
Comment on attachment 8788424 [details] [review]
Github Pull Request for taskcluster-docs

Switching over review to :dustin as :garndt is tied up at the moment.
Attachment #8788424 - Flags: review?(garndt) → review?(dustin)
Comment on attachment 8788426 [details] [review]
Github Pull Request for generic-worker

Switching review over to :wcosta as :garndt is tied up at the moment.
Attachment #8788426 - Flags: review?(garndt) → review?(wcosta)
Thanks for the brainstorming and feedback!

(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> 1) Not embedding hash type in hash value
>
> At the moment we have e.g.
>
>   "artifacts": [
>     {
>       "name": "SampleArtifacts/_/X.txt",
>       "hash":
> "sha256:8308d593eb56527137532595a60255a3fcfbe4b6b068e29b22d99742bad80f6f"
>     },
>     {
>       "name": "public/logs/certified.log",
>       "hash":
> "sha256:9e5898528bfedd5484283e7e7656f9bd940f35679c91bdd08832f77ea33a2c64"
>     }
>   ],
>
> I think it might be better to have instead:
>
>   "artifacts": [
>     {
>       "name": "SampleArtifacts/_/X.txt",
>       "sha256":
> "8308d593eb56527137532595a60255a3fcfbe4b6b068e29b22d99742bad80f6f"
>     },
>     {
>       "name": "public/logs/certified.log",
>       "sha256":
> "9e5898528bfedd5484283e7e7656f9bd940f35679c91bdd08832f77ea33a2c64"
>     }
>   ],
>
> One advantage is that it is easy to add hashes (e.g. MD5) and also possible
> to support more than one, give them different regex pattern for validation,
> etc.

Sure, that makes sense.  I wanted a way to specify the hash type rather than having it assumed `hash` is always sha256.  Allowing for multiple is a nice feature, though I'm not sure if we'll use it.

On changes to the CoT schema, in general:

* it's better to change things now than when there are multiple worker types live using this format, and relying on it.
 * maybe we should bake in a cot schema version here?
* you seem motivated to change generic-worker, and I'm hoping this code can get ported over to taskcluster-worker easily.  I'm motivated to change scriptworker once we decide on something.  Do we have someone to change the already-landed docker-worker?

> 2) Having a set of artifacts rather than a list
>
> A set has the advantage that lookup is easier on a given artifact, and also
> that the entries are a set (so same artifact name cannot have two different
> values), e.g.:
>
>   "artifacts": {
>     "SampleArtifacts/_/X.txt": {
>       "sha256":
> "8308d593eb56527137532595a60255a3fcfbe4b6b068e29b22d99742bad80f6f"
>     },
>     "public/logs/certified.log": {
>       "sha256":
> "9e5898528bfedd5484283e7e7656f9bd940f35679c91bdd08832f77ea33a2c64"
>     }
>   },

Sure.  I went with the whiteboard drawing we initially came up with, but this makes sense.

> 3) Call "extra" something like "instance" or "environment" instead
>
> The "extra" section stores instance metadata, e.g.:
>
>   "extra": {
>     "publicIpAddress": "12.34.56.78",
>     "privateIpAddress": "87.65.43.21",
>     "instanceId": "i123456789",
>     "instanceType": "c3.2xlarge",
>     "region": "us-west-2a"
>   }
>
> Should we name this "instance" or "environment" instead? In the case of
> docker-worker, it also has imageHash. I think "extra" is a bit vague, like
> "misc" or "other" it doesn't really capture what it represents, so could
> become a bit of a dumping ground.

The initial idea behind `extra` in the task definition was for task-specific stuff we didn't want to have to allow for in the schema; that's the same thing here.  Worker-specific info can go in here, and we don't have to change the schema.  Right now it's all environment info, and changing the name makes sense for the short term.  That may not always be the case, although I'm not coming up with any specific examples off the top of my head.  I'm on the fence on this one.

If we version the schema, which we probably should anyway, then we could add or rename in version 2, which allows us to be more specific in version 1.  In that case it would be much clearer that we could rename `extra` to `environment`, `instance, `workerEnv`, or something else more specific.

> 4) No imageHash in generic worker certificate
>
> Just want to highlight for the record that imageHash doesn't exist in
> generic worker certificate.json (although Aki I see you'd already thought
> about this in comment 1).

This is known.  `extra` or `environment` or whatever it's called will look different per class of worker.

> 5) JSON Schema for certificate
>
> Will we publish a json schema to validate the certificate.json against? This
> may be useful for stemming problems at source. We could publish it somewhere
> under https://schemas.taskcluster.net/ where we have all the other ones.

I took a stab at this, for validation.
https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/data/firefox_cot_schema.json

It errs on the side of more-permissible-than-needed rather than more-strict-than-needed; I'll be adding more programmatic checks for validation in the python code.

I put 'firefox' in the name because we're already planning on signing addons and servo which will likely have different validation rules.  The schema for the chain of trust artifact itself should stay the same, as long as they're using the same workers, unless I start enforcing rules about what's in the task definition.

Certainly, publishing this, or a more polished version, in an official TaskCluster location sounds like a good idea.

> 6) Generating keypairs on workers
>
> I've exposed a generic-worker subcommand to generate a public/private
> opengpg keypair. I anticipate that this will be called during AMI creation
> for a given worker type, and this will be managed by Release Engineering. I
> did not make it part of the generic-worker install subcommand since the
> certificate creation might be outside of the installation workflow. In the
> gw codebase I've included an example of calling it. By detaching it from
> installation, it also means a keypair can be created outside of generic
> worker entirely, if required. Currently the command saves the private key,
> and writes the public key to standard out.

For key management, I wrote https://github.com/taskcluster/taskcluster-docs/pull/123#discussion_r77671600  I should probably send this to the various email lists for feedback.
The pubkeys will be consumed by scriptworker.  Who creates the AMIs for generic-worker now?  For docker-worker I think that's the TC team.  For the hardware boxen I imagine relops will have a role.
STDOUT probably works, though we have to be careful because it's temporal.  We'll have to figure out details about how we collect that; if we lose it we'll have to regenerate the key.

> 7) Signing OpenPGP keys
>
> Does the generic worker need to be concerned with any parties signing the
> key it creates, or will that all be external to the generic worker?

Once we have the pubkey, that's a separate problem that I think the above key management comment addresses.

> 8) .sig or .signed extension instead of .gpg for certificate?
>
> I feel that the .gpg extension is misleading for certificate.json.gpg since
> normally this would represent a binary signed file, afaik. Would the
> extension .sig or .signed or something else be more idiomatic, and avoid
> confusion that people expect to be able to run `gpg -d` against the file?

Looks like the default output of `gpg --clearsign FILE` is FILE.asc
We can rename this to something.json.asc if we're in here making changes anyway.

> 9) Alternative name to "certificate" (to avoid confusion with OpenPGP
> certificates)
>
> I feel that the term "certificate" for this json blob is also misleading,
> especially as in openpgp there is already the concept of a certificate,
> which has a different meaning. Since it stores that frozen inputs (e.g.
> imageHash) and frozen outputs (artifact hashes) together with the instance
> metadata, maybe something like executionMetadata.json.signed ?

I tend to agree with `certificate` being overloaded and imprecise.
If we're making changes, I'm open to a rename.  `executionMetadata` doesn't particularly resonate.
If we're going with a rename I'd prefer `chainOfTrust.json.asc` because that name has stuck through this process, but I'm not going to bikeshed too long about it.

> I realise this is a lot of feedback that I gathered while working on the
> implementation, so I look forward to your responses! Let me know if you
> prefer me to set up a meeting instead...

We should probably send out an email thread for feedback before making changes.  We may need to start involving three teams: releng, relops, taskcluster, because the key management part will affect all three.  We also need an owner for the corresponding docker-worker changes if :garndt remains busy.
Flags: needinfo?(aki)
I should be able to help with the docker-worker changes once we define what exactly is changing.  Perhaps someone can open up a docker-worker bug detailing that once agreed upon.
Flags: needinfo?(garndt)
To sum up, it looks like we have the following cot artifact changes across docker-worker, generic-worker, and scriptworker:

1. artifacts now looks like

    "artifacts": {
      "path/to/artifact": {
        "sha256": "..."
      }
      ...
    }

2. Let's add a top level version at the top level.  I'm open about the name.

    "chainOfTrustVersion": 1

3. Rename 'extra' to 'environment'

    "environment": {
      "publicIpAddress": "12.34.56.78",
      "privateIpAddress": "87.65.43.21",
      "instanceId": "i123456789",
      "instanceType": "c3.2xlarge",
      "region": "us-west-2a"
    }

4. Rename 'certificate.json.gpg' to 'chainOfTrust.json.asc'.  I'm open about this name too.

Publishing the JSON schema and key management are their own problem spaces that can be solved independently of these changes.  Is that accurate?
Flags: needinfo?(pmoore)
I suppose we could also bikeshed about certified.log :)
Thanks Aki! I agree with all the changes in comment 11 - I'll implement in generic worker and make a pull request for docker worker (as I'm familiar with the code there now too, and that will save garndt some work).
Flags: needinfo?(pmoore)
(In reply to Aki Sasaki [:aki] from comment #12)
> I suppose we could also bikeshed about certified.log :)

lol. How about task.log?
(In reply to Aki Sasaki [:aki] from comment #11)
> To sum up, it looks like we have the following cot artifact changes across
> docker-worker, generic-worker, and scriptworker:
>    ......

Commit added to open PR to incorporate these changes:
  https://github.com/taskcluster/generic-worker/commit/1c654fe30c31c5409df12402f09d7478c9afc409

Example chainOfTrust.json (from the travis integration tests):
  https://public-artifacts.taskcluster.net/LXnkgilJQwyiJQu_72pecg/0/public/logs/chainOfTrust.json.asc
(In reply to Pete Moore [:pmoore][:pete] from comment #14)
> (In reply to Aki Sasaki [:aki] from comment #12)
> > I suppose we could also bikeshed about certified.log :)
> 
> lol. How about task.log?

Sure.

From channel, for posterity:
11:05 <aki> pmoore: I'm thinking, for try, maybe we use the committed test gpg keys?
11:06 <aki> assuming try AMIs
11:06 <aki> that potentially makes spinning up new try AMIs zero touch, as far as gpg keys are concerned.
11:07 <aki> either that or something in secrets if we don't want to use something where the privkey is published
11:13 — aki adds to bug
11:13 <•pmoore> aki: what about if we have a single key for try that we publish to public key stores. I'm wondering if this makes it easier for devs to trivially verify. Indeed the private key we could stick in taskcluster secrets
11:14 <aki> yes. i think the privkey not being published is good, and sticking with a single minimally-trusted key allows for ease of maintenance and use.  at the least i'll need the pubkey, and allowing other people to verify it sounds good too.
11:19 <•pmoore> :-)
Attachment #8788426 - Flags: review?(wcosta) → review+
We still have task.payload.features.generateCertificate iirc.  Do we want to rename this to task.payload.features.generateChainOfTrust ?  (Sorry about that, just remembered.)
Flags: needinfo?(pmoore)
Hi Aki,

Apologies, I changed this from task.payload.features.generateCertificate to task.payload.features.chainOfTrust without explicitly mentioning it. This is done and merged to master branch for both docker worker (change made in bug 1301401 - not yet deployed) and generic worker (from this bug - also not yet deployed).

Thanks,
Pete
Flags: needinfo?(pmoore)
Awesome! I'll get that flag set in the nightly tasks.
Blocks: 1304582
Attachment #8788424 - Flags: review?(dustin)
Released in generic-worker 5.4.0.
https://github.com/taskcluster/generic-worker/releases/tag/v5.4.0
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1309293
Component: Generic-Worker → Workers
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: