Closed Bug 1375182 Opened 2 years ago Closed 7 months ago

[generic-worker] Copy workerType secrets from generic-worker workerType definitions, to the taskcluster-secrets service

Categories

(Taskcluster :: Workers, enhancement)

enhancement
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(1 file)

All secrets in generic-worker worker type definitions need to be migrated to taskcluster-secrets, so that no confidential information exists in generic-worker worker type definitions.
Blocks: 1375155
No longer depends on: 1375155
Blocks: 1375195
No longer blocks: 1375155
Summary: Migrate all workerType secrets from generic-worker workerType definitions, to the taskcluster-secrets service → [generic-worker] Migrate all workerType secrets from generic-worker workerType definitions, to the taskcluster-secrets service
QA Contact: pmoore
Work continues in dependent bugs.
Component: Generic-Worker → Workers
Assignee: nobody → pmoore

Hey John,

Here is a program I wrote to migrate the content across. It is not beautiful, just because it is a one-time script.

Note, since the structure of a worker type definition request is slightly different to a worker type definition response (e.g. properties lastModified and workerType are not in the request but are in the response) I just used plain map[string]interface{} types for data serialisation/deserialisation.

I also did this to make sure if for any reason there was data in a worker type definition that was for some reason not in the json schema that the client library was built from, that it wouldn't get dropped quietly when we unmarshal it into a go type. Using map[string]interface{} helps to ensure that no properties get dropped on the way.

The program needs either a valid TASKCLUSTER_PROXY_URL environment variable to be set, or TASKCLUSTER_ROOT_URL/TASKCLUSTER_ACCESS_TOKEN/TASKCLUSTER_CLIENT_ID. It can be run with:

$ go run bug1375182_migrate-workertype-secrets.go

Thanks!

Attachment #9043293 - Flags: review?(jhford)
Comment on attachment 9043293 [details]
bug1375182_migrate-workertype-secrets.go

This looks reasonable to me
Attachment #9043293 - Flags: review?(jhford) → review+

Changing the bug title, since we're not ready to delete the secrets from the worker type definitions quite yet - that can only be done once the workers have been upgraded to use the new generic-worker release.

Summary: [generic-worker] Migrate all workerType secrets from generic-worker workerType definitions, to the taskcluster-secrets service → [generic-worker] Copy workerType secrets from generic-worker workerType definitions, to the taskcluster-secrets service

Thanks John!

I'll run it shortly... I'll put a note in #ci first...

(In reply to Pete Moore [:pmoore][:pete] from comment #5)

Thanks John!

I'll run it shortly... I'll put a note in #ci first...

The config update script seems to have run correctly, and worker type definitions and taskcluster secrets have been updated.

Rob, Simon,

I wanted to bring this bug to your attention. You will notice that the public generic-worker config settings from the "secrets" section of the worker type definitions is now duplicated under "userData" section of the worker type definitions, and new taskcluster secrets have been created, called worker-type:aws-provisioner-v1/<workerType> that now contain the secret keys.

The new workers (generic-worker 13.0.2 onwards) will use the userData section and the worker type definitions to retrieve their public config, together with the secrets from the secret service to get their private config, rather than the "secrets" section that generic-worker previously used. This enables us to publish the worker type definitions, which will then be secret-free.

When each worker type is upgraded to use generic-worker 13.0.2 (or later), we can remove the "secrets" section from its worker type definition, but for now the config exists in both sections so that they will work with old and new workers.

@Rob

I'm aware you are already upgrading to generic-worker 12 in bug 1524592. Bug 1375157 is the bug to upgrade to generic-worker 13. I'll put a separate needinfo for you in that bug about whether you prefer us to upgrade to v12 and v13 in separate steps, or if we should do them together in a single upgrade. The needinfo here was mostly to alert you to the new data under "userData" of worker type definitions, so you don't wonder where it comes from, and don't delete it etc, and bug 1375157 is about the associated rollout of the new worker version to OCC managed worker types.

@Simon

I can create a staging worker type for you for servo, if you like, so that you can test installing generic-worker 13.0.2 on it and running some test jobs. Note, version 12 introduced a new requirement, that in addition to creating the openpgp private key required by the chain-of-trust task feature, an additional ed25519 key needs to be created. The generic-worker --help text explains how to create it, but it is essentially the same as the way to create the openpgp key, just a slightly different option:

    generic-worker new-ed25519-keypair      --file ED25519-PRIVATE-KEY-FILE
    generic-worker new-openpgp-keypair      --file OPENPGP-PRIVATE-KEY-FILE

    new-ed25519-keypair                     This will generate a fresh, new ed25519
                                            compliant private/public key pair. The public
                                            key will be written to stdout and the private
                                            key will be written to the specified file.
    new-openpgp-keypair                     This will generate a fresh, new OpenPGP
                                            compliant private/public key pair. The public
                                            key will be written to stdout and the private
                                            key will be written to the specified file.

So the long and short of it is that you need to run both generic-worker new-ed25519-keypair and generic-worker new-openpgp-keypair before running the worker for the first time. I think servo doesn't use the chain of trust feature, but it isn't an optional feature so I'm afraid you will need to have both types of keys on the workers. Let me know if you have any questions!

Flags: needinfo?(simon.sapin)
Flags: needinfo?(rthijssen)

Is docker-worker affected at all? :wcosta manages the AMIs used by the aws-provisioner-v1/servo-docker-worker and aws-provisioner-v1/servo-docker-untrusted worker types.

I apparently do not have scopes to access worker-type:aws-provisioner-v1/servo-* secrets. I assume I don’t need them for AWS worker types as the provisioner will presumably manage by itself to bootstrap an appropriate client ID in new instances. However, what is the way forward for non-AWS workers? My scripts for provisioning them (namely for the proj-servo/macos and proj-servo/docker-worker-kvm worker types) are currently accessing the worker type definition of aws-provisioner-v1/servo-win2016 in order to find appropriate livelog secrets.

I think I have access to create a staging worker type myself. I’ll work on migrating our Windows AMI to generic-worker 13.

Flags: needinfo?(pmoore)

(In reply to Simon Sapin (:SimonSapin) from comment #7)

Is docker-worker affected at all? :wcosta manages the AMIs used by the aws-provisioner-v1/servo-docker-worker and aws-provisioner-v1/servo-docker-untrusted worker types.

The docker-worker secrets were also copied to the secrets service in the attached go program. The code changes to docker-worker to be able to read the secrets from the secrets service is tracked in bug 1375190, and rolling that out to existing docker-worker AWS worker types is tracked in bug 1375192.

I apparently do not have scopes to access worker-type:aws-provisioner-v1/servo-* secrets.

Apologies, I had forgotten to do this. I've since granted worker type secret read and write access for servo admins in role project:servo:grants/secrets for this. Let me know if this resolves the issue for you, or if you are still having problems with access.

I assume I don’t need them for AWS worker types as the provisioner will presumably manage by itself to bootstrap an appropriate client ID in new instances.

That is correct - see role worker-type:*.

However, what is the way forward for non-AWS workers?

This change only affects AWS workers, since the motivation of the change is to get secrets out of the AWS worker type definitions. One positive side effect of this is that it also gets secrets off the workers too, so we could look at making this functionality available generally, so that any workers can bootstrap their config via the taskcluster-secrets service, but that currently isn't the case.

My scripts for provisioning them (namely for the proj-servo/macos and proj-servo/docker-worker-kvm worker types) are currently accessing the worker type definition of aws-provisioner-v1/servo-win2016 in order to find appropriate livelog secrets.

I would recommend that they either get the secret from secret worker-type:aws-provisioner-v1/servo-win2016 or creating a new taskcluster secret or secrets specifically for those workers.

Note, the secret-fetching from the taskcluster-secrets service is only activated if the --configure-for-aws option is set when running the generic-worker. If we made this feature available outside of AWS, it would be possible e.g. to bootstrap the proj-servo/macos workers secret config from taskcluster secret worker-type:proj-servo/macos (assuming they run generic-worker). Let me know if this is desirable, or if you are happy to bootstrap the config outside of generic-worker, before it runs.

As noted above, the primary motivation of this change wasn't to get secrets off the workers, but to get secrets out of the AWS worker type definitions. However, if this is desirable, we can look into it.

I think I have access to create a staging worker type myself. I’ll work on migrating our Windows AMI to generic-worker 13.

Many thanks! Let me know if you get stuck.

Flags: needinfo?(simon.sapin)
Flags: needinfo?(pmoore)
No longer blocks: 1375195
Blocks: 1375157
No longer depends on: 1375157
Blocks: 1375176
No longer depends on: 1375176

(In reply to Pete Moore [:pmoore][:pete] from comment #6)

(In reply to Pete Moore [:pmoore][:pete] from comment #5)

I'll run it shortly... I'll put a note in #ci first...

The config update script seems to have run correctly, and worker type definitions and taskcluster secrets have been updated.

Closing bug as the secrets have been copied. The subsequent discussion was just informational, and any open topics can be tracked in separate bugs.

Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Blocks: 1527613

(In reply to Pete Moore [:pmoore][:pete] from comment #6)

I wanted to bring this bug to your attention. You will notice that the public generic-worker config settings from the "secrets" section of the worker type definitions is now duplicated under "userData" section of the worker type definitions, and new taskcluster secrets have been created, called worker-type:aws-provisioner-v1/<workerType> that now contain the secret keys.
The new workers (generic-worker 13.0.2 onwards) will use the userData section and the worker type definitions to retrieve their public config, together with the secrets from the secret service to get their private config, rather than the "secrets" section that generic-worker previously used. This enables us to publish the worker type definitions, which will then be secret-free.

just a (convoluted) question: does the use of the word "userData" (as the name of the new provisioner config section), indicate that this section will also be passed to spot instances as ec2 userdata? if not, could we consider using some terminology that doesn't conflict with terminology already in use and understood to have a specific meaning? the userData term is used extensively within our ec2 infra and denotes a specific body of text passed to ec2 instances. i am concerned that if we don't intend to pass this new section to instances as ec2 userdata, that using the same term will be unnecessarily confusing. if this section is going to be passed in to instances as ec2 userdata, i'm fine with the terminology but this raises questions about whether the format of the current data passed as ec2 userdata is going to change. occ (and probably other tools) currently reads and parses ec2 userdata that is significant to the configuration logic of running instances, so if generic worker has just claimed ec2 userdata as its own configuration space, without taking in to account how this space is already used, we'll need to modify existing tooling to handle the loss of this feature currently relied on.

I'm aware you are already upgrading to generic-worker 12 in bug 1524592. Do you prefer to continue with that, and I make a separate gw 13 PR, or do you prefer to jump straight to version 13 and skip version 12 entirely? I don't mind either way.

it depends on the answer to my question above. if ec2 userdata is being changed so that we can no longer rely on the data we used to parse from there, then i'd prefer a new bug for gw 13 because it will represent a new body of work to modify occ to accommodate for this change.

if ec2 userdata is not changing and we don't need to modify occ to accommodate, then it's a trivial change to modify the scope of bug 1524592 to encompass upgrading to gw 13 instead of 12.

Flags: needinfo?(rthijssen) → needinfo?(pmoore)

(In reply to Rob Thijssen [:grenade (EET)] from comment #10)

(In reply to Pete Moore [:pmoore][:pete] from comment #6)

I wanted to bring this bug to your attention. You will notice that the public generic-worker config settings from the "secrets" section of the worker type definitions is now duplicated under "userData" section of the worker type definitions, and new taskcluster secrets have been created, called worker-type:aws-provisioner-v1/<workerType> that now contain the secret keys.
The new workers (generic-worker 13.0.2 onwards) will use the userData section and the worker type definitions to retrieve their public config, together with the secrets from the secret service to get their private config, rather than the "secrets" section that generic-worker previously used. This enables us to publish the worker type definitions, which will then be secret-free.

just a (convoluted) question: does the use of the word "userData" (as the name of the new provisioner config section), indicate that this section will also be passed to spot instances as ec2 userdata? if not, could we consider using some terminology that doesn't conflict with terminology already in use and understood to have a specific meaning? the userData term is used extensively within our ec2 infra and denotes a specific body of text passed to ec2 instances. i am concerned that if we don't intend to pass this new section to instances as ec2 userdata, that using the same term will be unnecessarily confusing. if this section is going to be passed in to instances as ec2 userdata, i'm fine with the terminology but this raises questions about whether the format of the current data passed as ec2 userdata is going to change.

Yes, the userData section of the worker type definition is passed to spot instances via ec2 userdata. The ec2 userdata is set by the AWS Provisioner, and is a json object. The content of the userData property of the worker type definition is passed via the data property of that json object. This isn't a change, this has always been the case. In fact, it is a little more sophisticated - userData properties can be set at a region level, an AZ level, an instance-type level, and globally. These objects get merged, and passed in via the data property in the ec2 userdata json object. See the AWS provisioner docs.

occ (and probably other tools) currently reads and parses ec2 userdata that is significant to the configuration logic of running instances, so if generic worker has just claimed ec2 userdata as its own configuration space, without taking in to account how this space is already used, we'll need to modify existing tooling to handle the loss of this feature currently relied on.

As noted above, that is not the case.

I'm aware you are already upgrading to generic-worker 12 in bug 1524592. Do you prefer to continue with that, and I make a separate gw 13 PR, or do you prefer to jump straight to version 13 and skip version 12 entirely? I don't mind either way.

it depends on the answer to my question above. if ec2 userdata is being changed so that we can no longer rely on the data we used to parse from there, then i'd prefer a new bug for gw 13 because it will represent a new body of work to modify occ to accommodate for this change.

if ec2 userdata is not changing and we don't need to modify occ to accommodate, then it's a trivial change to modify the scope of bug 1524592 to encompass upgrading to gw 13 instead of 12.

Thanks. In that case, if you are happy with the answers above, we can resolve bug 1375157 as a duplicate of bug 1524592, and extend the scope of bug 1524592 to encompass v13 too.

Flags: needinfo?(pmoore)
You need to log in before you can comment on or make changes to this bug.