If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

generic-worker fails cot generation when "signingKeyLocation": "not-configured"

NEW
Unassigned

Status

Infrastructure & Operations
RelOps
3 months ago
3 months ago

People

(Reporter: aki, Unassigned)

Tracking

Details

(Reporter)

Description

3 months ago
STR:

- create a generic-worker task that has task.payload.features.chainOfTrust set to True
- run it on a workerType that has a signingKeyLocation that is not-configured

Expected:

- generic worker generates and uploads the chain of trust artifact.  This can either be unsigned, or signed with a key that isn't marked as valid in the cot-gpg-keys repo.

Actual:

- generic worker hits a retry loop

15:21 <•garndt> Callek: this seems to be related to certificate of trust issues....did you test this out on try and those builds worked on a gecko-1-b-win2012 worker type?  
15:22 <•garndt> basically what I'm finding is the worker is panicing now finding the cot signing key...and it loads that from configuration....and in the worker type definition for all 3 builders I see ""signingKeyLocation": "not-configured""
15:23 <•garndt> I'm not exactly sure what should be set there, but I think that's the problem
15:24 <aki> ah, generic worker should either have a default untrusted key, or upload an unsigned cot artifact in that case
15:24 <aki> i can file a bug
15:25 <•garndt> basically it's barfing here: https://github.com/taskcluster/generic-worker/blob/v8.2.0/chain_of_trust.go#L134-L137
15:26 <•garndt> based on this stack:
15:26 <•garndt> https://www.irccloud.com/pastebin/hTn2IkN5/
(Reporter)

Comment 1

3 months ago
Ah, these jobs were run on gecko-3-b-win2012, which actually should have had their key locations configured:

https://tools.taskcluster.net/task-inspector/#Y0u2VL5KQ5mfEP8T8MCRDQ/

That's a 2nd thing that's broken here.
Assignee: nobody → relops
Component: Generic-Worker → RelOps
Flags: needinfo?(rthijssen)
Product: Taskcluster → Infrastructure & Operations
QA Contact: arich
pmoore: how does gw handle the cot key? if i configure the location of an ascii armoured private key in the gw config, will it know what to do with it?
Flags: needinfo?(rthijssen)
key location set for gecko-(1,2,3)-b-win2012
Flags: needinfo?(pmoore)
i retriggered the job with the key location set in config and it appears to have gone green:
https://treeherder.mozilla.org/#/jobs?repo=date&revision=aa94203ca829fa2ab210b6953d459a203b5f4e6a&selectedJob=106186258
Flags: needinfo?(pmoore)
Thanks Rob!

Aki, I think we can close this. According to the docs, generic-worker requires a gpg key (it isn't optional) and also provides a mechanism to generate one[1] if required (of course a custom one generated by a different means can be provided if preferred). If the worker finds an invalid configuration setting at runtime, or finds a valid filename configured, but that file contains e.g. an invalid key, I think it is correct for the worker to report the incident, and then shut down. This gives another worker the chance to take the job. In general I think it is reasonable for a worker to expect a valid configuration, and to shut down (fail fast) if it discovers it has an invalid setting. For example, it might not know if the file path was a valid path, but the file system was compromised, or if the file path was never valid, etc.

If there is a retry mechanism as play that we wish to disable, that should probably lie outside the responsibility of the worker. However, the worker does not know if it is a faulty worker (and others will work) or if all workers are just incorrectly configured. Therefore, we have only 5 retries, which limits the amount of wasted resources already. Therefore I don't think we need any further special handling around this to prevent retries.

Lastly, this should only be a problem when first setting up a worker type - once we have a valid config, we are good-to-go.

--

[1] https://github.com/taskcluster/generic-worker/blob/3c0be1da3301b35dcc424bace940e49dddfe7bd1/main.go#L87-L90
Flags: needinfo?(aki)
(Reporter)

Comment 6

3 months ago
(In reply to Rob Thijssen (:grenade - UTC+3) from comment #3)
> key location set for gecko-(1,2,3)-b-win2012

Thanks!
Only gecko-3-b-win2012's key is in cot-gpg-keys, correct?

(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> Thanks Rob!
> 
> Aki, I think we can close this.

Once we clarify the above question, I agree.  We can reopen or file new if there are additional issues.
Flags: needinfo?(aki) → needinfo?(rthijssen)
all three public keys are in the repo: https://github.com/mozilla-releng/cot-gpg-keys/tree/master/generic-worker
Flags: needinfo?(rthijssen)
(Reporter)

Comment 8

3 months ago
Ok. This means Try- and project- windows boxes are able to generate release-valid builds atm.

I'm going to remove the level 1+2 gpg keys from the repo, so the cot artifacts from these will be signed, but not valid.  This will match the docker-worker configuration.
(In reply to Aki Sasaki [:aki] from comment #8)
> I'm going to remove the level 1+2 gpg keys from the repo, so the cot
> artifacts from these will be signed, but not valid.  This will match the
> docker-worker configuration.

So long as there is a valid openpgp key on the worker that the worker is configured to use, it will use it to generate the public/chainOfTrust.json.asc artifact. If the configuration does not specify a valid openpgp key, the worker will self-terminate.

I remember at one point we discussed the possibility of level 1 workers sharing a static private key, whose public counterpart would be published to gpg.mozilla.org - is that still the case? The idea was that it still offered some basic protection, and had the advantage that only a single key needed to be managed across all worker types and deployments (with the option of regenerating if it was compromised). Would this also be suitable for level 2?
Flags: needinfo?(aki)
(Reporter)

Comment 10

3 months ago
Tl;dr: We can do that, especially if it simplifies the deployment of generic-worker, but it's not currently supported by scriptworker.

Since we currently control valid/non-valid via gpg homedirs, we have 4 gpg homedirs atm: 1 for validating the gpg repo, one for scriptworker, one for generic-worker, and one for docker-worker. We'll have a fifth if and when we support taskcluster-worker. If we added level1/2 keys, we'd need at least one more homedir, and possibly one more per worker impl.

We don't have keyserver support, and I'm not sure we're going to. We could add that, but if the key is static or only rotated every number of months or years, adding it to the repo with special rules could work.

> The idea was that it still offered some basic protection

What use case are you seeing here? Someone MITMs a try or project branch that's signed with a dep (non-nightly non-release) key, and convinces someone else to install it?
Flags: needinfo?(aki)
(Reporter)

Comment 11

3 months ago
(In reply to Aki Sasaki [:aki] from comment #10)
> > The idea was that it still offered some basic protection
> 
> What use case are you seeing here? Someone MITMs a try or project branch
> that's signed with a dep (non-nightly non-release) key, and convinces
> someone else to install it?

Since both try and project branches don't require review to land, at best CoT on these branches would let us know that the graph was generated through valid in-tree decision task processes, but with unreviewed code. I'm not sure that gives us any protection, but I may be missing something.
(In reply to Aki Sasaki [:aki] from comment #11)
> Since both try and project branches don't require review to land, at best
> CoT on these branches would let us know that the graph was generated through
> valid in-tree decision task processes, but with unreviewed code. I'm not
> sure that gives us any protection, but I may be missing something.

The use case I was considering is that a try user wants to be sure (for himself/herself) that an artifact he sees in the task inspector hasn't been tampered with. This could only be tampered with if an attacker managed to get into the sandbox of the running task (i.e. to gain TC credentials and access to the signing key) or if attacking from outside the sandbox, the attacker would need to attain both valid credentials to upload task artifacts and resolve the task, and the private try key, in order that the chain of trust file has a valid signature (which they could validate independently using a pgp client).

The added security was just that the private try key would also need to be compromised, as well as taskcluster credentials, so just makes an attack harder. This was more about securing try push results for casual users, rather than for securing release builds. Although probably not extremely useful, it could help provide confidence to try users that wish to validate the signatures independently.
(Reporter)

Comment 13

3 months ago
Sure, that makes sense.

Try also runs the risk of having Mallory-with-level1-privs launch a large number of tasks to try to compromise the private key. This is possible on level 3 as well, but with more eyeballs and a smaller number of users. Once a level1 worker privkey is compromised, all chains that see that key as valid are suspect.

We can try to mitigate this in various ways.

If we add this to generic-worker, we still won't be able to verify the full chain until docker-worker and the upcoming scriptworker dep pool also have verifiable level1 keys.
You need to log in before you can comment on or make changes to this bug.