Closed Bug 1145695 Opened 8 years ago Closed 7 years ago

[provisioner] store secrets

Categories

(Taskcluster :: Services, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Unassigned)

References

Details

We should:

 - create a new secrets entity
 - create a new endpoint to fetch secrets
 - create a new endpoint to delete secrets, ensure that it fails when the secrets
   can't be deleted
 - store secrets in the new table with a slugid reference
 - insert slugid of secrets in the UserData passed to worker


The flow is that a worker will get the token through UserData.  That token will be posted to the fetch endpoint to get the secrets.  The worker will then post the delete endpoint to ensure that no process it starts up later is able to access it by looking up UserData.

The provisioner can use this information to know when a worker is ready to accept jobs.

The provisioner could also use this to know to hard kill a machine if it hasn't claimed its secrets in >1 hour.  We can assume that if a node doesn't have its secrets that it's probably stuck booting or something.
Sorry this comment got long. The essentials us under (C).

> The provisioner could also use this to know to hard kill a machine if it hasn't claimed
> its secrets in >1 hour. We can assume that if a node doesn't have its secrets that it's
> probably stuck booting or something.
It's not quite that simple, as spot requests can sit around for a while, and you should create
the entry in the secrets table immediately before or after you create the spot request.

I add the following:
 1) Entries in the secrets table, must have an "expires" property.
 2) Provisioning loop must scan secrets table for entries where ("expires" <= new Date() - 1 hour)
    and ensure that any associated spot requests are cancelled.
 3) Provisioning loop must scan secrets table fro entries where ("expires" <= new Date()) and ensure
    that any associated spot requests and ec2 instances are cancelled/terminated, before the entity
    is deleted (on failure to cancel/terminate spot-request/instance should either exist, then log
    error with [alert-operator] and don't delete the secret entity, hence, we retry on next iteration)

Ordering of entry in secrets table and spot-request:
A) Create secrets entry, then spot-request:
   - secrets entry is guaranteed to be there
   - if spot request fails, secret entry will expire and be deleted (cleaning up state)

B) Create spot request, then secrets entry:
   - secrets entry may not be present (if creation failed), docker-worker should be able
     to handle missing secrets case gracefully, but logging [alert-operator] and gracefully
     terminating at end-of-billing cycle.
     (this is hard to implement, as we might stuff logging credentials in secrets)
   - We can embed spot-request id in the secrets entry (this is nice)

Side note for even better design:
  Spot requests have a "ValidUntil" date, we should set this to "now() + 30min", and
  let entries in secrets table expire "now() + 90min".
  This way we can avoid implementing step (2) where we cancel spot requests, before
  expires is timed out.

C) Create secrets entry, create spot request, update secrets entry:
   - secrets entry is guaranteed to be there
   - We we have the spot-request id in the secrets entry in most cases
   Implementation:
     1) Create secrets entry with:
           token:             slugid.v4()   // new random slugid
           data:              {...}         // secrets
           expires:           now() + 90min
           spotRequestId:     null
     2) Create spot-request with:
           ValidUntil:      new() + 30min
           userData:        {token: '...', ...} // embed token in user-data
     3) Update secrets entry with spotRequestId from request, and hasSpotRequestId = 'true'
        (Notice that we maybe die between any of these steps!)

     In each iteration of provisioning algorithm:
        // Use base.Entity.query with server-side filter
        foreach secrets table entry where "expires" <= new Date():
           if entry.spotRequestId:
             Check if request still exists in AwsManager, if so cancel it
             (The spotrequest shouldn't be open because we set ValidUntil)
             Check if there is an instance which was created from this spotRequestId,
             Again use AwsManager to check this, if so terminate the instance, as it failed
             to get and delete secret data.
           entry.remove() // delete entry

     Note: If we die between step (2) and step (3), we shall fail to terminate instances
           that haven't done get secret and delete secret. But according to garndt, it is
           very rare that instances fails to start. It's a super special corner case, that
           only happens because of unavailable network device issues or something like that.

---
(C) isn't perfect, and we might be able to do better and simpler.
Consistency and robustness is key here.
John and I started looking at this last week while John was over in Neuss. Work in progress is here: https://github.com/taskcluster/taskcluster-secrets/tree/first-pull-request
I think the provisioner secrets is different from the taskcluster-secrets idea which aims at putting secrets in tasks as an alternative to encrypted env cars.
I could be wrong and we might have some miscommunication.

For provisioner secrets see that as part of the provisioner, because the provisioner wants to use the fact that the worker called back to record statistics and possibly update internal state so I can track nodes that fails to start.
I saw Vault got announced just the other day, worth taking a look at vs. building our own secrets store:
https://www.vaultproject.io/
I guess we can close this now thanks to John's hard work!

John, can you confirm all the requirements in comment 1 have been addressed?

Thanks!
Pete
Flags: needinfo?(jhford)
yep!  closing now
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(jhford)
Resolution: --- → FIXED
Component: TaskCluster → AWS-Provisioner
Product: Testing → Taskcluster
Component: AWS-Provisioner → Services
You need to log in before you can comment on or make changes to this bug.