Closed
Bug 1145695
Opened 10 years ago
Closed 9 years ago
[provisioner] store secrets
Categories
(Taskcluster :: Services, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jhford, Unassigned)
References
Details
We should:
- create a new secrets entity
- create a new endpoint to fetch secrets
- create a new endpoint to delete secrets, ensure that it fails when the secrets
can't be deleted
- store secrets in the new table with a slugid reference
- insert slugid of secrets in the UserData passed to worker
The flow is that a worker will get the token through UserData. That token will be posted to the fetch endpoint to get the secrets. The worker will then post to the delete endpoint to ensure that no process it starts up later is able to access the secrets by looking up UserData.
The provisioner can use this information to know when a worker is ready to accept jobs.
The provisioner could also use this to know to hard kill a machine if it hasn't claimed its secrets in >1 hour. We can assume that if a node doesn't have its secrets, it's probably stuck booting or something.
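A minimal sketch of the worker-side flow, assuming the token is embedded as JSON in UserData and using node-fetch; the endpoint paths and base URL here are illustrative, not a final API:

    let fetch = require('node-fetch');

    async function claimSecrets(baseUrl) {
      // 1. Read the token the provisioner embedded in UserData
      //    (via the EC2 instance metadata service)
      let res = await fetch('http://169.254.169.254/latest/user-data');
      let {token} = await res.json();

      // 2. Fetch the secrets referenced by the token
      res = await fetch(baseUrl + '/secret/' + token);
      if (!res.ok) {
        throw new Error('failed to fetch secrets: ' + res.status);
      }
      let secrets = await res.json();

      // 3. Delete the secrets so no process started later on this
      //    machine can replay the token it finds in UserData
      await fetch(baseUrl + '/secret/' + token, {method: 'DELETE'});

      return secrets;
    }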
Comment 1•10 years ago
Sorry this comment got long. The essentials are under (C).
> The provisioner could also use this to know to hard kill a machine if it hasn't claimed
> its secrets in >1 hour. We can assume that if a node doesn't have its secrets that it's
> probably stuck booting or something.
It's not quite that simple, as spot requests can sit around for a while, and you should create
the entry in the secrets table immediately before or after you create the spot request.
I'd add the following:
1) Entries in the secrets table must have an "expires" property.
2) Provisioning loop must scan secrets table for entries where ("expires" <= new Date() + 1 hour)
and ensure that any associated spot requests are cancelled.
3) Provisioning loop must scan secrets table for entries where ("expires" <= new Date()) and ensure
that any associated spot requests and ec2 instances are cancelled/terminated before the entity
is deleted (if cancelling/terminating fails while the spot-request/instance still exists, log an
error with [alert-operator] and don't delete the secret entity, hence we retry on next iteration)
Ordering of entry in secrets table and spot-request:
A) Create secrets entry, then spot-request:
- secrets entry is guaranteed to be there
- if spot request fails, secret entry will expire and be deleted (cleaning up state)
B) Create spot request, then secrets entry:
- secrets entry may not be present (if creation failed); docker-worker should be able
to handle the missing-secrets case gracefully, by logging [alert-operator] and gracefully
terminating at the end of the billing cycle.
(this is hard to implement, as we might stuff logging credentials in secrets)
- We can embed spot-request id in the secrets entry (this is nice)
Side note for even better design:
Spot requests have a "ValidUntil" date; we should set this to "now() + 30min" and
let entries in the secrets table expire at "now() + 90min".
This way we can avoid implementing step (2), where we cancel spot requests before
"expires" times out.
C) Create secrets entry, create spot request, update secrets entry:
- secrets entry is guaranteed to be there
- We'll have the spot-request id in the secrets entry in most cases
Implementation:
1) Create secrets entry with:
token: slugid.v4() // new random slugid
data: {...} // secrets
expires: now() + 90min
spotRequestId: null
2) Create spot-request with:
ValidUntil: now() + 30min
userData: {token: '...', ...} // embed token in user-data
3) Update secrets entry with spotRequestId from request, and hasSpotRequestId = 'true'
(Notice that we may die between any of these steps! See the sketch below.)
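A rough sketch of (C) in Node, assuming a base.Entity-backed table and the aws-sdk; the Secret class, launchSpec, and field names are illustrative, not the actual provisioner code:

    let slugid = require('slugid');

    async function provisionWithSecrets(ec2, Secret, launchSpec, secrets) {
      let token = slugid.v4();

      // (1) Create the secrets entry first, so it is guaranteed to exist
      let entry = await Secret.create({
        token:         token,
        data:          secrets,
        expires:       new Date(Date.now() + 90 * 60 * 1000), // now + 90min
        spotRequestId: null,
      });

      // (2) Create the spot request with the token embedded in UserData;
      //     ValidUntil lets AWS cancel the request for us if it stays open
      let result = await ec2.requestSpotInstances({
        ValidUntil: new Date(Date.now() + 30 * 60 * 1000),    // now + 30min
        LaunchSpecification: Object.assign({}, launchSpec, {
          UserData: Buffer.from(JSON.stringify({token})).toString('base64'),
        }),
      }).promise();

      // (3) Record the spot-request id on the entry; we may die before
      //     this step, so the expiry scan must tolerate a null id
      let requestId = result.SpotInstanceRequests[0].SpotInstanceRequestId;
      await entry.modify(function() {
        this.spotRequestId = requestId;
      });
    }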
In each iteration of the provisioning algorithm:

    // Use base.Entity.query with a server-side filter
    foreach secrets table entry where "expires" <= new Date():
      if entry.spotRequestId:
        Check if the request still exists in AwsManager; if so, cancel it
        (the spot request shouldn't be open because we set ValidUntil)
        Check if there is an instance created from this spotRequestId
        (again using AwsManager); if so, terminate the instance, as it
        failed to get and delete its secret data
      entry.remove() // delete entry
Note: If we die between step (2) and step (3), we will fail to terminate instances
that haven't fetched and deleted their secrets. But according to garndt, it is
very rare that instances fail to start. It's a super special corner case that
only happens because of unavailable network device issues or something like that.
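And a sketch of that cleanup pass, with the same illustrative Secret entity; the AwsManager helper methods used here (requestExists, cancelSpotRequest, instanceForRequest, terminateInstance) are assumptions about its interface, not its actual API:

    async function cleanupExpiredSecrets(Secret, awsManager) {
      // Scan for entries whose "expires" is in the past
      // (assumes an Entity.scan with a server-side filter and handler)
      await Secret.scan({
        expires: Secret.op.lessThan(new Date()),
      }, {
        handler: async (entry) => {
          if (entry.spotRequestId) {
            // Cancel the request if it is somehow still open
            // (ValidUntil should normally have closed it already)
            if (awsManager.requestExists(entry.spotRequestId)) {
              await awsManager.cancelSpotRequest(entry.spotRequestId);
            }
            // Terminate any instance spawned from the request: it never
            // fetched and deleted its secrets, so treat it as failed
            let instanceId = awsManager.instanceForRequest(entry.spotRequestId);
            if (instanceId) {
              await awsManager.terminateInstance(instanceId);
            }
          }
          await entry.remove(); // delete the entry itself
        },
      });
    }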
---
(C) isn't perfect, and we might be able to do better and simpler.
Consistency and robustness are key here.
Reporter
Updated•10 years ago
Blocks: tc-2015-q2
Comment 2•10 years ago
John and I started looking at this last week while John was over in Neuss. Work in progress is here: https://github.com/taskcluster/taskcluster-secrets/tree/first-pull-request
Comment 3•10 years ago
I think the provisioner secrets is different from the taskcluster-secrets idea, which aims at putting secrets in tasks as an alternative to encrypted env vars.
I could be wrong and we might have some miscommunication.
I see provisioner secrets as part of the provisioner itself, because the provisioner wants to use the fact that the worker called back to record statistics and possibly update internal state, so it can track nodes that fail to start.
Comment 4•10 years ago
I saw Vault got announced just the other day, worth taking a look at vs. building our own secrets store:
https://www.vaultproject.io/
Comment 5•9 years ago
I guess we can close this now thanks to John's hard work!
John, can you confirm all the requirements in comment 1 have been addressed?
Thanks!
Pete
Flags: needinfo?(jhford)
Reporter
Comment 6•9 years ago
yep! closing now
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(jhford)
Resolution: --- → FIXED
Updated•9 years ago
Component: TaskCluster → AWS-Provisioner
Product: Testing → Taskcluster
Assignee
Updated•6 years ago
Component: AWS-Provisioner → Services