Closed Bug 1336050 Opened 7 years ago Closed 7 years ago

[tracker] create and deploy host-secrets service

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: grenade, Assigned: dustin)

References

()

Details

Attachments

(2 files)

61 bytes, text/x-github-pull-request
jhford: review+
62 bytes, text/x-github-pull-request
dustin: review+
grenade: feedback+
Clients request temporary credentials; the service should validate reverse DNS (IP/MAC) and issue temporary scopes.
Assignee: relops → rthijssen
Hi Rob, I've started a simple project for this here: https://github.com/jhford/taskcluster-hardware-secrets I can maybe hack on this a bit more, but I'm also happy to do reviews if you have time to hack on it.

This is mostly a rough skeleton for how this could be implemented using TC APIs and tooling.  Building on those makes this service a lot easier to integrate from an admin standpoint.  It also means that it works similarly to other taskcluster services.

The things that need to be done still:

1. Docs.  They're placeholders now, but need to be written
2. An implementation of src/api.js:isIpAllowed(ip, allowed)
3. Re-enable scopes (turned off for easier development)
4. Write unit tests -- basically reimplement test.sh in a better way
5. JSON Schemas for input and output
6. Design structure for allowed field of the Secret entity

To run what I've done so far, clone the repo and run 'npm install'.  The server can then be started with:

npm run compile && NODE_ENV=development PORT=8080 node ./lib/main development

and then, in another terminal, run test.sh after changing the host names so they don't point at my work desktop :)

Regarding number 2 above, I'm not an expert on matching IPs or DNS/reverse-DNS.  I'm also not sure how to structure the allowed field.

Let me know if you have any questions
This is awesome John. Thank you!
Nice!  It's exciting to think we could build something like this so quickly!

To summarize my understanding from looking at the repo, including
comments at the top of api.js:

 * CRUD endpoints for updating named secrets, using normal auth scopes
 * getting a secret allows either taskcluster, per-IP, or per-token
authentication
 * secrets will have an "allow" configuration that determines what IPs
and/or tokens can get them.

Did I miss anything?

Storing secrets directly in this service seems to duplicate the work
of the secrets service.  Could we design a service that just issues TC
credentials, which hosts can then use to fetch secrets (or run a
worker, or anything else)?

I suspect such a service could operate with no Azure tables.  It would
just issue temporary credentials with a role based on the requesting
hostname.  If this occurred on every reboot, then the 30-day maximum
for temporary credentials wouldn't be a problem.  Then the role could
determine which secrets are available to which hosts, in an
inspectable fashion.  Something like
`assume:project:releng:host:com.mozilla.scl3.releng.build.b-3-win2012-0010`,
with scopes assigned to prefixes like
`project:releng:host:com.mozilla.scl3.releng.build.b-3-*`

I can definitely help with the forward/reverse DNS resolution.  I can
help with deployment, too.
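
For illustration, issuing such credentials with taskcluster-client might
look roughly like this (just a sketch -- the reversed-hostname scope follows
the example above, and the option names should be double-checked against
the client library):

  // Rough sketch: given a hostname that has already passed the
  // forward/reverse DNS check, issue short-lived credentials whose only
  // scope is the corresponding host role.
  let taskcluster = require('taskcluster-client');

  function issueCredentialsFor(hostname) {
    // b-3-win2012-0010.build.releng.scl3.mozilla.com
    //   -> com.mozilla.scl3.releng.build.b-3-win2012-0010
    let reversed = hostname.split('.').reverse().join('.');
    return taskcluster.createTemporaryCredentials({
      start: new Date(),
      expiry: taskcluster.fromNow('1 day'),
      scopes: [`assume:project:releng:host:${reversed}`],
      credentials: {  // the issuing service's own permanent credentials
        clientId: process.env.TASKCLUSTER_CLIENT_ID,
        accessToken: process.env.TASKCLUSTER_ACCESS_TOKEN,
      },
    });
  }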
A few notes from our conversation just now:

 - the factoring in comment 3 probably does make sense: issue credentials, and let the host get secrets from the secrets service
 - in addition to forward/reverse DNS, we should also whitelist IP ranges and require a shared token as additional protection measures
 - this will need to be deployed within the releng network; perhaps we could deploy it on the puppetmasters, which are already high-security, redundant, and well-maintained

And it occurred to me after our conversation that we don't have a name!  Ideas: `datacenter-auth`, `host-credentials`, `datacenter-credentials`...
Oh, "hardware-secrets", from the repo in comment 1, is good.  So we do have a name :)
I threw together some code to verify forward/reverse DNS:
  https://gist.github.com/djmitche/6977bd19fb37b99510af4fb4c9a1dc2c

I don't think it makes sense to release this as an actual npm library -- it was just easiest to set it up and run it that way.  I admit I didn't write a lot of tests, because I can't find any public IPs that don't have good forward/reverse.  In production, we can write a fake dns module and test against that.
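
The core idea there is forward-confirmed reverse DNS.  A minimal sketch of that check with Node's built-in dns module (not the gist code itself, just the shape of it):

  // Accept an IP only if it has exactly one PTR record, and that hostname
  // resolves back to the same IP (forward-confirmed reverse DNS).
  const dns = require('dns');
  const util = require('util');
  const reverse = util.promisify(dns.reverse);
  const resolve4 = util.promisify(dns.resolve4);

  async function fcrdns(ip) {
    let names = await reverse(ip);            // PTR lookup
    if (names.length !== 1) {
      throw new Error(`expected exactly one PTR record for ${ip}`);
    }
    let addrs = await resolve4(names[0]);     // forward lookup
    if (!addrs.includes(ip)) {
      throw new Error(`${names[0]} does not resolve back to ${ip}`);
    }
    return names[0];                          // the validated hostname
  }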
One thought about the name is that I'd like to not let hardware diverge too far from our cloud instance configuration. I'm thinking that perhaps our existing Windows cloud instances should be adapted to also use this service to obtain their secrets. If we did that, it might make sense to lose the "hardware" element in the service name. Since taskcluster-secrets would be confusing, perhaps taskcluster-secrets-proxy?
Some thoughts on how we might implement secret acquisition from hardware in the dc:

- on boot, hardware-instance checks whether it has an unexpired gpg key; if not, it creates one and publishes it to the mozilla keyserver or to some new git repo that accepts commits from untrusted sources
- something monitors the keyserver/git-repo for newly added keys and alerts in an IRC channel
- human monitors IRC channel and adds public key (if trusted) to metadata-service-encryption-key-list
- datacentre-metadata-service publishes metadata-secrets as .gpg files encrypted with all public keys in the metadata-service-encryption-key-list
- hardware-instance obtains encrypted taskcluster-secrets-token from datacentre-metadata-service
- hardware-instance decrypts taskcluster-secrets-token and uses it to access taskcluster-secrets-proxy
- taskcluster-secrets-proxy validates the token and reverse DNS and, if appropriate, makes taskcluster-secrets available to hardware-instance

hardware-instance: a hardware instance in one of our datacentres
human: anyone in the release operations team (maybe releng/buildduty/taskcluster too?)
datacentre-metadata-service: a service hosted at ip 169.254.169.254 inside the datacentre and on the same local network as the hardware-instance
metadata-service-encryption-key-list: a list of trusted hardware-instance public keys
taskcluster-secrets-token: a shared token trusted by the taskcluster-secrets-proxy
metadata-secrets:
 - root user credentials for hardware-instance
 - worker user credentials for hardware-instance
 - taskcluster-secrets-token
 - build/test secrets (oauth and api keys used by builds and tests)
taskcluster-secrets:
 - livelog credentials
 - generic-worker configuration secrets
dustin, jhford, markco: do the ideas above gel with how you envisaged this to work?
Flags: needinfo?(mcornmesser)
Flags: needinfo?(jhford)
Flags: needinfo?(dustin)
In general I like it. 

The only part I am concerned about is the "service hosted at ip 169.254.169.254" piece. That part might get a little sticky network-wise.
Flags: needinfo?(mcornmesser)
We could certainly run the same service in EC2 and in the datacentre.  But the situation is a little different in each:
 - in the datacenter, there's a strong association between IP and role; in EC2, the IP is basically random
 - in the datacenter, re-deploying hosts from scratch is slow and time-consuming, so we can't make a new token on every startup, whereas in EC2 re-deploying is the default and we can issue a distinct token to each instance.  I think we have a good thing going in the AWS provisioner, and we should probably not try to re-solve that problem right now.

I tend to agree about "hardware" in the name.  Maybe "host-credentials" was better.

I don't think this service covers the gpg key generation - at least not directly.  Perhaps it provides access to some required secrets (maybe an ssh key that can push to this git repo..) in the secrets API.

Also, agreed wrt use of 169.254.169.254.  That will be problematic in the datacenter (requiring some tricky networking config on the server and special-casing on the routers) and impossible in Amazon (where EC2 already provides a service at that IP).  I see a fixed IP like that as being useful for early system startup, where the system is getting its first taste of its unique data.  This service, on the other hand, comes later in the startup process and clients can use a regular old domain name to find it.
Flags: needinfo?(dustin)
I probably wasn't clear enough. My thinking is to have two services: one is the host-credentials/tc-secrets-proxy, providing access to taskcluster secrets, and the other is a separate service providing metadata in the dc. There's no need for the second (metadata) service in ec2, where it's provided already.
And yes, the use of the ip 169.254.169.254 in the dc is necessarily complicated, but it's intentional, since that address is conventionally used to provide just this sort of metadata.
johnb: It was suggested that you might be the person to talk to regarding use of the 169.254.169.254 ip address within the datacenter. Refer to the discussion above starting at comment 8.

Would using that address be feasible? If so, are there any pitfalls you see in using it?
Flags: needinfo?(jbircher)
It looks like we are going to move away from trying to use the 169 address.
Flags: needinfo?(jbircher)
I've made a few changes to the tool I put in the comments earlier.  I don't think setting up a proxy is how we'd want to do this; rather, I think a good approach is to use this service to generate a set of taskcluster credentials which can be used with the taskcluster-client libraries (js, python, go and java) to interact directly with taskcluster-secrets to obtain secrets.  Taskcluster-secrets could be used to store whichever secrets are involved in starting up the machine and getting it into production.
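
On the host side, using the issued credentials would then look roughly like this (a sketch; the secret name is made up):

  let taskcluster = require('taskcluster-client');

  // `credentials` is the {clientId, accessToken, certificate} blob returned
  // by this service after it has validated the host's reverse DNS.
  async function fetchHostSecret(credentials, secretName) {
    let secrets = new taskcluster.Secrets({credentials});
    let {secret} = await secrets.get(secretName);  // e.g. 'project/releng/hardware/livelog'
    return secret;
  }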

Rob, regarding matching the cloud-setup, I think the issue is that we don't want to duplicate what's done in the cloud.  I think what we really want is to move the cloud to using basically what we have here.  In the cloud, the work that this tool does would be implemented inside of the provisioner and in the future, we'd remove the secrets storage from the provisioner altogether.

I think that the metadata service is a cool idea, but should be unrelated to the auth/scopes/credentials issue.  The concern I have is that it's an unauthenticated service in EC2, and we definitely don't want to have an unauthenticated endpoint that supplies credentials.

Dustin, do you think you could open a PR implementing that check in the tool?  https://github.com/jhford/taskcluster-hardware-secrets/blob/master/src/api.js#L14-L16 is where I currently have stub code.  I'm not sure that we care to have an IP whitelist, but it might be nice as an extra layer.  I'm also not sure of the format for that -- whether it'd be simple wildcard matching or something smarter based on CIDR. The https://www.npmjs.com/package/ip library looks pretty neat for doing these checks.  If you don't mind taking a quick look at the tool in general, I'd appreciate that.
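
If we do add that layer, a CIDR-based check with the ip package could be about this simple (a sketch; treating `allowed` as an array of CIDR strings is just one possible format):

  const ip = require('ip');

  function isIpAllowed(addr, allowed) {
    // true if the address falls inside any of the allowed CIDR ranges
    return allowed.some(cidr => ip.cidrSubnet(cidr).contains(addr));
  }

  // e.g. isIpAllowed('10.26.48.10', ['10.26.0.0/16', '10.132.0.0/16']) === true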

As things are currently implemented, I issue scopes for all DNS reverse resolution records, but based on the snippet, I should only do it if there's a single record.  I also haven't written tests yet, but that shouldn't be too much work.
Flags: needinfo?(rthijssen)
Flags: needinfo?(jhford)
Flags: needinfo?(dustin)
I'll work with whatever you build. My initial motive for the metadata service evaporates if we don't need to supply a token/secret to your service but can rely on just the reverse DNS for auth. So when it's ready, we'll take the service for a spin and see how it goes.
Flags: needinfo?(rthijssen)
Attached file pr 1 - ip checking
John, there was no way to flag you on your own repo for this PR, so I figured I would add the flag here for it.  Mind giving it a gander?  Thanks!
Attachment #8838596 - Flags: review?(jhford)
Comment on attachment 8838596 [details] [review]
pr 1 - ip checking

Merged.

The only thing left here is the IP checking, should we choose to do that, and some unit tests of the API itself.  I'll work on that this week, and then coordinate with Rob to make sure it's working correctly.
Attachment #8838596 - Flags: review?(jhford) → review+
Attached file PR 2
Here's another PR to add a few features to the service
Attachment #8839230 - Flags: review?(dustin)
Attachment #8839230 - Flags: feedback?(rthijssen)
Attachment #8839230 - Flags: review?(dustin) → review+
We had an irc conversation a few hours ago regarding how to prevent a task running on a host from getting access to that host's credentials.

The risk from such an attack is that the credentials give access to a number of secrets that are shared with lots of other things, notably the *.taskcluster-worker.net TLS key, relengApi proxy token, etc.  Sort of "medium" risk.

Some fixes we considered:

 * Share a single token among all hosts, and pass that along with the credentials request
   * Protect that token with filesystem permissions
   * Install that token with Puppet/OCC on startup, and delete it as soon as it is used
   * Generate a JWT with Puppet/OCC that carries an expiration time 10 minutes or so in the future
     (it's not clear this is possible with OCC; a rough sketch of the check follows this list)
 * Configure the host firewall to prevent outbound access to the credentials service after credentials are fetched
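
For the JWT option, the flow might look roughly like this (a sketch assuming the jsonwebtoken npm package and a shared secret distributed by Puppet/OCC; names and paths are illustrative):

  const jwt = require('jsonwebtoken');

  // illustrative only; the shared secret would come from Puppet/OCC
  const SHARED_SECRET = process.env.HOST_SECRETS_SHARED_TOKEN;

  // on the host, at startup (Puppet/OCC would do the equivalent):
  let token = jwt.sign({host: 'b-3-win2012-0010.build.releng.scl3.mozilla.com'},
                       SHARED_SECRET, {expiresIn: '10m'});

  // on the service, before issuing credentials; jwt.verify() throws if the
  // token is expired or tampered with, so a task that digs the token out
  // of the filesystem later can't replay it.
  function checkToken(token) {
    return jwt.verify(token, SHARED_SECRET);
  }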

We also noted that in cases where taskcluster-worker is running as the same user as the task (which is the case on OS X and probably all platforms), all the protections in the world can't stop a task from reading /proc/$PPID/mem and looking for the credentials there, at least not without preventing crashreporter from working or tests from running.

So, maybe we just have to live with this risk?  I feel like we should pull in some more opinions on the topic..
Depends on: 1341654
Comment on attachment 8839230 [details] [review]
PR 2

working to deploy on puppet servers - bug 1341654
Attachment #8839230 - Flags: feedback?(rthijssen) → feedback+
I think we're agreed we need to live with the risk.

We also noted that we should be using HTTPS, bug 1342112.

At any rate, this is implemented now, but seems to be turning into a tracker.
Summary: create an auth service to issue temporary scopes to hardware taskcluster workers → [tracker] create and deploy host-secrets service
Depends on: 1342112
Depends on: 1342256
Depends on: 1342257
Depends on: 1342263
Depends on: 1351241
Depends on: 1355188
Depends on: 1355838
Depends on: 1355859
Assignee: rthijssen → dustin
No longer depends on: 1355859
This is deployed.  We're not shipping taskcluster-worker in the DC yet, so it isn't used yet, but it's working.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED