Closed Bug 1132355 Opened 9 years ago Closed 9 years ago

docker-worker: Serve livelogs over HTTPS using custom DNS server

Categories

(Taskcluster :: Workers, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Unassigned)

Details

Attachments

(1 file)

52 bytes, text/x-github-pull-request
jonasfj: review+
It would be really nice if we could serve live logs over HTTPS.

HTTPS support would also be nice when we start playing with interactive tasks that expose websockets and whatnot... Though it might require a reverse proxy that we can hide behind a feature flag.

Implementation is fairly simple:
  We set up a custom DNS server that, when queried for:
    ec2-54-213-239-253.us-west-2.ec2.taskcluster.net
returns a CNAME record pointing to:
    ec2-54-213-239-253.us-west-2.compute.amazonaws.com

Then we configure docker-worker with the SSL certificate for:
    *.ec2.taskcluster.net
(I haven't tried it, but I've heard that something like sub-certificates is possible)


Implementing the DNS server is probably not so hard, just use something like:
   https://www.npmjs.com/package/node-named
Or one of the other modules.
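
For illustration, a rough sketch of what the query handler could look like
(untested; the exact node-named API calls are assumptions based on its README,
and the health check mentioned in the notes below is left as a TODO):

  var named = require('node-named');
  var server = named.createServer();

  server.on('query', function(query) {
    var name = query.name();  // e.g. ec2-54-213-239-253.us-west-2.ec2.taskcluster.net
    // Rewrite back to the EC2 public hostname
    var target = name.replace('.ec2.taskcluster.net', '.compute.amazonaws.com');
    // TODO: ping a health check end-point on the worker before answering
    query.addAnswer(name, new named.CNAMERecord(target), 300);  // 5 min TTL
    server.send(query);
  });

  server.listen(53, '0.0.0.0', function() {
    console.log('DNS server listening on port 53');
  });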

A few notes:
 - We should ping a health check end-point on the worker before we answer the DNS query.
 - We should limit DNS records to a 5 minute lifetime (TTL).
 - When implementing exposed ports, we should make a reverse proxy that allows serving
   dynamic content from the task over HTTPS, without exposing the SSL certificate to
   the task container.

Security considerations:
 - If always accessed over HTTPS, then there is nothing to fear.
 - If accessed over HTTP, there is a risk 

Remark:
It might also be possible to do this with dynamic DNS. We can likely find a hosted dynamic DNS service; but these seem expensive and likely require preconfigured hostnames.
If they had an API we could just use a uuid as the subdomain and register it.

But one of the nice things about the custom DNS server hack is that we return a
CNAME, so there is a pretty good chance the target is already cached. And we don't have to
do additional setup on docker-worker, i.e. inform a dynamic DNS server about our IP
when we launch an EC2 instance.
Granted, a custom DNS server requires UDP access, so we have to run it on EC2 directly.
Though we might try using dotCloud or one of the other docker providers, like
tutum.co (which is bring-your-own AWS/Azure/DigitalOcean subscription).

---
Anyways, I think this could totally work, and it doesn't seem impossibly hard to set up and test. Though I'm not sure it's possible to delegate a full subdomain like ec2.taskcluster.net to a single DNS server. If not, we can just buy a domain like taskcluster-worker.net
and make our custom DNS server authoritative for that domain.
(Hmm... I probably ought to take the time to figure out exactly how DNS records work)
One of these days I'm really going to learn DNS...

Anyways, we don't need a custom DNS server, or dynamic DNS...

We just need a DNAME record, as follows:
ec2.taskcluster.net IN DNAME compute.amazonaws.com

And then put the SSL certificate for *.ec2.taskcluster.net into docker-worker and everybody wins.

Ignore the title but see: http://blacka.com/david/2006/12/04/dns-dname-is-almost-useless/
(just read the first part)

I'm not sure DNAME is widely supported, but if it's a problem we can find a DNS server that emulates it.
Which is exactly what I described above.

Note, this will mean that anyone launching an EC2 node can host an HTTP server with a hostname postfixed:
   .ec2.taskcluster.net
I.e. we cannot trust *.ec2.taskcluster.net domains. Well, unless they are over HTTPS, in which case
everything is awesome. And since mixed content errors will protect us from accidentally accessing over HTTP,
this should work well :)
Remark, we can probably use:
https://github.com/jwilder/nginx-proxy

To reverse proxy behind the DNS names... then we just give the docker containers these environment variables:
  VIRTUAL_HOST = <prefix-from-hostname>.ec2.taskcluster.net
  VIRTUAL_PORT = <random port>

They can probably proxy our existing live log stuff too. No need to reinvent any wheels.
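
Roughly, the setup could look like this (a sketch; the port and mount paths are made up for illustration, see the nginx-proxy README for the exact invocation):

  $ docker run -d -p 80:80 -p 443:443 \
      -v /var/run/docker.sock:/tmp/docker.sock:ro \
      -v /path/to/certs:/etc/nginx/certs:ro \
      jwilder/nginx-proxy

  $ docker run -d \
      -e VIRTUAL_HOST=<prefix-from-hostname>.ec2.taskcluster.net \
      -e VIRTUAL_PORT=60023 \
      <livelog-image>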
Blocked on bug 1133131, figuring out if Mozilla supports DNAME records.
If not we can probably find somewhere else to host DNS... It seems pretty simple.
Depends on: 1133131
So I deployed bind9 on tutum.co and got a DNAME record working perfectly fine... Well, except for the OpenDNS server closest to me... Presumably I gave that one a bad record or something :)

Anyways, we just need to figure out a good domain name, buy an SSL certificate, and then I can deploy a custom DNS server for it...
No longer depends on: 1133131
I deployed bind9 servers and filed:

bug 1133994 for registration of the domain name,
bug 1134000 for the SSL certificate.
webops had issues making the SSL certificate we needed in order to use a DNAME record.
So I deployed a custom DNS server.

The names will now have to be of the form:
ec2-52-11-22-87-dot-us-west-2-ec2.taskcluster-worker.net

Basically, take the hostname, replace ".compute.amazonaws.com" with "-ec2.taskcluster-worker.net".
And replace any "." before that with "-dot-".

Good news is that we'll only need a single wildcard certificate.
Summary: docker-worker: Serve livelogs over HTTPS → docker-worker: Serve livelogs over HTTPS using custom DNS server
I got the cert!!! :)

We can start playing with:
  https://github.com/jwilder/nginx-proxy

As far as I can tell, we just need to give it the certs and set the environment variables:
  - VIRTUAL_HOST
  - VIRTUAL_PORT
on the livelog serving container.

Then the nginx-proxy will reload when containers are started and stopped.
Granted, we'll have to forbid others from setting the VIRTUAL_HOST and VIRTUAL_PORT variables.
We can also change the env var names read by nginx-proxy and modify the docker image.
Maybe TASKCLUSTER_VIRTUAL_HOST and TASKCLUSTER_VIRTUAL_PORT are better.
IMO we should also extend task.payload with:
task.payload = {
  image: ...
  ... // the usual stuff
  exposeWebInterface: true | false
  // or something like that, such that when exposeWebInterface === true, we
  // require the scope docker-worker:expose-https
  // and defined env vars:
  // TASKCLUSTER_VIRTUAL_HOST and TASKCLUSTER_VIRTUAL_PORT
  // which will auto-link with nginx-proxy and inject PORT and hostname into the container.
}

Note, for now we should probably just decide what env var names to use,
or whether we want to inject config into nginx-proxy by other means, for example a file.

Anyways, I noticed a minor issue with the us-east-1 nodes not being under: .compute.amazonaws.com
I'll update the custom DNS server to be more flexible and update here with what the
pattern for rewriting ec2 hostnames to taskcluster-worker.net hostnames should be.
Looking at jwilder/nginx-proxy it seems we don't have to use the magic env vars.
We just mount nginx configs from host into the official nginx image at:
  /etc/nginx/conf.d
And certificates at:
  /etc/nginx/certs

And whenever we update the nginx config, because we've added a new livelog container
that needs a reverse HTTPS proxy, we make nginx reload its config with:
  $ docker kill -s HUP <nginx-container-id>

---
Okay, thinking about this I'm not sure it'll work, because we need to --link things correctly.
And in this case I think it's the nginx container that has to be the client of the --link operation.
So maybe we need to bake the HTTPS part into the livelog hosting image:
  http://golang.org/pkg/net/http/#ListenAndServeTLS

That might actually be the easiest, then we can play with nginx when we want to expose
ports from task containers to the world over HTTPS.
FYI, the livelog thing lives at:
https://github.com/taskcluster/livelog

We just need to inject:
 - cert
 - key

And probably a token through an env var to fix bug 1132221.
(We might as well do that when fixing this).
I updated our custom DNS server to serve the following mapping:
  -ec2.taskcluster-worker.net  =>  .amazonaws.com

Meaning that <prefix>-ec2.taskcluster-worker.net is CNAMEd to <prefix-with-dots>.amazonaws.com,
  where <prefix-with-dots> = <prefix>.replace(/-dot-/g, '.'),
  i.e. all instances of '-dot-' replaced with '.'

So a node with a hostname like:
  ec2-54-213-239-253.us-west-2.compute.amazonaws.com
can also be accessed using the hostname:
  ec2-54-213-239-253-dot-us-west-2-dot-compute-ec2.taskcluster-worker.net
(for which we have an SSL certificate)
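
In code, the rewrite in both directions is roughly (an illustrative sketch, not the deployed implementation):

  // Worker hostname from the EC2 public hostname
  function workerHostname(ec2Hostname) {
    var prefix = ec2Hostname.replace('.amazonaws.com', '').replace(/\./g, '-dot-');
    return prefix + '-ec2.taskcluster-worker.net';
  }

  // Inverse mapping, as done by the DNS server when answering with a CNAME
  function ec2Hostname(workerHostname) {
    var prefix = workerHostname.replace('-ec2.taskcluster-worker.net', '');
    return prefix.replace(/-dot-/g, '.') + '.amazonaws.com';
  }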
So the CNAME pattern described in Comment 10 was deployed yesterday.

But I have this lovely/crazy idea. The thing that is a bit sketchy with the custom DNS server
is that any EC2 node will be able to serve content under a CNAME of the form:
  ...-ec2.taskcluster-worker.net

But we could actually avoid this, with some clever hacks and an HMAC signature in the domain name.
All EC2 nodes have a public IP address, all livelog URLs have an expiration (maxRuntime), and
we can embed a shared secret in docker-worker and our custom DNS server.

So the <prefix> before -ec2.taskcluster-worker.net could be constructed as follows:
  base32(<ip> + <timestamp> + <HMAC-256(ip + timestamp, secret)>) + '-ec2.taskcluster-worker.net'

When the DNS server gets a request for this kind of domain, it would then extract the
IP, timestamp and HMAC, validate the HMAC, and return an A record for the IP if the
signature is valid.

With a 4 byte IP, an 8 byte timestamp and a 32 byte HMAC signature, the domain names would be of the form:
aaaaaaaaaabbbbbbbbbbaaaaaaaaaabbbbbbbbbbaaaaaaaaaabbbbbbbbbbaaaaaaaaaac-ec2.taskcluster-worker.net
And if we truncate the HMAC signature to 16 bytes (128 bits), we get an even shorter domain name:
aaaaaaaaaabbbbbbbbbbaaaaaaaaaabbbbbbbbbbccccc-ec2.taskcluster-worker.net
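
A rough sketch of how such a name could be generated (illustrative only; the base32 module and exact byte layout are assumptions, not the deployed code):

  var crypto = require('crypto');
  var base32 = require('thirty-two');  // any base32 encoder would do

  function signedHostname(publicIp, expires, secret) {
    var ipBuf   = Buffer.from(publicIp.split('.').map(Number));  // 4 byte IP
    var timeBuf = Buffer.alloc(8);                               // 8 byte timestamp
    timeBuf.writeUIntBE(expires, 2, 6);                          // ms since epoch fits in 6 bytes

    var hmac = crypto.createHmac('sha256', secret)
                     .update(Buffer.concat([ipBuf, timeBuf]))
                     .digest()
                     .slice(0, 16);                              // truncate to 128 bits

    var label = base32.encode(Buffer.concat([ipBuf, timeBuf, hmac]))
                      .toString().toLowerCase().replace(/=+$/, '');
    return label + '-ec2.taskcluster-worker.net';
  }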

The timestamp is obviously the expiration of the livelog, such that the DNS record isn't worth much after
our spot instance goes away. We could also tie the timestamp to the expiration of the worker node, set
it 72 hours into the future, and have the aws-provisioner generate the domain name, such that the
HMAC secret only has to live in the AWS provisioner and the custom DNS server. Otherwise it would be enough
to compromise docker-worker to get both the HMAC secret and the SSL certificates.

Note: truncation of HMAC-256 to 128 bits **might** be reasonable. It's hard to find a source who
dares to claim it is perfectly okay. But various Google searches suggest that it's been done before :)

I'm not sure it's worth guarding against this. It's not like it's a huge security issue if someone
else starts hosting HTTP content under -ec2.taskcluster-worker.net. But it would be nice if others
didn't host under this domain.

Note, we could also do it without any expiration or signature, and then have a very short
domain name for workers, because we only need to embed the public IP and not the EC2 hostname:
  aaaaaaa-ec2.taskcluster-worker.net
where aaaaaaa encodes the public IP in base32. Note this would be less secure than the current
approach, which is already questionable but at least bound to nodes with a hostname under amazonaws.com.

---
Anyways, just another crazy thought/suggestion. Let me know what people think.
So we have deployed: https://github.com/taskcluster/stateless-dns-server
(Also published to npm; it contains the logic to generate the hostnames)
And I have updated livelog: https://quay.io/repository/mozilla/livelog

garndt has the secret for the DNS server.
Support for secret accessToken (bug 1132221) will land along with this.
I have the SSL certificate and key; I'll email garndt so he also has them.

We'll have to volume mount the certificate and key file. Then set env vars:
 * ACCESS_TOKEN      secret access token required for access (required)
 * SERVER_CRT_FILE   path to SSL certificate file (optional)
 * SERVER_KEY_FILE   path to SSL private key file (optional)
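
So something along these lines (a sketch; the host paths and the container port are made up for illustration):

  $ docker run -d \
      -v /etc/livelog-certs/server.crt:/certs/server.crt:ro \
      -v /etc/livelog-certs/server.key:/certs/server.key:ro \
      -e ACCESS_TOKEN=<secret-token> \
      -e SERVER_CRT_FILE=/certs/server.crt \
      -e SERVER_KEY_FILE=/certs/server.key \
      -p 443:<livelog-port> \
      quay.io/mozilla/livelog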
Attached file Worker PR 88
This implements both secure logging as well as a token appended to the log URL for obscurity.
Attachment #8607302 - Flags: review?(jopsen)
Comment on attachment 8607302 [details] [review]
Worker PR 88

Yay, HTTPS livelogs...
Attachment #8607302 - Flags: review?(jopsen) → review+
https://github.com/taskcluster/docker-worker/commit/a824b35490aeff0533c0b9fc178f3dc78ae05632
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
Component: Docker-Worker → Workers