Use TLS for the connections to Memcachier on Heroku

RESOLVED FIXED

Status

Tree Management
Treeherder: Infrastructure
P1
normal
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: emorley, Assigned: emorley)

Tracking

(Blocks: 1 bug)

On Heroku there are two memcache add-ons:
* Memcachier
* Memcached Cloud

These both make use of the relatively new memcache binary protocol's SASL support:
https://github.com/memcached/memcached/wiki/BinaryProtocolRevamped
https://github.com/memcached/memcached/wiki/SASLAuthProtocol
https://github.com/memcached/memcached/wiki/SASLHowto

Unfortunately, for some reason none of the Memcachier, memcached, pylibmc or python-binary-memcached docs mention one crucial fact:
-> That even with SASL, both the authentication exchange and the subsequent memcache traffic are sent in plaintext.

Worse, the memcached wiki pages for SASL state:

> In order to use memcached in a hostile network (e.g. a cloudy ISP where the infrastructure is shared and you can't control it), you're going to want some kind of way to keep people from messing with your cache servers.
> 
> SASL (as described in RFC2222) is a standard for adding authentication mechanisms to protocols in a way that is protocol independent.

And:

> Most deployments of memcached today exist within trusted networks where clients may freely connect to any server and the servers don't discriminate against them.
> 
> There are cases, however, where memcached is deployed in untrusted networks or where administrators would like to exercise a bit more control over the clients that are connecting.

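However, these statements are misleading: SASL only authenticates the client and does not encrypt anything, so on its own it does not make memcached safe to use on a hostile or untrusted network.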

I only noticed after spotting this undocumented (and not very well maintained) repository by Memcachier:
https://github.com/memcachier/memcachier-tls-buildpack
...which adds an stunnel daemon to every dyno.

As such:
* I requested memcachier update their docs to actually mention the existence of the buildpack (https://github.com/memcachier/docs/issues/10)
* I've fixed the instructions for the buildpack, since they were still using the legacy multi-buildpack addon (https://github.com/memcachier/memcachier-tls-buildpack/pull/5)
* I've asked the memcached project to fix the incorrect SASL wiki docs (https://github.com/memcached/memcached/issues/184)
* I've also filed an issue against the other major cloud memcached provider to get them to document stunnel/SSL too (https://github.com/RedisLabs/rldocs/issues/4)

As for us, we can't use a VPN/VPC on Heroku's 'common runtime', so our options are:

1) Use Memcachier's rather old buildpack that has a number of issues (eg stunnel is run as a daemon rather than a wrapper script for the process being run, so if it dies you'll get connection errors and Heroku won't know to do an auto dyno restart)

2) Add TLS support to a Python memcache client ourselves using `ssl.wrap_socket()` (or more likely `ssl.create_default_context()`, newly added in Python 2.7.9), and get it to connect to Memcachier just like stunnel would have done, similar to the approach Redis Labs took for a handful of Redis clients here:
https://redislabs.com/blog/secure-redis-ssl-added-to-redsmin-and-clients
eg: https://github.com/andymccurdy/redis-py/pull/446/files

I'm leaning towards #2.

This will also mean switching to the pure Python python-binary-memcached, since pylibmc's connection handling is performed by libmemcached, and I'm not touching that (it's C++, so TLS would have to be implemented by hand using OpenSSL, plus the maintainer isn't very responsive).
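For illustration, option 2 boils down to wrapping the client's TCP socket in TLS before any memcache traffic is sent. The following is a minimal sketch only, not the POC mentioned in this bug: the function names `make_tls_context` and `open_tls_connection` are hypothetical, and how the resulting socket would be plugged into a particular memcache client is left out.

```python
import socket
import ssl


def make_tls_context():
    # ssl.create_default_context() (Python 2.7.9+ / 3.4+) returns a context
    # with certificate verification and hostname checking enabled.
    return ssl.create_default_context()


def open_tls_connection(host, port, timeout=5.0):
    """Open a TCP connection to host:port and wrap it in TLS.

    The returned socket is read and written like a plain socket, so the
    code that speaks the memcache binary protocol would not need to change.
    """
    context = make_tls_context()
    raw_sock = socket.create_connection((host, port), timeout=timeout)
    try:
        # server_hostname is needed for hostname verification (and SNI).
        return context.wrap_socket(raw_sock, server_hostname=host)
    except Exception:
        raw_sock.close()
        raise
```

The host and port passed in would be the TLS endpoint that stunnel would otherwise connect to; those values are not assumed here.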
Blocks: 1277304
So for #2, whilst adding support to a Python client is simple (I have a POC working locally), I've come across a few other issues:

* The only pure Python client that supports the binary protocol (required for username/password auth) is python-binary-memcached (https://github.com/jaysonsantos/python-binary-memcached), which doesn't use consistent hashing, so has to do up to N requests for every lookup (https://github.com/jaysonsantos/python-binary-memcached/issues/14), which isn't great for performance. This was the client I was hoping to use.

* There is another more-maintained pure Python client (https://github.com/pinterest/pymemcache), however it doesn't yet support the binary protocol (https://github.com/pinterest/pymemcache/issues/54) so can't be used at all.
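To make the consistent hashing point concrete: with a hash ring, the client can work out from the key alone which single server to ask, rather than trying servers in turn. The sketch below is a generic illustration under that assumption. The `HashRing` class is hypothetical and is not code from python-binary-memcached, pymemcache or pylibmc.

```python
import bisect
import hashlib


class HashRing(object):
    """Map each key to exactly one server using a simple hash ring."""

    def __init__(self, servers, replicas=100):
        if not servers:
            raise ValueError("at least one server is required")
        # Place several points per server on the ring so that keys are
        # spread fairly evenly between servers.
        self._ring = sorted(
            (self._hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def server_for(self, key):
        # Pick the first ring point after the key's hash, wrapping around.
        index = bisect.bisect(self._points, self._hash(key))
        if index == len(self._points):
            index = 0
        return self._ring[index][1]
```

With this scheme a lookup is one request to `ring.server_for(key)`, and removing a server only remaps the keys that lived on that server.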

As for our current C-backed client (https://github.com/lericson/pylibmc), adding TLS support isn't practical, since the connection handling is not done in pylibmc but in the C library it uses (libmemcached). That library is not particularly well maintained, and changing it would mean straying into low-level territory, which is not ideal when "rolling your own TLS support".

As such, I think the best way forwards for now is just to use the stunnel buildpack after all, since:
(a) we can try and improve it, to remove some of the issues mentioned in comment 0
(b) it means we can stick with the more performant C-backed pylibmc
(c) it means fewer changes in Treeherder/Django code, since the app itself doesn't need to be aware of the tunnel/we can stick with the same client we're using now
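One of the issues from comment 0 is that stunnel runs as a detached daemon, so nothing notices if it dies. A rough sketch of the wrapper approach follows, assuming the tunnel and the app can both be started as foreground child processes. The function `run_with_helper` is hypothetical and is not part of the buildpack.

```python
import subprocess
import time


def run_with_helper(helper_cmd, main_cmd, poll_interval=0.1):
    """Run main_cmd alongside helper_cmd and stop both when either exits.

    Returns the main process's exit code, or 1 if the helper exited first.
    """
    helper = subprocess.Popen(helper_cmd)
    main = subprocess.Popen(main_cmd)
    try:
        while True:
            main_code = main.poll()
            if main_code is not None:
                return main_code
            if helper.poll() is not None:
                # The tunnel has gone away: fail loudly rather than keep
                # serving requests that can no longer reach the cache.
                return 1
            time.sleep(poll_interval)
    finally:
        # Whichever process is still running gets shut down too.
        for proc in (main, helper):
            if proc.poll() is None:
                proc.terminate()
                proc.wait()
```

If the wrapper's return value is used as the dyno process's exit status, a dead tunnel would show up to Heroku as a crashed process instead of as silent connection errors.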
Going to fix this issue prior to using the buildpack however:
https://github.com/memcachier/memcachier-tls-buildpack/issues/8
Created attachment 8784431 [details] [review]
memcachier-tls-buildpack: Use a separate stunnel config entry for each server
I've added the PR branch to the buildpacks for prototype/stage/prod:
https://github.com/edmorley/memcachier-tls-buildpack.git#fix-multiple-servers
Created attachment 8786764 [details] [review]
[treeherder] mozilla:rm-memcachify > mozilla:master
Attachment #8786764 - Flags: review?(wlachance)
Attachment #8786764 - Flags: review?(wlachance) → review+

Comment 6

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/8701b64f5786e3a5fccd60c40af36701dab900dd
Bug 1291307 - Stop using django-heroku-memcacheify

Since it clobbers the settings we need to use on Heroku for making
memcached connections via stunnel. In addition, it made the cache
configuration very opaque, and doesn't use the latest best practices
suggested on:
https://www.memcachier.com/documentation#django
(In reply to Ed Morley [:emorley] from comment #2)
> Going to fix this issue prior to using the buildpack however:
> https://github.com/memcachier/memcachier-tls-buildpack/issues/8

This turns out to not be an issue due to the custom memcached server implementation that Memcachier are using (see https://github.com/memcachier/memcachier-tls-buildpack/issues/8#issuecomment-242148240), so I've reverted back to the stock buildpack, eg:
`heroku buildpacks:set -i 1 https://github.com/memcachier/memcachier-tls-buildpack.git#6ca0ea51b98ab0ee76f24c1f25fe8c5a66e75db9`

I also:
`heroku config:set TREEHERDER_MEMCACHED="localhost:11211,localhost:11211"`

...since there needs to be a 1:1 mapping between the number of Memcachier nodes and the number of nodes Django sees (so that the round-robin on connection issues works).
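As a sketch of how such a comma-separated value could end up in Django's cache settings (how Treeherder's settings actually read `TREEHERDER_MEMCACHED` is not shown in this bug, so the helper `build_caches` is an assumption for illustration):

```python
import os


def build_caches(env=os.environ):
    """Build a Django CACHES dict from a comma-separated server list."""
    raw = env.get("TREEHERDER_MEMCACHED", "")
    # Duplicates are kept on purpose: listing localhost:11211 twice is
    # what gives Django the same number of nodes as Memcachier has.
    servers = [s.strip() for s in raw.split(",") if s.strip()]
    return {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.PyLibMCCache",
            "LOCATION": servers,
        }
    }
```

In a settings module this would be used as `CACHES = build_caches()`.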

After these changes, there are now New Relic errors, since apparently the Memcachier cert isn't validating.

Have filed:
https://github.com/memcachier/memcachier-tls-buildpack/issues/10

Plus another issue to save the manual steps that I had to perform to view the errors:
https://github.com/memcachier/memcachier-tls-buildpack/issues/12
Sigh, Memcachier's stunnel certs expired April 2015:

$ openssl s_client -showcerts -connect XXXXX.XXXXX.us-east-5.heroku.prod.memcachier.com:11219 < /dev/null 2> /dev/null | openssl x509 -noout -enddate
notAfter=Apr 12 06:36:53 2015 GMT

This must mean no-one else is actually using the stunnel?!
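The same expiry check can be scripted. A minimal sketch, assuming the `notAfter` string has already been obtained (for example from the openssl output above); the helper name `cert_is_expired` is hypothetical:

```python
import ssl
import time


def cert_is_expired(not_after, now=None):
    """Return True if a certificate's notAfter time is in the past.

    not_after uses the format printed by openssl, for example
    "Apr 12 06:36:53 2015 GMT". now is seconds since the epoch and
    defaults to the current time.
    """
    if now is None:
        now = time.time()
    # ssl.cert_time_to_seconds() parses this date format into epoch seconds.
    return ssl.cert_time_to_seconds(not_after) < now
```

Calling `cert_is_expired("Apr 12 06:36:53 2015 GMT")` is one way such a check could be run from monitoring rather than by hand.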
Blocks: 1300082
No response from the GitHub issue, I've filed a support ticket to escalate:
https://memcachier.zendesk.com/requests/1661
I've temporarily disabled cert verification for the stunnel, so we can at least see how reliable it is, using:
https://github.com/edmorley/memcachier-tls-buildpack.git#disable-verify

...and then deployed heroku-{prototype,stage}
Looking at New Relic so far, using stunnel results in an approx 30% increase in memcached request time, but I guess there's not much we can do about that.
The response on the Memcachier zendesk ticket:

"""
Sorry for the late response, we obviously needed to get our ducks in a row regarding the certificates.

I just renewed them all last night.

In practice very few customers use SSL. I think it's a very important concern on AWS (especially connecting across datacenters as you must with Heroku apps), but I suspect the combination of not having native memcache support in clients and some added latency, it's not something people are doing.

However, it's probably something worth pushing, so if you're going to use it, we'd appreciate you nit pick both security and usability issues you run into.
"""

The certs are now valid:

$ openssl s_client -showcerts -connect XXXXX.XXXXX.us-east-5.heroku.prod.memcachier.com:11219 < /dev/null 2> /dev/null | openssl x509 -noout -enddate
notAfter=Sep  7 01:20:05 2017 GMT

I've switched the buildpack back to https://github.com/memcachier/memcachier-tls-buildpack.git#6ca0ea51b98ab0ee76f24c1f25fe8c5a66e75db9 , so certificate verification is re-enabled.

All looks good.
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED