Closed Bug 1384518 Opened 7 years ago Closed 7 years ago

Switch from Memcached to Redis

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(3 files)

For caching with Django we currently use memcached - or more specifically the pylibmc Python package, which uses libmemcached underneath. However there are quite a few issues:

* pylibmc doesn't support SSL natively (nor do any of the other libraries), so we have to use the memcachier-tls-buildpack, which has a bunch of issues (see bug 1300082)
* libmemcached is pretty buggy - eg:
  - https://github.com/lericson/pylibmc/issues/202
  - https://github.com/lericson/pylibmc/issues/218
  - https://github.com/lericson/pylibmc/issues/219
* memcached isn't offered natively by Heroku, so we have to use a third-party offering (Memcachier), which is yet another external dependency
* Redis seems to be faster, more modern and generally more highly recommended:
  https://www.peterbe.com/plog/fastest-cache-backend-possible-for-django

As such I think we should switch to Redis. Further reference if we do so:
https://www.peterbe.com/plog/fastest-redis-optimization-for-django
Continuing on from bug 1300082 comment 2:

> I've been able to test Heroku Redis' TLS (since it comes for free with any
> prod plans). It currently uses self-signed certs, REDIS_URL could be
> improved, and the docs were missing some details, so I've filed:
> https://help.heroku.com/tickets/396267
> https://help.heroku.com/tickets/396276
> https://help.heroku.com/tickets/396297
> https://github.com/heroku/heroku-buildpack-redis/issues/15
> https://github.com/heroku/heroku-buildpack-redis/issues/16

I've filed another Heroku ticket asking what updates they have a year on:
https://help.heroku.com/tickets/511098
(In reply to Ed Morley [:emorley] from comment #1)
> I've filed another Heroku ticket asking what updates they have a year on:
> https://help.heroku.com/tickets/511098

"""
Hi Ed,

Unfortunately none of these have been implemented. There have been discussions around moving away from self-signed certificates, but no action has been taken at this stage.
"""
Component: Treeherder → Treeherder: Infrastructure
Followed by:

"""
I'm afraid that this work hasn't been scheduled as of yet. There have been some changes in the product team which works on our Data services, such that this would be a good time to raise this issue again. I will definitely bring this into our next meeting, but for the time being I don't have any more useful information for you.

Apologies that I don't have any concrete answers for you on this.
"""
Next steps for this:

1) Deciding between the two Django backends (both use redis-py underneath):
https://github.com/niwinz/django-redis
https://github.com/sebleier/django-redis-cache
See:
https://stackoverflow.com/questions/21932097/difference-between-django-redis-cache-and-django-redis-for-redis-caching-with-dj
https://github.com/niwinz/django-redis/issues/217
https://github.com/sebleier/django-redis-cache/issues/121
(At first glance, django-redis seems more actively maintained.)

2) Picking between the Heroku elements addons for Redis:
https://elements.heroku.com/addons/heroku-redis
https://elements.heroku.com/addons/rediscloud
https://elements.heroku.com/addons/redisgreen
https://elements.heroku.com/addons/redistogo
Factors to consider:
* Do they support TLS? (Most don't.)
* If they are third-party (rather than Heroku's own "Heroku Redis"), are we fine with the added dependency?
(I'm strongly leaning towards Heroku Redis unless there's something drastically wrong with their offering.)

3) For configuring TLS, do we use stunnel (eg via https://github.com/heroku/heroku-buildpack-redis, which is like the memcachier one except much improved), or connect directly via redis-py's new-ish native support for TLS? My preference is very much to avoid stunnel if at all possible. See:
https://devcenter.heroku.com/articles/securing-heroku-redis
https://github.com/andymccurdy/redis-py/issues/780

4) Deciding what size plan we need (based on storage and max connection requirements). Presuming we pick Heroku Redis, we'll have to pick one of these:
https://elements.heroku.com/addons/heroku-redis
(I'm guessing "Premium 3" +/- a level.)

5) To actually implement this, I'd:
* Modify the Vagrant/Travis setup scripts to install redis rather than memcached.
* Update the requirements files to swap pylibmc and related packages for the redis alternatives.
* Adjust settings.py to make Django use the new backend and pull the necessary credentials and TLS options from the environment variables (see the rough sketch after this list, plus https://devcenter.heroku.com/articles/heroku-redis#connecting-in-django and https://devcenter.heroku.com/articles/securing-heroku-redis#connecting-directly-to-stunnel and https://github.com/andymccurdy/redis-py/issues/780).
* Double-check for Redis differences (eg do we need to set different default timeouts?).
* Make sure tests and ingestion work in Vagrant (note: iirc the tests use the in-memory cache backend) and that the tests pass on Travis.
* Add the Heroku addon to prototype and deploy the branch there, ready for PR review.
* Before merge, add the addons to stage/prod (plus rename them to follow the convention we've used for the other addons).
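Rough sketch of what the settings.py side of step 5 might look like using django-redis - this is an assumption-laden outline (the env var fallback and the TLS handling are my guesses), not the final implementation:

    # settings.py sketch (not final). Assumes REDIS_URL is provided by the
    # Heroku Redis addon in production, and points at a local redis-server
    # in Vagrant/Travis.
    import os

    REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379")

    CACHES = {
        "default": {
            "BACKEND": "django_redis.cache.RedisCache",
            "LOCATION": REDIS_URL,
            "OPTIONS": {
                "CLIENT_CLASS": "django_redis.client.DefaultClient",
            },
        }
    }

    # Heroku Redis currently uses self-signed certificates, so certificate
    # verification would need disabling when connecting over TLS.
    if REDIS_URL.startswith("rediss://"):
        CACHES["default"]["OPTIONS"]["CONNECTION_POOL_KWARGS"] = {"ssl_cert_reqs": None}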
Oh and:
* Checking what New Relic reports for cache get/set times compared to pylibmc+memcachier.
* When this bug is fixed, closing bug 1300082 and bug 1360516 as wontfix.
So I deployed my WIP to the prototype Heroku instance 25 mins ago, and the speed-up over memcached (or at least Memcachier) is pretty impressive:
https://screenshots.firefox.com/AAsK3EweR6yZOE1e/rpm.newrelic.com

So not only are we getting a more reliable/maintained/modern/feature-rich caching solution, it's also three times faster :-)
Assignee: nobody → emorley
Blocks: 1409679
Status: NEW → ASSIGNED
Priority: P3 → P1
Attachment #8942371 - Flags: review?(cdawson)
Comment on attachment 8942371 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3131

Awesome! :)
Attachment #8942371 - Flags: review?(cdawson) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f8a8188d262824c176d77820ef0b013d97afc2ed
Bug 1384518 - Tests: Clean up cache testing in test_setup.py

* Adjusts the test name and docstring, since there is nothing memcached-specific about the test.
* Overrides the default timeout of never to 10 seconds, to prevent previous test run values persisting.

https://github.com/mozilla/treeherder/commit/03027d8ba9f4f5c8a493cedb76dc911aa01782a1
Bug 1384518 - Tests: Use a cache key prefix when running tests

To prevent them clashing with keys created by running data ingestion locally in Vagrant.

https://github.com/mozilla/treeherder/commit/56697adb1f044b3446cebd6c69aa7f8094467e6e
Bug 1384518 - Vagrant: Simplify .profile helper functions

Uses our custom `clear_cache` Django management command instead of restarting memcached, since (a) for memcached it has the same effect, and (b) it will continue to work even after we switch to Redis (which instead persists cache contents across server restarts). Removes the `threstartmemcached` command since IMO it's no simpler than just running `./manage.py clear_cache` on its own.

https://github.com/mozilla/treeherder/commit/12ec94eb6a7506a92f7a97cd1e95935eba62dfb3
Bug 1384518 - Move server_supports_tls() out of settings.py

Since we'll soon have more settings utility functions to go alongside it. Also renames it to `connection_should_use_tls()`.

https://github.com/mozilla/treeherder/commit/525eaeb1c1a8bf903201b2c40d869516406a1c11
Bug 1384518 - Add a utils.py helper for hostname parsing

https://github.com/mozilla/treeherder/commit/009e6e23546124a525952bcac13c2092a31054eb
Bug 1384518 - Remove explicit cache VERSION

We've never had a need to adjust it so far, and the default version is already `1`, so there's no need to manually specify it:
https://docs.djangoproject.com/en/1.11/ref/settings/#version

https://github.com/mozilla/treeherder/commit/613dfdc12339637101adaa95b88ca0db01a94958
Bug 1384518 - Factor out last push ID cache key name

Reduces duplication and the chance of the get/set key names not matching.

https://github.com/mozilla/treeherder/commit/a429764dec479140257477f6b47e10e0758ff5ad
Bug 1384518 - Use finite cache timeouts everywhere

Previously the `cache.set()` calls made by buildapi and pushlog ingestion did not pass a timeout argument, so used the `settings.py` global default of `None` (which means never expire). There are a few disadvantages with this approach:
1) The `cache.set()` calls can't be fully interpreted on their own.
2) Never expiring means stale keys could be left behind (for example the last push ID key, when a repository is removed).
3) If Django core functionality or third-party packages use the cache but don't set their own timeout, they may be expecting the Django default timeout of 5 minutes (or at least a finite timeout).
Now all cache writes in Treeherder explicitly set a timeout, and the global timeout is restored to the Django default of 5 minutes:
https://docs.djangoproject.com/en/1.11/ref/settings/#timeout
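To illustrate the finite-timeout approach from the last commit above (the key name and timeout value here are made up for the example, not the actual ones used):

    from django.core.cache import cache

    ONE_HOUR = 60 * 60

    # Every write passes an explicit timeout rather than relying on a global
    # TIMEOUT of None (never expire).
    cache.set("mozilla-inbound:last_push_id", 12345, ONE_HOUR)
    last_push_id = cache.get("mozilla-inbound:last_push_id")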
Attachment #8943343 - Flags: review?(cdawson)
Comment on attachment 8943343 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3145

Outstanding. :)
Attachment #8943343 - Flags: review?(cdawson) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/c36408c1f1fc17c483f3408adf9d28809c7d8e54
Bug 1384518 - Remove Django cache prefix

The prefix is unnecessary since prod/stage/prototype don't share cache instances, and the tests now set their own prefix, so won't overlap with values set during local data ingestion. Django's default prefix is the empty string:
https://docs.djangoproject.com/en/1.11/ref/settings/#std:setting-CACHES-KEY_PREFIX

https://github.com/mozilla/treeherder/commit/d9909d5fa2d63eda6ce0b228df04577807ff1462
Bug 1384518 - Add a Redis server instance to Vagrant/Travis

In preparation for later commits, this installs `redis-server` in the Vagrant environment, and starts the pre-installed (but stopped by default) `redis-server` on Travis:
https://docs.travis-ci.com/user/database-setup/#Redis
Production/stage/prototype will use Heroku's own Redis add-on:
https://elements.heroku.com/addons/heroku-redis

https://github.com/mozilla/treeherder/commit/deb1fc86b77c74ad3a5861139221dc57ace58704
Bug 1384518 - Switch from Memcached to Redis

Since:
* Redis has additional features we need (eg for bug 1409679)
* the Redis server, Python client and Django backend are more actively maintained (so have persistent connections/pooling that actually works, which gives a big perf win)
* we can use Heroku's own Redis addon rather than relying on a third party's (to hopefully prevent a repeat of the certificate expiration downtime)

This commit:
* Switches pylibmc/django-pylibmc to redis-py/django-redis, and configures the new backend according to:
  http://niwinz.github.io/django-redis/latest/
  https://github.com/andymccurdy/redis-py
* Uses redis-py's native support for TLS to connect to the Heroku Redis server's stunnel port directly, avoiding the complexity of using a buildpack to create a stunnel connection between the dyno and server:
  https://devcenter.heroku.com/articles/securing-heroku-redis#connecting-directly-to-stunnel
* Uses explicit `REDIS_URL` values on Travis/Vagrant rather than relying on the django-environ `default` value (for parity with how we configure `DATABASE_URL` and others).
* Removes the pylibmc connection-closing workaround from `wsgi.py`.

Note: Whilst the Heroku docs suggest using `django-redis-cache`, it's not as actively maintained, so we're using `django-redis` instead:
https://devcenter.heroku.com/articles/heroku-redis
https://github.com/niwinz/django-redis
https://github.com/sebleier/django-redis-cache

Before this is merged, Heroku Redis instances will need to be created for stage/production (prototype is done) - likely on plan `premium-3`:
https://elements.heroku.com/addons/heroku-redis
We'll also need to pick an eviction policy:
https://devcenter.heroku.com/articles/heroku-redis#maxmemory-policy

Once deployed, the `memcachier-tls-buildpack` buildpack will no longer be required and can be removed from prototype/stage/prod, along with the `TREEHERDER_MEMCACHED` environment variable and the Memcachier addon.

https://github.com/mozilla/treeherder/commit/4d45f3cf5563fae4dee2c2a03c1afe3fd7a74461
Bug 1384518 - Vagrant: Remove memcached server and libraries

Since the memcached server and the client headers are now unused. Whilst the `zlib1g-dev` package is no longer needed by pylibmc, it's still required as a sub-dependency of `libmysqlclient-dev`, so has deliberately not been added to the `apt-get purge` cleanup step.
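For reference, a rough sketch of the "connect directly to stunnel" approach described above, assuming (per the Heroku docs linked) that the stunnel listens on the plain Redis port plus one. The helper name and URL handling are illustrative only, not Treeherder's actual code:

    import os
    from urllib.parse import urlparse

    import redis

    def get_tls_redis_url(redis_url):
        # Rewrite the addon-provided redis:// URL to use TLS (rediss://) and
        # the stunnel port (assumed to be the plain port + 1).
        parts = urlparse(redis_url)
        return "rediss://:{}@{}:{}".format(parts.password, parts.hostname, parts.port + 1)

    # ssl_cert_reqs=None skips certificate verification, since Heroku Redis
    # currently uses self-signed certificates.
    client = redis.StrictRedis.from_url(get_tls_redis_url(os.environ["REDIS_URL"]),
                                        ssl_cert_reqs=None)
    client.ping()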
I added a Redis instance to stage before merging to master. Strangely the new stage instance is running Redis 3.2.8, whereas prototype's instance is running 3.2.9. The Heroku docs say that new instances always use the latest version, so this is the opposite of what I would have expected. I've opened a ticket (ask for the share URL if desired):
https://help.heroku.com/tickets/547488
Summary: Investigate switching to Redis from memcached → Switch from Memcached to Redis
Looking at the Redis storage usage on prototype, it seems that the `premium-2` plan (256MB, 200 connections) will actually be enough after all. I've updated the prototype's addon to use that, and used that plan when I added a Redis instance to stage too.

I've removed the memcachier-tls-buildpack buildpack from prototype/stage:

$ thd buildpacks:remove -i 1
Buildpack removed. Next release on treeherder-prototype will use:
  1. heroku/nodejs
  2. heroku/python

$ ths buildpacks:remove -i 1
Buildpack removed. Next release on treeherder-stage will use:
  1. heroku/nodejs
  2. heroku/python

And the `TREEHERDER_MEMCACHED` environment variable (which was previously set to `localhost:11211,localhost:11211` due to the stunnel, and is now no longer required):

$ thd config:unset TREEHERDER_MEMCACHED
Unsetting TREEHERDER_MEMCACHED and restarting treeherder-prototype... done, v2082

$ ths config:unset TREEHERDER_MEMCACHED
Unsetting TREEHERDER_MEMCACHED and restarting treeherder-stage... done, v1562

And the Memcachier addon:

$ thd addons:destroy treeherder-dev-memcachier --confirm treeherder-prototype
Destroying treeherder-dev-memcachier on treeherder-prototype... done

$ ths addons:destroy treeherder-stage-memcachier --confirm treeherder-stage
Destroying treeherder-stage-memcachier on treeherder-stage... done

And removed the New Relic entries:
https://rpm.newrelic.com/accounts/677903/plugins/17846

And cleared the build cache (since we no longer need the buildpack cruft left in it):

$ thd repo:purge_cache
...
$ ths repo:purge_cache
...

---

Left to do:
* wait for a reply to https://help.heroku.com/tickets/547488
* create a Heroku Redis instance on prod
* deploy to prod
* wait N days to check everything is working
* perform the above cleanup steps on prod too
* decide whether we want to change the eviction policy from the default
This has now been deployed on production and is looking great - 80% drop in average query time :-)
Prod looking good, so I've performed the cleanup against prod too (same as in comment 14).
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I've changed prod's eviction policy from the default (no eviction, fail to insert once full) to `allkeys-lru`, to prevent possible issues in the future should our usage suddenly increase more than we expect (currently we're using 120MB/250MB).

$ thp redis:maxmemory --policy allkeys-lru
Maxmemory policy for treeherder-prod-redis (REDIS_URL) set to allkeys-lru.
allkeys-lru evict keys trying to remove the less recently used keys first.

However I've left prototype/stage at `noeviction` so we get early warnings (via New Relic exceptions about cache inserts) of needing to bump the storage allocated.

See:
https://data.heroku.com/datastores/15914e46-a9ee-4efd-b857-3ddc45df9d35
https://devcenter.heroku.com/articles/heroku-redis#maxmemory-policy
https://redis.io/topics/lru-cache
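For keeping an eye on the headroom, something like the following works (assuming the provider permits the INFO/CONFIG commands; if not, `heroku redis:info` reports the same details):

    import os

    import redis

    r = redis.StrictRedis.from_url(os.environ["REDIS_URL"])
    # Current memory usage and the active eviction policy.
    print(r.info("memory")["used_memory_human"])
    print(r.config_get("maxmemory-policy"))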