Closed
Bug 1384518
Opened 7 years ago
Closed 7 years ago
Switch from Memcached to Redis
Categories
(Tree Management :: Treeherder: Infrastructure, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: emorley)
Attachments
(3 files)
For caching with Django we currently use memcached - or more specifically, the pylibmc Python package, which uses libmemcached underneath.
However, there are quite a few issues:
* pylibmc doesn't support SSL natively (nor do any of the other libraries), so we have to use the memcachier-tls-buildpack, which has a bunch of issues (see bug 1300082)
* libmemcached is pretty buggy - eg:
- https://github.com/lericson/pylibmc/issues/202
- https://github.com/lericson/pylibmc/issues/218
- https://github.com/lericson/pylibmc/issues/219
* memcached isn't offered natively by Heroku, so we have to use a third party offering (memcachier), which is yet another third party dependency
* redis seems to be faster, more modern, and more widely recommended:
https://www.peterbe.com/plog/fastest-cache-backend-possible-for-django
As such I think we should switch to Redis.
Further reference, should we do so:
https://www.peterbe.com/plog/fastest-redis-optimization-for-django
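(For context: Django's cache API is backend-agnostic, so a switch like this is almost entirely an infrastructure/settings change. A minimal sketch, with a hypothetical key name, of the kind of call site that stays the same either way:)

# Illustrative only: application code uses Django's backend-agnostic cache
# API, so call sites like this are unchanged by a memcached -> Redis switch;
# only the CACHES setting and the client libraries differ.
from django.core.cache import cache

cache.set("last_push_id", 12345, 60 * 60)  # hypothetical key; 1 hour timeout
last_push_id = cache.get("last_push_id")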
Assignee
Comment 1•7 years ago
Continuing on from bug 1300082 comment 2:
> I've been able to test Heroku Redis' TLS (since it comes for free with any
> prod plans). It currently uses self-signed certs, REDIS_URL could be
> improved, and the docs were missing some details, so I've filed:
> https://help.heroku.com/tickets/396267
> https://help.heroku.com/tickets/396276
> https://help.heroku.com/tickets/396297
> https://github.com/heroku/heroku-buildpack-redis/issues/15
> https://github.com/heroku/heroku-buildpack-redis/issues/16
I've filed another Heroku ticket asking for what updates they have a year on:
https://help.heroku.com/tickets/511098
Assignee
Comment 2•7 years ago
(In reply to Ed Morley [:emorley] from comment #1)
> I've filed another Heroku ticket asking for what updates they have a year on:
> https://help.heroku.com/tickets/511098
"""
Hi Ed,
Unfortunately none of these have been implemented. There have been discussions around moving away from self-signed certificates, but no action has been taken at this stage.
"""
Assignee
Updated•7 years ago
Component: Treeherder → Treeherder: Infrastructure
Assignee
Comment 3•7 years ago
Followed by:
"""
I'm afraid that this work hasn't been scheduled as of yet. There have been some changes in the product team which works on our Data services such that this would be a good time to raise this issue again. I will definitely bring this into our next meeting but for the time being I don't have any more useful information for you. Apologies that I don't have any concrete answers for you on this.
"""
Assignee
Comment 4•7 years ago
Next steps for this:
1) Trying to decide between the two Django backends (both of which use redis-py underneath):
https://github.com/niwinz/django-redis
https://github.com/sebleier/django-redis-cache
See:
https://stackoverflow.com/questions/21932097/difference-between-django-redis-cache-and-django-redis-for-redis-caching-with-dj
https://github.com/niwinz/django-redis/issues/217
https://github.com/sebleier/django-redis-cache/issues/121
(At first glance, django-redis seems more actively maintained)
2) Picking between the Heroku elements addons for Redis:
https://elements.heroku.com/addons/heroku-redis
https://elements.heroku.com/addons/rediscloud
https://elements.heroku.com/addons/redisgreen
https://elements.heroku.com/addons/redistogo
Factors to consider:
* Do they support TLS (most don't)
* If they are third party (rather than Heroku's own "Heroku Redis"), are we fine with the added dependency?
(I'm strongly leaning towards Heroku Redis unless there's something drastically wrong with their offering)
3) For configuring TLS: do we use stunnel (eg via https://github.com/heroku/heroku-buildpack-redis, which is like the memcachier one but much improved), or connect directly using redis-py's newish native TLS support?
My preference is very much to avoid stunnel if at all possible (see the sketch after this list).
See:
https://devcenter.heroku.com/articles/securing-heroku-redis
https://github.com/andymccurdy/redis-py/issues/780
4) Deciding what size plan we need (based on storage and max connection requirements)
Presuming we pick Heroku Redis, we'll have to pick one of these:
https://elements.heroku.com/addons/heroku-redis
(I'm guessing "Premium 3" +/- a level)
5) To actually implement this, I'd:
* Modify the Vagrant/Travis setup scripts to install redis instead of memcached
* Update the requirements files to swap pylibmc and related packages for the redis alternatives
* Adjust settings.py to make Django use the new backend, pulling the necessary credentials and TLS options from environment variables (see https://devcenter.heroku.com/articles/heroku-redis#connecting-in-django and https://devcenter.heroku.com/articles/securing-heroku-redis#connecting-directly-to-stunnel and https://github.com/andymccurdy/redis-py/issues/780)
* Double-check for redis differences (eg do we need to set different default timeouts?)
* Make sure tests and ingestion work in Vagrant (note: iirc the tests use the in-memory cache backend) and that the tests pass on Travis
* Add the Heroku addon to prototype and deploy the branch there, ready for PR review
* Before merging, add the addons to stage/prod (renaming them to follow the convention we've used for the other addons)
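For step 3, here's a rough sketch of what connecting directly to the stunnel port with redis-py's native TLS support might look like (assumptions: the stunnel listens one port above the plain-text port, and cert verification must be disabled since Heroku Redis currently uses self-signed certs):

# A sketch, not production code: connect directly to Heroku Redis' stunnel
# port using redis-py's native TLS support.
import os
from urllib.parse import urlparse

import redis

url = urlparse(os.environ["REDIS_URL"])
client = redis.StrictRedis(
    host=url.hostname,
    port=url.port + 1,   # assumption: stunnel port = plain-text port + 1
    password=url.password,
    ssl=True,
    ssl_cert_reqs=None,  # self-signed certs, so skip verification
)
client.ping()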
Assignee
Comment 5•7 years ago
Oh and:
* Check what New Relic reports for cache get/set times compared to pylibmc+memcachier
* When this bug is fixed, close bug 1300082 and bug 1360516 as wontfix
Assignee
Comment 6•7 years ago
So I deployed my WIP to the prototype Heroku instance 25 mins ago, and the speed-up over memcached (or at least Memcachier) is pretty impressive:
https://screenshots.firefox.com/AAsK3EweR6yZOE1e/rpm.newrelic.com
So not only are we getting a more reliable/maintained/modern/feature-rich caching solution, it's three times faster :-)
Comment 7•7 years ago
Assignee
Updated•7 years ago
Attachment #8942371 - Flags: review?(cdawson)
Comment 8•7 years ago
Comment on attachment 8942371 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3131
Awesome! :)
Attachment #8942371 - Flags: review?(cdawson) → review+
Comment 9•7 years ago
Commits pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/f8a8188d262824c176d77820ef0b013d97afc2ed
Bug 1384518 - Tests: Clean up cache testing in test_setup.py
* Adjusts the test name and docstring, since there is nothing
memcached-specific about the test.
* Overrides the default timeout of never to 10 seconds, to prevent
values from previous test runs persisting.
https://github.com/mozilla/treeherder/commit/03027d8ba9f4f5c8a493cedb76dc911aa01782a1
Bug 1384518 - Tests: Use a cache key prefix when running tests
To prevent them clashing with keys created by running data ingestion
locally in Vagrant.
https://github.com/mozilla/treeherder/commit/56697adb1f044b3446cebd6c69aa7f8094467e6e
Bug 1384518 - Vagrant: Simplify .profile helper functions
Uses our custom `clear_cache` Django management command instead of
restarting memcached, since (a) for memcached it has the same effect,
and (b) it will continue to work even after we switch to Redis (which,
unlike memcached, persists cache contents across server restarts).
Removes the `threstartmemcached` command, since IMO it's no simpler
than just running `./manage.py clear_cache` on its own.
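(For reference, a `clear_cache` management command is typically just a thin wrapper around `cache.clear()`; a minimal sketch, not necessarily the actual Treeherder implementation:)

# management/commands/clear_cache.py - minimal sketch of a cache-clearing
# Django management command (illustrative only).
from django.core.cache import cache
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Clears the default Django cache"

    def handle(self, *args, **options):
        cache.clear()
        self.stdout.write("Cache cleared")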
https://github.com/mozilla/treeherder/commit/12ec94eb6a7506a92f7a97cd1e95935eba62dfb3
Bug 1384518 - Move server_supports_tls() out of settings.py
Since we'll soon have more settings utility functions to go alongside
it. Also renames it to `connection_should_use_tls()`.
https://github.com/mozilla/treeherder/commit/525eaeb1c1a8bf903201b2c40d869516406a1c11
Bug 1384518 - Add a utils.py helper for hostname parsing
https://github.com/mozilla/treeherder/commit/009e6e23546124a525952bcac13c2092a31054eb
Bug 1384518 - Remove explicit cache VERSION
We've never had a need to adjust it so far, and the default version
is already `1`, so there's no need to manually specify it:
https://docs.djangoproject.com/en/1.11/ref/settings/#version
https://github.com/mozilla/treeherder/commit/613dfdc12339637101adaa95b88ca0db01a94958
Bug 1384518 - Factor out last push ID cache key name
Reduces duplication and chance of the get/set key names not matching.
https://github.com/mozilla/treeherder/commit/a429764dec479140257477f6b47e10e0758ff5ad
Bug 1384518 - Use finite cache timeouts everywhere
Previously the `cache.set()` calls made by buildapi and pushlog
ingestion did not pass a timeout argument, so used the `settings.py`
global default of `None` (which means never expire).
There are a few disadvantages with this approach:
1) The `cache.set()` calls can't be fully interpreted on their own.
2) Never expiring means stale keys could be left behind (for example
the last push ID key, when a repository is removed).
3) If Django core functionality or third-party packages use the cache
but don't set their own timeout, they may be expecting the Django
default timeout of 5 minutes (or at least a finite timeout).
Now all cache writes in Treeherder explicitly set a timeout, and the
global timeout is restored to the Django default of 5 minutes:
https://docs.djangoproject.com/en/1.11/ref/settings/#timeout
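The two test-related commits above amount to something like the following hypothetical sketch (using pytest-django's `settings` fixture; not the actual Treeherder test setup):

# conftest.py - hypothetical sketch: give tests their own key prefix and a
# short, finite timeout, so values neither clash with local data ingestion
# nor persist between test runs.
import copy

import pytest


@pytest.fixture(autouse=True)
def isolated_cache_settings(settings):
    caches = copy.deepcopy(settings.CACHES)
    caches["default"]["KEY_PREFIX"] = "tests"
    caches["default"]["TIMEOUT"] = 10
    # Reassign (rather than mutate in place) so the fixture can restore the
    # original value after each test.
    settings.CACHES = caches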
Comment 10•7 years ago
Assignee
Updated•7 years ago
Attachment #8943343 - Flags: review?(cdawson)
Comment 11•7 years ago
Comment on attachment 8943343 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3145
Outstanding. :)
Attachment #8943343 - Flags: review?(cdawson) → review+
Comment 12•7 years ago
Commits pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/c36408c1f1fc17c483f3408adf9d28809c7d8e54
Bug 1384518 - Remove Django cache prefix
The prefix is unnecessary since prod/stage/prototype don't share cache
instances, and the tests now set their own prefix, so won't overlap
with values set during local data ingestion. Django's default prefix
is the empty string:
https://docs.djangoproject.com/en/1.11/ref/settings/#std:setting-CACHES-KEY_PREFIX
https://github.com/mozilla/treeherder/commit/d9909d5fa2d63eda6ce0b228df04577807ff1462
Bug 1384518 - Add a Redis server instance to Vagrant/Travis
In preparation for later commits, this installs `redis-server` in
the Vagrant environment, and starts the pre-installed (but stopped
by default) `redis-server` on Travis:
https://docs.travis-ci.com/user/database-setup/#Redis
Production/stage/prototype will use Heroku's own Redis add-on:
https://elements.heroku.com/addons/heroku-redis
https://github.com/mozilla/treeherder/commit/deb1fc86b77c74ad3a5861139221dc57ace58704
Bug 1384518 - Switch from Memcached to Redis
Since:
* Redis has additional features we need (eg for bug 1409679)
* the Redis server, python client and Django backend are more
actively maintained (so have persistent connections/pooling that
actually works, which gives a big perf win)
* we can use Heroku's own Redis addon rather than relying on a
third-party's (to hopefully prevent a repeat of the certificate
expiration downtime)
This commit:
* Switches pylibmc/django-pylibmc to redis-py/django-redis, and
configures the new backend according to:
http://niwinz.github.io/django-redis/latest/
https://github.com/andymccurdy/redis-py
* Uses redis-py's native support for TLS to connect to the Heroku
Redis server's stunnel port directly, avoiding the complexity of
using a buildpack to create an stunnel between the dyno and server:
https://devcenter.heroku.com/articles/securing-heroku-redis#connecting-directly-to-stunnel
* Uses explicit `REDIS_URL` values on Travis/Vagrant rather than
relying on the django-environ `default` value (for parity with
how we configure `DATABASE_URL` and others).
* Removes the pylibmc connection-closing workaround from `wsgi.py`.
Note: Whilst the Heroku docs suggest using `django-redis-cache`, it's
not as actively maintained, so we're using `django-redis` instead:
https://devcenter.heroku.com/articles/heroku-redis
https://github.com/niwinz/django-redis
https://github.com/sebleier/django-redis-cache
Before this is merged, Heroku Redis instances will need to be created
for stage/production (prototype done) - likely on plan `premium-3`:
https://elements.heroku.com/addons/heroku-redis
We'll also need to pick an eviction policy:
https://devcenter.heroku.com/articles/heroku-redis#maxmemory-policy
Once deployed, the `memcachier-tls-buildpack` buildpack will no longer
be required and can be removed from prototype/stage/prod, along with
the `TREEHERDER_MEMCACHED` environment variable and Memcachier addon.
https://github.com/mozilla/treeherder/commit/4d45f3cf5563fae4dee2c2a03c1afe3fd7a74461
Bug 1384518 - Vagrant: Remove memcached server and libraries
Since the memcached server and client headers are now unused.
Whilst the `zlib1g-dev` package is no longer needed by pylibmc, it's
still required as a sub-dependency of `libmysqlclient-dev`, so has
deliberately not been added to the `apt-get purge` cleanup step.
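Putting the main commit's pieces together, the resulting configuration looks roughly like this (an illustrative sketch; the option values are assumptions rather than the exact Treeherder settings):

# settings.py - illustrative django-redis configuration (not the exact
# Treeherder settings). REDIS_URL is provided by the Heroku Redis add-on;
# on Travis/Vagrant it's set explicitly, eg redis://localhost:6379.
import os

REDIS_URL = os.environ["REDIS_URL"]

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
        # Django's default of 5 minutes (see "Use finite cache timeouts
        # everywhere" above).
        "TIMEOUT": 300,
    }
}

# Assumption: when talking to Heroku Redis over TLS, skip certificate
# verification, since the certs are currently self-signed.
if REDIS_URL.startswith("rediss://"):
    CACHES["default"]["OPTIONS"]["CONNECTION_POOL_KWARGS"] = {
        "ssl_cert_reqs": None,
    }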
Assignee
Comment 13•7 years ago
I added a Redis instance to stage before merging to master. Strangely, the new stage instance is running Redis 3.2.8 whereas prototype's instance is running 3.2.9. The Heroku docs say that new instances always use the latest version, so this is the opposite of what I would have expected. I've opened a ticket (ask for the share URL if desired):
https://help.heroku.com/tickets/547488
Summary: Investigate switching to Redis from memcached → Switch from Memcached to Redis
Assignee
Comment 14•7 years ago
Looking at the Redis storage usage on prototype, it seems that the `premium-2` plan (256MB, 200 connections) will actually be enough after all. I've updated the prototype's addon to use that, and used that plan when I added a redis instance to stage too.
I've removed the memcachier-tls-buildpack buildpack from prototype/stage:
$ thd buildpacks:remove -i 1
Buildpack removed. Next release on treeherder-prototype will use:
1. heroku/nodejs
2. heroku/python
$ ths buildpacks:remove -i 1
Buildpack removed. Next release on treeherder-stage will use:
1. heroku/nodejs
2. heroku/python
And the `TREEHERDER_MEMCACHED` environment variable (it was previously set to `localhost:11211,localhost:11211` due to the stunnel, and is now no longer required):
$ thd config:unset TREEHERDER_MEMCACHED
Unsetting TREEHERDER_MEMCACHED and restarting treeherder-prototype... done, v2082
$ ths config:unset TREEHERDER_MEMCACHED
Unsetting TREEHERDER_MEMCACHED and restarting treeherder-stage... done, v1562
And the memcachier addon:
$ thd addons:destroy treeherder-dev-memcachier --confirm treeherder-prototype
Destroying treeherder-dev-memcachier on treeherder-prototype... done
$ ths addons:destroy treeherder-stage-memcachier --confirm treeherder-stage
Destroying treeherder-stage-memcachier on treeherder-stage... done
And removed the New Relic entries:
https://rpm.newrelic.com/accounts/677903/plugins/17846
And cleared the build cache (since we no longer need the buildpack cruft left in it):
$ thd repo:purge_cache
...
$ ths repo:purge_cache
...
---
Left to do:
* wait for reply to https://help.heroku.com/tickets/547488
* create heroku redis instance on prod
* deploy to prod
* wait N days to check everything working
* perform above cleanup steps on prod too
* decide whether we want to change the eviction policy from default
Assignee
Comment 15•7 years ago
This has now been deployed on production and is looking great - 80% drop in average query time :-)
Assignee
Comment 16•7 years ago
Prod looking good, so I've performed the cleanup against prod too (same as in comment 14).
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Assignee
Comment 17•7 years ago
I've changed prod's eviction policy from the default (no eviction; inserts fail once full) to `allkeys-lru`, to prevent possible issues in the future should our usage suddenly increase more than we expect (currently we're using 120MB/250MB).
$ thp redis:maxmemory --policy allkeys-lru
Maxmemory policy for treeherder-prod-redis (REDIS_URL) set to allkeys-lru.
allkeys-lru evict keys trying to remove the less recently used keys first.
However, I've left prototype/stage at `noeviction`, so we get early warning (via New Relic exceptions on cache inserts) when we need to bump the allocated storage.
See:
https://data.heroku.com/datastores/15914e46-a9ee-4efd-b857-3ddc45df9d35
https://devcenter.heroku.com/articles/heroku-redis#maxmemory-policy
https://redis.io/topics/lru-cache
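(A quick illustrative way to double-check the effective policy from a one-off dyno, assuming `REDIS_URL` is set and the add-on permits the `CONFIG GET` command:)

# Illustrative sanity check of the effective eviction policy.
import os

import redis

client = redis.StrictRedis.from_url(os.environ["REDIS_URL"])
print(client.config_get("maxmemory-policy"))
# eg: {'maxmemory-policy': 'allkeys-lru'}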