Closed Bug 776156 Opened 12 years ago Closed 10 years ago

cache machine relies on redis for the scale of zamboni

Categories

(addons.mozilla.org Graveyard :: Code Quality, defect, P5)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: wraithan, Unassigned)

References

Details

Currently Zamboni uses cache machine to do general caching and some semi-smart invalidation. It can use either memcached or redis for its invalidation. When using redis, it uses a set to keep track of all the queries that need to be invalidated when a given row changes. With memcached it has to retrieve the list, turn it into a set, add to the set, then put the whole thing back into memcached.
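Roughly, the two write paths look like this (a minimal sketch with hypothetical names, not cache-machine's actual internals):

```python
# Sketch of adding a cached-query key to a row's invalidation set.
# `flush_key` identifies the row; `query_key` is the cached query to
# drop when that row changes. Clients/names are illustrative only.

def add_to_flush_set_redis(redis_client, flush_key, query_key):
    # redis has native sets: one round trip, deduplication built in.
    redis_client.sadd(flush_key, query_key)

def add_to_flush_set_memcached(mc_client, flush_key, query_key):
    # memcached has no set type, so we read the whole list, dedupe in
    # Python, and write it all back: two round trips over the network,
    # plus a race window between the get and the set.
    members = set(mc_client.get(flush_key) or [])
    members.add(query_key)
    mc_client.set(flush_key, list(members))
```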

The problem with the memcached way is all the extra strain we put on memcached and the network when reading/writing larger invalidation sets like the ones for add-ons like Adblock Plus.

The problem with the redis way is that we don't have a good way to distribute it across availability zones. Even if we did, keeping redis around for this one part of the code base would leave us with another point of failure after we've been able to remove it from the rest of the codebase.

Potential solutions:
* append only to the invalidation set
  * this comes with the problem of growing past the 1MB bucket limit, and no deduplication.
    * partially solvable by writing to something that keeps track of the keys in a bucket and starts new buckets when the current one gets too big. (ugly/non-ideal)

* stop invalidating and just wait for timeout
  * The timeout is 8 minutes, which is janky. Users won't get immediate feedback when updating something that needs to invalidate the cache everywhere.

* sync redis instances, even if they are ephemeral (instead of persistent) across datacenters.
  * this is not ideal and adds a point of failure

* something else!
  * hopefully this is what we choose.
OS: Mac OS X → All
Hardware: x86 → All
Blocks: 749335
The cache machine is at https://github.com/jbalogh/django-cache-machine/.

We're currently using this to track which queries we need to invalidate when something changes.  For example, when a search for "privacy" returns "adblock plus" we append adblock's ID to set X, where set X is keyed by the md5sum of the "privacy" search query.  When adblock's description changes, we flush all the queries whose lists contain its ID (thus, instant changes on the site).
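As a rough sketch of that flow (key format and client names are hypothetical, and here the flush list is keyed by object ID so the lookup on change is direct):

```python
import hashlib

def query_key(sql):
    # The cache key for a query is the md5sum of its string form.
    return hashlib.md5(sql.encode('utf-8')).hexdigest()

def record_membership(redis_client, sql, obj_id):
    # The "privacy" search returned adblock: remember that this query
    # must be flushed whenever adblock changes.
    redis_client.sadd('flush:%s' % obj_id, query_key(sql))

def flush_object(redis_client, mc_client, obj_id):
    # adblock's description changed: drop every cached query that
    # contained it, so the change shows up on the site instantly.
    fk = 'flush:%s' % obj_id
    for key in redis_client.smembers(fk):
        mc_client.delete(key)
    redis_client.delete(fk)
```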

This method is used for all queries on the entire site - an option may be to pare this down to queries that are hit a lot and then just let the rest go through to the db slaves.

I'm CCing some people who will hopefully ask the right questions or suggest some other options. :)
Gonna ask some obvious and dumb questions here. Bear with me, sorry:

1) What's the match criteria for a hit? Which fields?
2) What is the underlying engine that is doing the matching if there isn't a cache hit?
3) How many records are in the potential set?
4) How many records need to come back for a query?
Oh, and 5) What sort of traffic scale (qpd) are we dealing with here? How long is the tail?
1) The match criteria is basically just an md5 of the string representation of the query.

5) We do about 1 billion MySQL queries, 2 billion redis operations, and 2.5 billion memcache gets/sets per day.
For match, I meant if we don't have a cache hit. For example, if I type 'chicken' into the search box, I get a couple chicken items, then a bunch of Chicago stuff. I was curious what the underlying engine for that was, and what fields it used.

Also, how often are the caches invalidated? It doesn't seem like Adblocker would update its description all that often.
(In reply to Toby Elliott [:telliott] from comment #5)
> For match, I meant if we don't have a cache hit. For example, if I type
> 'chicken' into the search box, I get a couple chicken items, then a bunch of
> Chicago stuff. I was curious what the underlying engine for that was, and
> what fields it used.

Search is powered by Elastic Search, which indexes the data out of the database.  Cache misses will hit elastic search.  The logic for what to return is an algorithm ES uses that combines certain fields with certain weights: description, summary, popularity, etc.

> Also, how often are the caches invalidated? It doesn't seem like Adblocker
> would update its description all that often.

This is the key here I think.  Our troubles are with the invalidation records, not with the actual query caches (if I understand it correctly).  I think oremj said there was a global invalidation timeout of 8 minutes but I'd need him to verify.

Maybe it would help to look past what we're currently doing and just look at the end goal:  To be able to have someone change the adblock plus description and have it updated instantly everywhere.  Or, as to your point about the description not changing often, consider the same scenario except with ratings - an add-on gets a new rating, we want that number to show up everywhere.
This is for generic ORM-level caching, so it applies to every MySQL query executed against the database (unless specifically excluded), not just search or certain views.

1) I believe it goes by returned object IDs, but I could be wrong; I need to dig into cache machine's source to see exactly what it is doing.

2) On a cache hit we go to memcache; on a miss we go to MySQL, then store the result in memcache and keep the list of memcached keys to invalidate when something changes in a redis set (sketched after this list).

3 & 4) I need to talk to clouserw about this and see if we can figure it out.
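Here's the rough read path from (2), with hypothetical names (the row shape and `run_query` are made up for illustration):

```python
import hashlib

def cached_query(mc_client, redis_client, sql, run_query):
    key = hashlib.md5(sql.encode('utf-8')).hexdigest()
    rows = mc_client.get(key)
    if rows is not None:
        return rows                       # hit: memcache only
    rows = run_query(sql)                 # miss: go to MySQL
    mc_client.set(key, rows)
    for row in rows:                      # remember what to flush later
        redis_client.sadd('flush:%s' % row['id'], key)
    return rows
```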
Going to attempt to build an invalidation backend that uses elastic search.
Assignee: nobody → xwraithanx
Target Milestone: --- → 2012-08-23
Patch is supposed to be done today.  Moving to next week for some load testing
Target Milestone: 2012-08-23 → 2012-08-30
I had what I thought was a complete patch last Friday, but I couldn't get all the unit tests to pass. I spoke with robhudson on Monday and found a flaw in my thinking. I ended up needing to use the ES Update API, which isn't in pyes or elasticutils.

I investigated adding it to them but found that elasticutils is looking to move off of pyes so I've ended up doing the queries directly against ES instead of using a library.
Forgot to add timeline stuff.

I have it written using queries I wrote myself, with a couple of tests still failing. Should be done today, and I'll ping alexis about getting it load tested.
Target Milestone: 2012-08-30 → 2012-09-06
Bumped to next week.  Wraithan is going to comment with a status update.
https://github.com/wraithan/django-cache-machine/commit/748819a

This passes all the tests, but it's a 10x slowdown when running them. I don't know how it performs in the real world as I didn't load test it. ES's update API doesn't work like I expected it to. Indexing isn't as instant as we need for search stuff, so you either have to access the elements one by one or refresh ES; both are slow.
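For reference, the direct calls involved look roughly like this (a sketch against the ES HTTP API of that era; the index, type, and field names are made up):

```python
import json
import urllib2

ES = 'http://localhost:9200'

def append_key(doc_id, cache_key):
    # ES Update API: run a script against the stored invalidation doc.
    body = {'script': 'ctx._source.keys += key',
            'params': {'key': cache_key}}
    urllib2.urlopen('%s/flush/doc/%s/_update' % (ES, doc_id),
                    json.dumps(body)).read()

def force_visible():
    # Updates aren't searchable until the next refresh, so reading
    # your own writes means forcing one; that's where the time goes.
    urllib2.urlopen(ES + '/flush/_refresh', '').read()
```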

For this use case I've found that ES won't work for us. I am still talking to the folks in #elasticutils about how I could make this faster, but I'm going to consider the ES backend a dead end.
Solutions that would work OK: use redis locally and, when invalidating, post to the other webheads to say you are invalidating X and they should too.

See where redis clustering is at and, if it exists, have a master in each datacenter and a slave for each master in all the datacenters, then iterate through them to get everything that should be invalidated. This gives us the speed of local writes for adding things to the invalidation lists, while invalidating something requires one write per master. We'd have to profile how often we invalidate compared to how often we add to the lists to see if this is acceptable.
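A sketch of what that invalidation pass could look like, assuming one redis-py client per datacenter master (names hypothetical):

```python
def invalidate_across_masters(masters, mc_client, flush_key):
    # Adds to the flush set stay local and fast; a full invalidation
    # costs one write per master (n writes for n datacenters).
    keys = set()
    for master in masters:
        keys |= set(master.smembers(flush_key))  # gather from all DCs
        master.delete(flush_key)                 # one write per master
    for key in keys:
        mc_client.delete(key)  # each DC runs this against its own memcache
```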

Other than this, we can see about improving the memcached backend and see if that works, and/or see if, with membase and the newer memcached pooling stuff, it is fast enough.
Oh, finally we can rip out cache-machine and do less in the way of generic caching and instead manually cache the hottest spots and see how that goes. But that sounds really scary.
The more I think about it, the more I'd like to see cache machine go away as we move to using only Elastic Search on the front end. 

That's where the traffic is; things like devhub and admin can access the backend as separate sites and will not have the traffic that necessitates much of cache machine, maybe just the odd sprinkling of caching here and there, as Wraithan says.

Now that brings up an interesting question of caching and invalidation for elastic search, but I think that's a different problem because of the data structures.
> Using redis locally and when invalidating, post to the other webheads to say you are invalidating X and they should too.

Sounds like coding the replication part of redis ourselves.

> Other than this, we can see about improving the memcached backend and see if 
> that works and/or seeing if with membase and the newer memcached pooling stuff
> it is fast enough?

The memcache pooling solved the moxi issue, and it's certainly faster since we don't create new sockets on each call.

So we could try to load test and see how it goes maybe?

Do you have a dump of the calls made to redis?

(In reply to Tarek Ziadé (:tarek) from comment #17)
> > Using redis locally and when invalidating, post to the other webheads to say you are invalidating X and they should too.
> 
> Sounds like coding the replication part of redis ourselves.
 
It isn't really replication; it's only sending the fact that the other datacenter should look up the set of memcached keys to invalidate in its local redis instance and then invalidate those.
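In other words, something like this on each side (URLs and the endpoint are hypothetical):

```python
import urllib2

OTHER_DCS = ['http://dc2.internal', 'http://dc3.internal']  # made up

def announce(flush_key):
    # Sender: just tell the other datacenters which flush set fired.
    for base in OTHER_DCS:
        urllib2.urlopen(base + '/invalidate', flush_key).read()

def on_invalidate(local_redis, mc_client, flush_key):
    # Receiver: consult the *local* redis for the memcached keys and
    # delete them; no redis data ever crosses the wire.
    for key in local_redis.smembers(flush_key):
        mc_client.delete(key)
    local_redis.delete(flush_key)
```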

> > Other than this, we can see about improving the memcached backend and see if 
> > that works and/or seeing if with membase and the newer memcached pooling stuff
> > it is fast enough?
> 
> The memcache pooling solved the moxi issue, and it's certainly faster since
> we don't create new sockets on each call.
> 
> So we could try to load test and see how it goes maybe?
> 
> Do you have a dump of the calls made to redis?

I need to remove the code that is still writing to redis before we can get a decent dump of redis call data. That is in bug 771593 and I am currently working on it. It will leave only cache-machine using redis.
The milestone is long past. This is still needed for multihomed Marketplace but appears to have been deprioritized so payments can get done.
Target Milestone: 2012-09-06 → ---
Priority: -- → P3
Assignee: wraithan → nobody
Is this going to be possible? Do we want to keep this bug open or close it and build something that allows us to do multi-region invalidation?
It's still valid, just not at the top of our todo list.
Priority: P3 → P5
Thanks for filing this.  Due to resource constraints we are closing bugs which we won't realistically be able to fix.  If you have a patch that applies to this bug please reopen.

For more info see http://micropipes.com/blog/2014/09/24/the-great-add-on-bug-triage/
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Product: addons.mozilla.org → addons.mozilla.org Graveyard