Closed Bug 776156 Opened 12 years ago Closed 10 years ago

cache machine relies on redis for the scale of zamboni

Categories

(addons.mozilla.org Graveyard :: Code Quality, defect, P5)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: wraithan, Unassigned)

References

Details

Currently Zamboni uses cache machine to do general caching and some semi-smart invalidation. It can use either memcached or redis for its invalidation. When using redis, it uses a set to keep track of all the queries that need to be invalidated when a given row changes. With memcached it has to retrieve the list, turn it into a set, add to the set, then put the whole thing back into memcached.
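Roughly, the two write paths look like this (a minimal sketch with hypothetical names, not cache-machine's actual internals):

```python
# Sketch of adding a cached-query key to a row's invalidation set.
# `flush_key` identifies the row; `query_key` is the cached query to
# drop when that row changes. Clients/names are illustrative only.

def add_to_flush_set_redis(redis_client, flush_key, query_key):
    # redis has native sets: one round trip, deduplication built in.
    redis_client.sadd(flush_key, query_key)

def add_to_flush_set_memcached(mc_client, flush_key, query_key):
    # memcached has no set type, so we read the whole list, dedupe in
    # Python, and write it all back: two round trips over the network,
    # plus a race window between the get and the set.
    members = set(mc_client.get(flush_key) or [])
    members.add(query_key)
    mc_client.set(flush_key, list(members))
```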

The problem with the memcached way is all the extra strain we put on memcached and the network when reading/writing larger invalidation sets like the ones for add-ons like Adblock Plus.

The problem with the redis way is that we don't have a good way to distribute it across availability zones. Even if we did, keeping redis around for this one part of the code base would leave us with another point of failure after we've been able to remove it from the rest of the codebase.

Potential solutions:
* append only to the invalidation set
  * this comes with the problem of growing past the 1MB bucket limit, and no deduplication.
    * partially solvable by writing to something that keeps track of the keys in a bucket and starts new buckets when the current one gets too big. (ugly/non-ideal)

* stop invalidating and just wait for timeout
  * The timeout is 8 minutes, which is janky. Users won't get immediate feedback when updating something that needs to invalidate the cache everywhere.

* sync redis instances, even if they are ephemeral (instead of persistent) across datacenters.
  * this is not ideal and adds a point of failure

* something else!
  * hopefully this is what we choose.
OS: Mac OS X → All
Hardware: x86 → All
Blocks: 749335
The cache machine is at https://github.com/jbalogh/django-cache-machine/.

We're currently using this to track which queries we need to invalidate when something changes.  For example, when a search for "privacy" returns "adblock plus" we append adblock's ID to set X, where set X is keyed by the md5sum of the "privacy" search query.  When adblock's description changes, we flush all the queries whose lists contain its ID (thus, instant changes on the site).
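As a rough sketch of that flow (key format and client names are hypothetical, and here the flush list is keyed by object ID so the lookup on change is direct):

```python
import hashlib

def query_key(sql):
    # The cache key for a query is the md5sum of its string form.
    return hashlib.md5(sql.encode('utf-8')).hexdigest()

def record_membership(redis_client, sql, obj_id):
    # The "privacy" search returned adblock: remember that this query
    # must be flushed whenever adblock changes.
    redis_client.sadd('flush:%s' % obj_id, query_key(sql))

def flush_object(redis_client, mc_client, obj_id):
    # adblock's description changed: drop every cached query that
    # contained it, so the change shows up on the site instantly.
    fk = 'flush:%s' % obj_id
    for key in redis_client.smembers(fk):
        mc_client.delete(key)
    redis_client.delete(fk)
```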

This method is used for all queries on the entire site - an option may be to pare this down to queries that are hit a lot and then just let the rest go through to the db slaves.

I'm CCing some people who will hopefully ask the right questions or suggest some other options. :)
Gonna ask some obvious and dumb questions here. Bear with me, sorry:

1) What's the match criteria for a hit? Which fields?
2) What is the underlying engine that is doing the matching if there isn't a cache hit?
3) How many records are in the potential set?
4) How many records need to come back for a query?
Oh, and 5) What sort of traffic scale (qpd) are we dealing with here? How long is the tail?
1) The match criteria is basically just an md5 of the string representation of the query.

5) We do about 1 billion MySQL queries, 2 billion redis operations, and 2.5 billion memcache gets/sets per day.
For match, I meant if we don't have a cache hit. For example, if I type 'chicken' into the search box, I get a couple chicken items, then a bunch of Chicago stuff. I was curious what the underlying engine for that was, and what fields it used.

Also, how often are the caches invalidated? It doesn't seem like Adblocker would update its description all that often.
(In reply to Toby Elliott [:telliott] from comment #5)
> For match, I meant if we don't have a cache hit. For example, if I type
> 'chicken' into the search box, I get a couple chicken items, then a bunch of
> Chicago stuff. I was curious what the underlying engine for that was, and
> what fields it used.

Search is powered by Elastic Search, which indexes the data out of the database.  Cache misses will hit elastic search.  The logic for what to return is an algorithm ES uses that combines certain fields with certain weights: description, summary, popularity, etc.

> Also, how often are the caches invalidated? It doesn't seem like Adblocker
> would update its description all that often.

This is the key here I think.  Our troubles are with the invalidation records, not with the actual query caches (if I understand it correctly).  I think oremj said there was a global invalidation timeout of 8 minutes but I'd need him to verify.

Maybe it would help to look past what we're currently doing and just look at the end goal:  To be able to have someone change the adblock plus description and have it updated instantly everywhere.  Or, as to your point about the description not changing often, consider the same scenario except with ratings - an add-on gets a new rating, we want that number to show up everywhere.
This is for generic ORM-level caching, so it applies to every MySQL query executed against the database (unless specifically excluded), not just search or certain views.

1) I believe it goes by returned object IDs, but I could be wrong; I need to dig into cache machine's source to see exactly what it is doing.

2) On a cache hit we go to memcache; on a miss we go to MySQL, then store the result in memcache and keep the list of memcached keys to invalidate when something changes in a redis set (sketched after this list).

3 & 4) I need to talk to clouserw about this and see if we can figure it out.
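Here's the rough read path from (2), with hypothetical names (the row shape and `run_query` are made up for illustration):

```python
import hashlib

def cached_query(mc_client, redis_client, sql, run_query):
    key = hashlib.md5(sql.encode('utf-8')).hexdigest()
    rows = mc_client.get(key)
    if rows is not None:
        return rows                       # hit: memcache only
    rows = run_query(sql)                 # miss: go to MySQL
    mc_client.set(key, rows)
    for row in rows:                      # remember what to flush later
        redis_client.sadd('flush:%s' % row['id'], key)
    return rows
```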
Going to attempt to build an invalidation backend that uses elastic search.
Assignee: nobody → xwraithanx
Target Milestone: --- → 2012-08-23
Patch is supposed to be done today.  Moving to next week for some load testing
Target Milestone: 2012-08-23 → 2012-08-30
I had what I thought was a complete patch last Friday, but I couldn't get all the unit tests to pass. I spoke with robhudson on Monday and found a flaw in my thinking. I ended up needing to use the ES Update API, which isn't in pyes or elasticutils.

I investigated adding it to them but found that elasticutils is looking to move off of pyes so I've ended up doing the queries directly against ES instead of using a library.
Forgot to add timeline stuff.

I have it written using queries I wrote myself, with a couple of tests still failing. Should be done today, and I'll ping alexis about getting it load tested.
Target Milestone: 2012-08-30 → 2012-09-06
Bumped to next week.  Wraithan is going to comment with a status update.
https://github.com/wraithan/django-cache-machine/commit/748819a

This passes all the tests, but it's a 10x slowdown when running them. I don't know how it performs in the real world as I didn't load test it. ES's update API doesn't work like I expected it to. Indexing isn't as instant as we need for search stuff, so you either have to access the elements one by one or refresh ES; both are slow.
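For reference, the direct calls involved look roughly like this (a sketch against the ES HTTP API of that era; the index, type, and field names are made up):

```python
import json
import urllib2

ES = 'http://localhost:9200'

def append_key(doc_id, cache_key):
    # ES Update API: run a script against the stored invalidation doc.
    body = {'script': 'ctx._source.keys += key',
            'params': {'key': cache_key}}
    urllib2.urlopen('%s/flush/doc/%s/_update' % (ES, doc_id),
                    json.dumps(body)).read()

def force_visible():
    # Updates aren't searchable until the next refresh, so reading
    # your own writes means forcing one; that's where the time goes.
    urllib2.urlopen(ES + '/flush/_refresh', '').read()
```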

For this use case I've found that ES won't work for us. I am still talking to the folks in #elasticutils about how I could make this faster, but I'm going to consider the ES backend a dead end.
Solutions that would work OK: use redis locally and, when invalidating, post to the other webheads to say you are invalidating X and they should too.

See where redis clustering is at and, if it exists, have a master in each datacenter and a slave for each master in all the datacenters, then iterate through them to get everything that should be invalidated. This gives us the speed of local writes for adding things to the invalidation lists, while invalidating something requires one write per master. We'd have to profile how often we invalidate compared to how often we add to the lists to see if this is acceptable.
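A sketch of what that invalidation pass could look like, assuming one redis-py client per datacenter master (names hypothetical):

```python
def invalidate_across_masters(masters, mc_client, flush_key):
    # Adds to the flush set stay local and fast; a full invalidation
    # costs one write per master (n writes for n datacenters).
    keys = set()
    for master in masters:
        keys |= set(master.smembers(flush_key))  # gather from all DCs
        master.delete(flush_key)                 # one write per master
    for key in keys:
        mc_client.delete(key)  # each DC runs this against its own memcache
```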

Other than this, we can see about improving the memcached backend and see if that works, and/or see if, with membase and the newer memcached pooling stuff, it is fast enough.
Oh, finally we can rip out cache-machine and do less in the way of generic caching and instead manually cache the hottest spots and see how that goes. But that sounds really scary.
The more I think about it, the more I'd like to see cache machine go away as we move to using only Elastic Search on the front end. 

That's where the traffic is; things like devhub and admin can access the backend as separate sites and will not have the traffic that necessitates much of cache machine, maybe just the odd sprinkling of caching here and there, as Wraithan says.

Now that brings up an interesting question of caching and invalidation for elastic search, but I think that's a different problem because of the data structures.
> Using redis locally and when invalidating, post to the other webheads to say you are invalidating X and they should too.

Sounds like coding the replication part of redis ourselves.

> Other than this, we can see about improving the memcached backend and see if 
> that works and/or seeing if with membase and the newer memcached pooling stuff
> it is fast enough?

The memcache pooling solved the moxi issue, and it's certainly faster since we don't create new sockets on each call.

So we could try to load test and see how it goes maybe?

Do you have a dump of the calls made to redis?

(In reply to Tarek Ziadé (:tarek) from comment #17)
> > Using redis locally and when invalidating, post to the other webheads to say you are invalidating X and they should too.
> 
> Sounds like coding the replication part of redis ourselves.
 
It isn't really replication; it's only sending the fact that the other datacenter should look up the set of memcached keys to invalidate in its local redis instance and then invalidate those.
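In other words, something like this on each side (URLs and the endpoint are hypothetical):

```python
import urllib2

OTHER_DCS = ['http://dc2.internal', 'http://dc3.internal']  # made up

def announce(flush_key):
    # Sender: just tell the other datacenters which flush set fired.
    for base in OTHER_DCS:
        urllib2.urlopen(base + '/invalidate', flush_key).read()

def on_invalidate(local_redis, mc_client, flush_key):
    # Receiver: consult the *local* redis for the memcached keys and
    # delete them; no redis data ever crosses the wire.
    for key in local_redis.smembers(flush_key):
        mc_client.delete(key)
    local_redis.delete(flush_key)
```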

> > Other than this, we can see about improving the memcached backend and see if 
> > that works and/or seeing if with membase and the newer memcached pooling stuff
> > it is fast enough?
> 
> The memcache pooling solved the moxi issue, and it's certainly faster since
> we don't create new sockets on each call.
> 
> So we could try to load test and see how it goes maybe?
> 
> Do you have a dump of the calls made to redis?

I need to remove the code that is still writing to redis before we can get a decent dump of redis call data. That is in bug 771593 and I am currently working on it. It will leave only cache-machine using redis.
The milestone is long past. This is still needed for multihomed Marketplace but appears to have been deprioritized so payments can get done.
Target Milestone: 2012-09-06 → ---
Priority: -- → P3
Assignee: wraithan → nobody
Is this going to be possible? Do we want to keep this bug open or close it and build something that allows us to do multi-region invalidation?
It's still valid, just not at the top of our todo list.
Priority: P3 → P5
Thanks for filing this.  Due to resource constraints we are closing bugs which we won't realistically be able to fix.  If you have a patch that applies to this bug please reopen.

For more info see http://micropipes.com/blog/2014/09/24/the-great-add-on-bug-triage/
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Product: addons.mozilla.org → addons.mozilla.org Graveyard