How much is caching helping us for raw and processed crashes

Status: NEW
Assignee: Unassigned
Product: Socorro
Component: Webapp
Opened: 2 years ago
Last updated: 9 months ago

People

(Reporter: peterbe, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Attachments

(7 attachments)

(Reporter)

Description

2 years ago
Created attachment 8798433 [details]
Screen Shot 2016-10-06 at 10.34.37 AM.png

Some quick inspection of the New Relic graphs indicates that a LOT of time is spent in memcache set and memcache get for the report indexing. About 33% by a rough estimate. 

For every fetch on the RawCrash and UnredactedCrash models we cache the result in memcache for 1 hour. The presumption is that memcache is faster to read from than S3. But how true is that? It's impossible to test locally because you'd be running memcache on the same machine while the S3 bucket is in the AWS datacenter. 
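For context, the fetch path in question is roughly the following. This is a minimal sketch using Django's cache API; the key format and the fetch_from_s3() helper are made up for illustration and are not Socorro's actual code.

    from django.core.cache import cache

    CACHE_SECONDS = 60 * 60  # the 1 hour cache mentioned above


    def fetch_from_s3(crash_id):
        # Placeholder for the real S3 lookup (e.g. via boto); illustration only.
        raise NotImplementedError


    def fetch_raw_crash(crash_id):
        # Check memcache first, fall back to S3 on a miss and cache the result.
        key = 'raw_crash:%s' % crash_id
        data = cache.get(key)
        if data is None:
            data = fetch_from_s3(crash_id)
            cache.set(key, data, CACHE_SECONDS)
        return data

On a hit you pay one cache.get(); on a miss you pay the cache.get(), the S3 fetch AND a cache.set(), so the whole question is whether the hit rate and the memcache read speed justify that overhead.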

DDoS attacks or mild stampeding herds (or mis-configured curl scripts) are a motivating factor for heavy caching. But we have rate limiting on every web view that blocks excessive resource usage.
(Reporter)

Comment 1

2 years ago
Ultimately the motivation of this bug is...

1) Figure out whether memcache is actually making the public API and the report index web views faster (specifically for raw and processed crash JSON).
2) If it isn't, do something about it to make those views much faster.
(Reporter)

Comment 2

2 years ago
Created attachment 8798435 [details]
Screen Shot 2016-10-06 at 10.33.06 AM.png

We spend an awful lot of time doing memcache set across the whole app.
(Reporter)

Comment 3

2 years ago
Created attachment 8815399 [details]
Screen Shot 2016-11-29 at 1.55.32 PM.png

Here's more fuel for the bug. All that blue is due to Top Crasher depending on SuperSearch, whose parameters always include a date parameter that looks something like this:

  u'date': ['>=2016-11-15T18:57:26+00:00', '<2016-11-22T18:57:26+00:00']

Every time I refresh the Top Crasher page, that date parameter changes with second-level granularity, so no two page loads produce the same parameters.
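To illustrate why that matters for caching, here's a sketch under the assumption that the cache key is derived from the serialized parameters (the key format here is invented, not what the webapp actually uses):

    import hashlib
    import json


    def make_cache_key(model_name, params):
        # If the key hashes the full parameter dict, a timestamp with second
        # granularity produces a brand new key (and a guaranteed miss) on
        # every page load.
        serialized = json.dumps(params, sort_keys=True)
        digest = hashlib.md5(serialized.encode('utf-8')).hexdigest()
        return '%s:%s' % (model_name, digest)


    key1 = make_cache_key('SuperSearch', {
        'date': ['>=2016-11-15T18:57:26+00:00', '<2016-11-22T18:57:26+00:00'],
    })
    key2 = make_cache_key('SuperSearch', {
        'date': ['>=2016-11-15T18:57:27+00:00', '<2016-11-22T18:57:27+00:00'],
    })
    assert key1 != key2  # two refreshes one second apart never share a cache entry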
(Reporter)

Comment 4

2 years ago
Created attachment 8815401 [details]
Screen Shot 2016-11-29 at 1.59.15 PM.png

Here's the breakdown for all model_wrapper calls. In other words, 38.1% of the time for every model queried is spent talking to memcached.
(Reporter)

Comment 5

2 years ago
Created attachment 8815407 [details]
Screen Shot 2016-11-29 at 2.00.52 PM.png

Of the "heavy" endpoints...

* SuperSearch:           11% cache hits
* SuperSearchUnredacted: 11.5% cache hits
* ProcessedCrash:        0.4% cache hits
* RawCrash:              64% cache hits
* UnredactedCrash:       62% cache hits
* ADI:                   33% cache hits

Another very important thing to remember is that for RawCrash, ProcessedCrash and UnredactedCrash (i.e. the components of the report index pages), the caching is likely to be just a hindrance if you've done reprocessing. 

Also note that every time there's a cache MISS, not only do we do a "failing" cache.get(), we also have to do a cache.set(), which possibly needs to send a LOT of data over to memcached.
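A quick way to see what that costs would be something like this (an illustrative sketch, not existing Socorro code; it just prints the numbers):

    import json
    import time

    from django.core.cache import cache


    def timed_cache_set(key, value, seconds):
        # On every miss we pay for a cache.set() on top of the failed
        # cache.get(); for a big processed crash that's a lot of bytes
        # shipped to memcached.
        size = len(json.dumps(value))
        start = time.time()
        cache.set(key, value, seconds)
        elapsed_ms = (time.time() - start) * 1000
        print('cache.set(%s): %d bytes in %.1f ms' % (key, size, elapsed_ms))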
(Reporter)

Comment 7

2 years ago
At this point I'm confident the caching is NOT helping us. 
The only strange one is ADI. ADI is used for the home page and the Crashes per Day page, and its parameters use a date without a time. We could increase the cache time on that one, but it might actually be detrimental for people who click around a couple of hours after UTC midnight.
Assignee: nobody → peterbe
We talked about this in the meeting. Figured I'd codify my thoughts in a comment.

The screenshots of cache hit/miss data are interesting, but they only tell you the total number of cache hits/misses over an unspecified period of time. They don't surface things like "we had a spike of hits on such-and-such a date". Maybe memcache helps a lot in the week or two after a release? I'm not sure we can discern that from the analysis done so far.

I think we should add a statsd.histogram() call for hits and misses so that we can see hits/misses over time in Datadog. That'd tell us a lot more about what's going on.
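Something along these lines, assuming the Datadog statsd client (the metric name and tags are made up; a tagged counter gives the same over-time view in Datadog that a histogram would here):

    from datadog import statsd


    def instrumented_cache_get(cache, key, model_name):
        # Record every get as a hit or a miss so Datadog can graph the
        # ratio over time (e.g. around releases).
        value = cache.get(key)
        result = 'hit' if value is not None else 'miss'
        statsd.increment(
            'webapp.crash_cache.get',
            tags=['model:%s' % model_name, 'result:%s' % result],
        )
        return value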

Having said that, this is a lot of work overall and I think the current situation is fine, so I'm not sure the impact is compelling enough to keep working on this right now.
(Reporter)

Updated

9 months ago
Assignee: peterbe → nobody