audit cache usage in webapp
Categories
(Socorro :: Webapp, task, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
The Crash Stats webapp uses memcached as a caching backend. Recently, we had a case where we hit the max limit in the cache and then the site had an outage while the cache cleaned itself up.
Why'd we hit a max limit? What's in the cache? How long are things in the cache for?
Comment 1 (Assignee) • 3 years ago
Grabbing this to look into. We want to audit caching in the webapp and then potentially switch to Redis.
Comment 2 (Assignee) • 2 years ago
I spent a few hours looking at all the caching code across all the Socorro services.
Antenna (collector) doesn't use memcache.
Processor doesn't use memcache.
Webapp / Crontabber:
Some code uses django.utils.functional.cached_property, but that caches in memory and not in memcache.
- crashstats/crashstats/utils.py:
  - caches versions for product for menus
- crashstats/crashstats/views.py:
  - uses RawCrash and ProcessedCrash APIs which cache data from S3:
    - is fetching data from S3 expensive? why are we caching this?
    - maybe we only cache the view output for anonymous users? then we avoid possible thundering herd problems with generating the data.
    - maybe we change the caching time to 5 minutes?
  - caches whether it sent a priority job reprocessing request or not
- crashstats/crashstats/templatetags/jinja_helpers.py:
  - looks at the cache for bug details--something else is putting the data in cache
  - the BugzillaBugInfo model puts bug information data in the cache
  - looks like nothing else puts data in this cache--this is weird and probably worth revisiting
- crashstats/crashstats/models.py:
  - get_api_allowlist isn't used--we can remove it
  - measure_fetches uses the cache as a scoreboard for cache hits/misses and timings which you can view in the admin
    - do we need this? we get some/most/all of this information in Grafana already
  - SocorroCommon.fetch() and SocorroMiddleware.fetch() cache results for cache_seconds, which defaults to 1 hour:
    - all the RawCrash fetches (report view, RawCrash API)
    - all the ProcessedCrash fetches (report view, ProcessedCrash API)
    - all SuperSearch fetches (used in tons of stuff, super search view, SuperSearch API)
- crashstats/supersearch/models.py:
  - get_api_allowlist caches the list of fields in cache for 1 hour:
    - since this isn't in the database anymore, we should switch this to cache in memory
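
For readers unfamiliar with the pattern, the fetch-level caching described above can be sketched roughly as follows. This is not Socorro's code: TTLCache, fetch, and the loader argument are hypothetical names, and a plain dict stands in for memcached. It only illustrates how a cache_seconds value controls how long a fetched result is reused.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a memcached-style cache (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._data[key]
            return None
        return value

    def set(self, key, value, timeout):
        self._data[key] = (value, self._clock() + timeout)


def fetch(cache, key, loader, cache_seconds=60 * 60):
    """Return the cached value for key, calling loader() only on a miss.

    Note: a loader that returns None is never cached in this sketch.
    """
    value = cache.get(key)
    if value is None:
        value = loader()
        cache.set(key, value, cache_seconds)
    return value
```

With a shorter cache_seconds (say 300 instead of 3600), repeated requests that arrive close together still share one fetch, but entries expire sooner and so occupy the cache for less time.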
Things we should change:
- drop default cache time for SocorroCommon from 1 hour to 5 minutes; we want it for thundering herd issues, but caching for an hour is excessive
- remove crashstats/crashstats/models.py:get_api_allowlist because it's not used
- rewrite crashstats/supersearch/models.py:get_api_allowlist so it stores results in memory
- remove measure_fetches and the analyze model fetches admin page
- (maybe) stop caching RawCrash and ProcessedCrash because the cache isn't saving work
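
As an illustration of the "stores results in memory" item, here is a minimal sketch using functools.lru_cache from the standard library. The FIELDS data and the is_exposed key are made up for the example, and this get_api_allowlist is a stand-in, not the real function; it assumes the field definitions are static for the life of the process.

```python
from functools import lru_cache

# Hypothetical field definitions; the real ones live elsewhere in the codebase.
FIELDS = {
    "signature": {"is_exposed": True},
    "product": {"is_exposed": True},
    "email": {"is_exposed": False},
}


@lru_cache(maxsize=1)
def get_api_allowlist():
    """Compute the allowlist once per process, then serve it from memory."""
    return tuple(
        sorted(name for name, meta in FIELDS.items() if meta["is_exposed"])
    )
```

Because the result lives in process memory, it takes no space in memcached and needs no expiry; it is rebuilt whenever the process restarts.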
I'll write up bugs for those things and start working through them.
Comment 3 (Assignee) • 2 years ago
I finished the audit and broke out the work into separate bugs, so I'm done with this.
Comment 4 (Assignee) • 2 years ago
I wrote up a post-mortem document covering the impetus for the audit, the audit itself, action items, and post-mortem analysis of whether it worked out.
Summary: we reduced usage of the cache without adversely affecting the webapp CPU usage or page view timings.