Closed Bug 1794140 Opened 3 years ago Closed 2 years ago

audit cache usage in webapp

Categories

(Socorro :: Webapp, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

The Crash Stats webapp uses memcached as a caching backend. Recently, we hit the cache's max memory limit and the site had an outage while the cache evicted items to clean itself up.

Why'd we hit a max limit? What's in the cache? How long are things in the cache for?

Grabbing this to look into. We want to audit caching in the webapp and then potentially switch to Redis.

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P2

I spent a few hours looking at all the caching code across all the Socorro services.

Antenna (collector) doesn't use memcache.

Processor doesn't use memcache.

Webapp / Crontabber:

Some code uses django.utils.functional.cached_property, but that caches in memory on the instance, not in memcache.
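To illustrate the distinction, here's a minimal sketch using the stdlib functools.cached_property, which behaves like Django's: the value is stored on the instance in process memory, so it never touches memcached and disappears when the process dies. The class and field names are made up for the example.

```python
from functools import cached_property


class VersionMenu:
    """Illustrates per-instance, in-process caching (like
    django.utils.functional.cached_property): the computed value is
    stored in the instance's __dict__, not in memcached."""

    def __init__(self):
        self.fetch_count = 0

    @cached_property
    def versions(self):
        # Pretend this is an expensive lookup (e.g. product versions
        # for the menus).
        self.fetch_count += 1
        return ["100.0", "101.0"]


menu = VersionMenu()
menu.versions  # computes the value and stores it on the instance
menu.versions  # served from the instance; the method doesn't run again
assert menu.fetch_count == 1
```

This means such caches can't contribute to memcached filling up, so they're out of scope for this audit.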

  • crashstats/crashstats/utils.py:
    • caches versions for product for menus
  • crashstats/crashstats/views.py:
    • uses RawCrash and ProcessedCrash APIs which cache data from S3:
      • is fetching data from S3 expensive? why are we caching this?
      • maybe we only cache the view output for anonymous users? then we avoid possible thundering herd problems with generating the data.
      • maybe we change the caching time to 5 minutes?
    • caches whether it sent a priority job reprocessing request or not
  • crashstats/crashstats/templatetags/jinja_helpers.py:
    • looks at the cache for bug details--something else is putting the data in cache
    • the BugzillaBugInfo model puts bug information data in the cache
    • looks like nothing else puts data in this cache--this is weird and probably worth revisiting
  • crashstats/crashstats/models.py:
    • get_api_allowlist isn't used--we can remove it
    • measure_fetches uses the cache as a scoreboard for cache hits/misses and timings which you can view in the admin
      • do we need this? we get some/most/all of this information in Grafana already
    • SocorroCommon.fetch() and SocorroMiddleware.fetch() caches results for cache_seconds which defaults to 1 hour:
      • all the RawCrash fetches (report view, RawCrash API)
      • all the ProcessedCrash fetches (report view, ProcessedCrash API)
      • all SuperSearch fetches (used in tons of stuff, super search view, SuperSearch API)
  • crashstats/supersearch/models.py:
    • get_api_allowlist caches the list of fields in cache for 1 hour:
      • since this isn't in the database anymore, we should switch this to cache in memory
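Since the SuperSearch field list is static now (no database round-trip), a module-level in-memory cache with a TTL would do the job without touching memcached. A hypothetical sketch, with illustrative names that aren't Socorro's actual code:

```python
import time

# Module-level cache: lives in process memory, not in memcached.
_ALLOWLIST_CACHE = {"value": None, "expires": 0.0}
TTL_SECONDS = 60 * 60  # match the current 1-hour window


def _build_field_list():
    # Stand-in for assembling the SuperSearch public field list.
    return ["signature", "product", "version"]


def get_api_allowlist():
    """Return the field list, rebuilding it at most once per TTL."""
    now = time.time()
    if _ALLOWLIST_CACHE["value"] is None or now >= _ALLOWLIST_CACHE["expires"]:
        _ALLOWLIST_CACHE["value"] = _build_field_list()
        _ALLOWLIST_CACHE["expires"] = now + TTL_SECONDS
    return _ALLOWLIST_CACHE["value"]
```

Since the list only changes on deploy, the TTL is arguably unnecessary; it's kept here only to mirror the existing 1-hour behavior.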

Things we should change:

  1. drop default cache time for SocorroCommon from 1 hour to 5 minutes; we want it for thundering herd issues, but caching for an hour is excessive
  2. remove crashstats/crashstats/models.py:get_api_allowlist because it's not used
  3. rewrite crashstats/supersearch/models.py:get_api_allowlist so it stores results in memory
  4. remove measure_fetches and the analyze model fetches admin page
  5. (maybe) stop caching RawCrash and ProcessedCrash because the cache isn't saving work
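Item 1 amounts to changing a class-level default. A hypothetical sketch of what that looks like; the attribute name cache_seconds comes from the audit above, but the class body is illustrative, not Socorro's real middleware:

```python
class SocorroMiddlewareSketch:
    """Illustrative stand-in for the middleware base class.

    Previously the default was 60 * 60 (1 hour). A 5-minute window
    still absorbs thundering-herd bursts of identical fetches, but
    stops stale results from lingering in memcached for an hour.
    """

    cache_seconds = 5 * 60

    def cache_key(self, url):
        # Sketch of deriving a cache key from the fetched URL.
        return "fetch:%s" % url
```

Individual models that genuinely need a longer window can still override cache_seconds on their own class.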

I'll write up bugs for those things and start working through them.

Depends on: 1808591
Depends on: 1808592
Depends on: 1808593
Depends on: 1808594
Depends on: 1808595

I finished the audit and broke out the work into separate bugs, so I'm done with this.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

I wrote up a post-mortem document covering the impetus for the audit, the audit itself, action items, and post-mortem analysis of whether it worked out.

Summary: we reduced usage of the cache without adversely affecting the webapp CPU usage or page view timings.
