remove django-cache-memoize
Categories: Tecken :: General, task, P2
People: Reporter: willkg; Assigned: bdanforth
References: Blocks 1 open bug
Attachments: 1 file
Python has a memoization decorator, functools.lru_cache, that you can use to wrap a function so that it caches the arguments -> return value mapping. Future calls of the function with the same arguments pull from the cache.
Tecken uses a cache_memoize decorator that does roughly the same thing, except that it uses Django's caching system, so calls are cached across the cluster.
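A minimal sketch of the difference (the function names here are hypothetical, not Tecken code):

```python
import functools

from cache_memoize import cache_memoize  # provided by django-cache-memoize


def expensive_lookup(key):
    ...  # stand-in for real work, e.g. a network request


@functools.lru_cache(maxsize=1024)
def lookup_local(key):
    # Cached in this process's memory only; every instance warms its own cache.
    return expensive_lookup(key)


@cache_memoize(60 * 60 * 6)  # 6-hour TTL in the Django cache
def lookup_shared(key):
    # Cached in the configured Django cache backend (e.g. Redis),
    # so all instances in the cluster share the cached results.
    return expensive_lookup(key)
```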
It kind of looks like that library is unmaintained now. Further, I claim that the caching is being done in the wrong place: instead of being done in the download/upload code, it should be done in the storage code, with each storage item in charge of its own caching.
One big advantage of this is that it makes it much easier to invalidate the correct cache entries when the need arises.
Comment 1 (Reporter) • 2 years ago
Grabbing this because it blocks refactoring the object storage interface.
Comment 2 (Reporter) • 2 years ago
Bug 1564452 will cover redoing the storage API layer. If that work drops this, then that's cool, but this doesn't need to block that bug. Further, I'm not going to work on it any time soon, so I'm unassigning myself.
Comment 3 (Assignee) • 2 years ago
:willkg: I paired with Sven on this earlier this morning, and based on our findings so far we have a proposal we'd like your opinion on. What do you think about the options below and the recommendation?
TL;DR: For various reasons, the built-in lru_cache from functools is not a good replacement for the django-cache-memoize library. Can we instead forgo caching entirely in this case to simplify the code and infrastructure (i.e. remove Redis)? We estimate an average increase of about 30 ms per download request roundtrip (which is typically on the order of hundreds of ms) [1].
Why one cannot simply replace cache_memoize with lru_cache
- cache_memoize uses the Django cache layer with a Redis backend, which is shared across Tecken instances. By contrast, lru_cache is an in-memory cache that is per-instance. This means we would have less caching overall, as the same request isn't guaranteed to go to the same instance. Currently about 75% of requests are cache hits [2].
- Currently, we cache everything for 6 hours and invalidate individual keys on file upload [3].
- lru_cache does not provide a way to invalidate a single item in the cache. You can only specify the maxsize (i.e. the max number of items in the cache) and clear the entire cache.
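To make the invalidation difference concrete, here's a minimal sketch; the function and bucket/key names are hypothetical, and it assumes django-cache-memoize's documented .invalidate() helper:

```python
import functools

from cache_memoize import cache_memoize  # provided by django-cache-memoize


@cache_memoize(60 * 60 * 6)  # shared Django cache, 6-hour TTL
def file_exists(bucket, key):
    ...  # e.g. an S3 HEAD request


# Per-key invalidation: drop just this entry when the file is re-uploaded.
file_exists.invalidate("my-bucket", "some/file.sym")


@functools.lru_cache(maxsize=4096)
def file_exists_local(bucket, key):
    ...  # same check, but cached per process


# lru_cache offers no per-key invalidation, only wholesale clearing.
file_exists_local.cache_clear()
```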
Options
Option 1: Remove caching entirely
PROs
- Removes unmaintained library
- Less complexity in the code
- Less complexity in our infrastructure (we could remove Redis)
- No need for cache hit/cache miss metrics
CONs
- Download requests are ~30ms slower
Option 2: Replace cache_memoize with Python lru_cache
PROs
- Removes unmaintained library
- Some caching still occurs
- Less complexity in our infrastructure (we could remove Redis)
CONs
- Less caching than before, since the cache is per-instance
- No control over how long items are cached for
- Cannot invalidate an individual item
- No cache hit/cache miss metrics out of the box
Option 3: Write our own wrapper around django_redis.cache.RedisCache
PROs
- Removes unmaintained library
- Same level of caching as before
CONs
- Need to write/maintain our own wrapper code
- No cache hit/cache miss metrics out of the box
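For concreteness, a rough sketch of what such a wrapper could look like, using Django's standard cache API; the key scheme and names are assumptions, not actual Tecken code:

```python
import functools
import hashlib

from django.core.cache import cache  # backed by django_redis.cache.RedisCache

_MISS = object()  # local sentinel so cached None/falsy values still count as hits


def memoize(timeout):
    """Cache a function's return value in the Django cache for `timeout` seconds."""

    def decorator(func):
        def make_key(args, kwargs):
            raw = f"{func.__module__}.{func.__qualname__}:{args!r}:{kwargs!r}"
            return "memoize:" + hashlib.md5(raw.encode("utf-8")).hexdigest()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = make_key(args, kwargs)
            result = cache.get(key, _MISS)
            if result is _MISS:
                result = func(*args, **kwargs)
                cache.set(key, result, timeout)
            return result

        # Per-key invalidation, mirroring django-cache-memoize's .invalidate().
        wrapper.invalidate = lambda *a, **kw: cache.delete(make_key(a, kw))
        return wrapper

    return decorator
```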
Recommendation: Option 1
The simplest solution is to remove the library without replacement. Download requests will be slightly slower, but we could remove this code, all the Redis infrastructure code, any cache monitoring panels/alerts, and any further maintenance burden.
[1]: This decorator is used only for the HEAD request to S3 that checks an object exists for symbols download and upload, which is one of multiple requests made as part of a download or upload request. The 30 ms estimate follows from the mean SymbolDownloader timing of 40 ms and the fact that 75% of these S3 HEAD requests are currently cache hits (see [2]): the requests that used to skip the HEAD call will now pay for it, for an expected average increase of roughly 0.75 × 40 ms = 30 ms.
[2]: Based on the SymbolDownloader Cache Hits panel in the Tecken App Metrics dashboard in Grafana
[3]: Based on its use in the get_last_modified function; the value of SYMBOLDOWNLOAD_EXISTS_TTL_SECONDS is set in settings.py. Based on get_file_md5_hash, we invalidate a key on upload.
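For context, the cached call is essentially an S3 existence check. A minimal sketch with boto3 (bucket/key names are hypothetical; Tecken's actual code differs):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def object_exists(bucket: str, key: str) -> bool:
    """Return True if the object exists, via a single HEAD request (~tens of ms)."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as exc:
        # head_object reports a missing key as HTTP 404.
        if exc.response["Error"]["Code"] == "404":
            return False
        raise
```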
Comment 4 (Reporter) • 2 years ago
Interesting! When I wrote up this bug, I was thinking I'd remove the cache-memoize code and replace it with caching in the storage backend, because that seemed straightforward to implement and maintain and the download API is heavily used.
Based on this analysis, option 1 seems fine. Eliot and the Socorro processor both use the Tecken download API and would be affected. My gut feeling is that Socorro processor throughput won't change much because (if I recall correctly) the stackwalker parallelizes symbol file downloads. Eliot symbolication API might be a different story.
We should build a list of things to keep an eye on so we know if removing the cache degraded something more than we would like.
Comment 5 (Assignee) • 2 years ago
Thank you Willkg for your thoughts. FYI, I added some references to my previous comment to help explain what I'm basing the data/estimates on.
> We should build a list of things to keep an eye on so we know if removing the cache degraded something more than we would like.
For Eliot, we could monitor the existing SYM download timings panel in the Eliot App Metrics dashboard. For the last 90 days, I estimate the mean download timing to be around 250 ms. However, IIUC Eliot can download SYM files from sources other than Tecken, and I don't see that this metric can currently be broken down by source. Is there another way to determine symbols download timings with Tecken as the source? If not, should this dimension be added to the existing metric?
For the Socorro processor, I'm less certain that there's an existing metric we could watch, though from reading the processor docs (and looking at the code and the stackwalker's code), the stackwalker does cache fully downloaded SYM files. I saw that there's a Stackwalker timings panel in the Socorro Prod App Metrics dashboard, though I imagine those timings measure the symbolication process in its entirety, and it would be difficult to suss out the component related to downloading a SYM file.
> My gut feeling is that Socorro processor throughput won't change much because (if I recall correctly) the stackwalker parallelizes symbol file downloads. Eliot symbolication API might be a different story.
I can leave this ticket open for a month or two after deploying the patch in prod to monitor Eliot as described above (similar to Sven's approach for Bug 1865415). Would that be acceptable?
Comment 6 (Reporter) • 2 years ago
We don't have granular metrics for the stackwalker--all we have is "how long did it take to run?". I made some changes to how it runs last year that make it easier to generate granular metrics, but we'd need to do some additional work to get there. The stackwalker has two sources for symbols: the symbols cache on disk and Tecken. It doesn't pull symbols from anywhere else. I don't know what the hit/miss rate for the symbols cache on disk is.
Eliot can use other symbols providers, but it's currently set up like the stackwalker: a symbols cache on disk plus Tecken. Using the SYM download timings panel works for Eliot. If there was a change, we'd probably see it there.
Those sound like good things to look at. There might be other things.
Sometimes, I throw together an analysis dashboard in Grafana to pull things into one place. For example:
- socorro processor analysis: https://earthangel-b40313e5.influxcloud.net/d/FMW5L0e4k/socorro-processor-analysis-2023-08-09?orgId=1&refresh=1m
- tecken upload api metrics: https://earthangel-b40313e5.influxcloud.net/d/6gimTZ6Vz/tecken-upload-api-metrics?orgId=1
Comment 7 (Assignee) • 2 years ago
Thank you again Willkg. To summarize, my plans for this ticket are:
1. Proceed with Option 1: Remove caching entirely.
2. Prior to merging the patch for 1), create a temporary dashboard pulling together the following existing panels across our Crash Ingestion dashboards (and link to it in this bug):
   a. Socorro processor: stackwalker timings
   b. Tecken: SymbolDownloader Timings
   c. Tecken: Download requests (mean, 95, 99) (ms)
   d. Eliot: SYM download timings
3. Monitor these metrics in stage then prod for at least a month.
4. If there are no adverse effects outside of the expected bounds (~30 ms), proceed with removing Redis config/infrastructure and close this ticket. Else discuss next steps.
Comment 8 (Reporter) • 2 years ago
My gut feeling is that we probably don't need to monitor them for a month--I suspect we'd know within a few days whether removing the cache adversely affected things. If we don't know by then, the effect is probably negligible and indistinguishable from background radiation.
Comment 9 • 2 years ago
It's probably not relevant anymore, but I'd like to add another problem with per-instance in-memory caching: cache invalidation becomes really hard, since the instance that handles an upload would have to invalidate the caches on all other instances, so we'd need some kind of broadcast mechanism. Removing the cache entirely sounds good to me.
Comment 10 (Assignee) • 2 years ago
Thanks Willkg and Sven. I'll plan to monitor for a much shorter period of time, and that's a good point about cache invalidation becoming more difficult.
I've created the temporary "Bug 1759554: remove django-cache-memoize" Grafana dashboard to monitor the impact of this change with the above panels copied over (thanks Willkg for showing me an easy way to copy over the panels). I've put the dashboard in my personal bdanforth folder.
Comment 13 (Assignee) • 2 years ago
This is in stage, but willkg suggested we run the system tests in stage before deploying this to production. Those tests first need to be updated, so I'm picking up bug 1866796 before I test this patch in prod.
Comment 14 (Reporter) • 2 years ago
This was deployed just now in bug #1880345 in v2024.02.14.
I checked the Tecken App Metrics dashboard and it looks ok. The download timings have gone up, but that's expected.
I also checked the Socorro processor timing metrics. They don't seem noticeably different, so I think we're good.
Comment 15 (Assignee) • 2 years ago
Thanks Willkg for the deploy and for checking the dashboards.
Prior to the deploy, I updated the system tests in bug 1866796 (PR in review) and ran them against all envs (local, stage, and prod) with these changes, and they all passed.
I've also reviewed the panels, and they look okay to me as well (interestingly, the bug-specific dashboard I created is missing two panels and a few sections I had created...). I see no increases in latency, or only minor ones within the expected range (~25 ms). Marking this ticket as resolved.