remove django-cache-memoize
Categories: Tecken :: General, task, P2
People: Reporter: willkg; Assigned: bdanforth
References: Blocks 1 open bug
Attachments: 1 file
Python has a memoization decorator, functools.lru_cache, that you can use to wrap a function so that it caches the arguments -> return value mapping. Future calls of the function with the same arguments pull from the cache.
Tecken uses a cache_memoize decorator that does roughly the same thing, except that it uses Django's caching system, so calls are cached across the cluster.
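A minimal sketch of the difference (the function names here are hypothetical, not Tecken code):

```python
import functools

from cache_memoize import cache_memoize  # provided by django-cache-memoize


def expensive_lookup(key):
    ...  # stand-in for real work, e.g. a network request


@functools.lru_cache(maxsize=1024)
def lookup_local(key):
    # Cached in this process's memory only; every instance warms its own cache.
    return expensive_lookup(key)


@cache_memoize(60 * 60 * 6)  # 6-hour TTL in the Django cache
def lookup_shared(key):
    # Cached in the configured Django cache backend (e.g. Redis),
    # so all instances in the cluster share the cached results.
    return expensive_lookup(key)
```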
It kind of looks like that library is unmaintained now. Further, I claim that the caching is being done in the wrong place: instead of being done in the download/upload code, it should be done in the storage code, with each storage item in charge of its own caching.
One big advantage of this is that it makes it much easier to invalidate the correct cache entries when the need arises.
Comment 1 (Reporter) • 2 years ago
Grabbing this because it blocks refactoring the object storage interface.
Comment 2 (Reporter) • 2 years ago
Bug 1564452 will cover redoing the storage API layer. If that work drops this, then that's cool, but this doesn't need to block that bug. Further, I'm not going to work on it any time soon, so I'm unassigning myself.
Comment 3 (Assignee) • 2 years ago
:willkg: I paired with Sven on this earlier this morning, and based on our findings so far we have a proposal we'd like your opinion on. What do you think about the options below and the recommendation?
TL;DR: For various reasons, the built-in lru_cache from functools is not a good replacement for the django-cache-memoize library. Can we instead forgo caching entirely in this case to simplify the code and infrastructure (i.e. remove Redis)? We estimate an average increase of about 30 ms per download request roundtrip (which is typically on the order of hundreds of ms) [1].
Why one cannot simply replace cache_memoize with lru_cache
- cache_memoize uses the Django cache layer with a Redis backend, which is shared across Tecken instances. By contrast, lru_cache is an in-memory cache that is per-instance. This means we would have less caching overall, as the same request isn't guaranteed to go to the same instance. Currently about 75% of requests are cache hits [2].
- Currently, we cache everything for 6 hours and invalidate individual keys on file upload [3].
- lru_cache does not provide a way to invalidate a single item in the cache. You can only specify the maxsize (i.e. the max number of items in the cache) and clear the entire cache.
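To make the invalidation difference concrete, here's a minimal sketch; the function and bucket/key names are hypothetical, and it assumes django-cache-memoize's documented .invalidate() helper:

```python
import functools

from cache_memoize import cache_memoize  # provided by django-cache-memoize


@cache_memoize(60 * 60 * 6)  # shared Django cache, 6-hour TTL
def file_exists(bucket, key):
    ...  # e.g. an S3 HEAD request


# Per-key invalidation: drop just this entry when the file is re-uploaded.
file_exists.invalidate("my-bucket", "some/file.sym")


@functools.lru_cache(maxsize=4096)
def file_exists_local(bucket, key):
    ...  # same check, but cached per process


# lru_cache offers no per-key invalidation, only wholesale clearing.
file_exists_local.cache_clear()
```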
Options
Option 1: Remove caching entirely
PROs
- Removes unmaintained library
- Less complexity in the code
- Less complexity in our infrastructure (we could remove Redis)
- No need for cache hit/cache miss metrics
CONs
- Download requests are ~30ms slower
Option 2: Replace cache_memoize with Python lru_cache
PROs
- Removes unmaintained library
- Some caching still occurs
- Less complexity in our infrastructure (we could remove Redis)
CONs
- Less caching than before, since the cache is per-instance
- No control over how long items are cached for
- Cannot invalidate an individual item
- No cache hit/cache miss metrics out of the box
Option 3: Write our own wrapper around django_redis.cache.RedisCache
PROs
- Removes unmaintained library
- Same level of caching as before
CONs
- Need to write/maintain our own wrapper code
- No cache hit/cache miss metrics out of the box
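For concreteness, a rough sketch of what such a wrapper could look like, using Django's standard cache API; the key scheme and names are assumptions, not actual Tecken code:

```python
import functools
import hashlib

from django.core.cache import cache  # backed by django_redis.cache.RedisCache

_MISS = object()  # local sentinel so cached None/falsy values still count as hits


def memoize(timeout):
    """Cache a function's return value in the Django cache for `timeout` seconds."""

    def decorator(func):
        def make_key(args, kwargs):
            raw = f"{func.__module__}.{func.__qualname__}:{args!r}:{kwargs!r}"
            return "memoize:" + hashlib.md5(raw.encode("utf-8")).hexdigest()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = make_key(args, kwargs)
            result = cache.get(key, _MISS)
            if result is _MISS:
                result = func(*args, **kwargs)
                cache.set(key, result, timeout)
            return result

        # Per-key invalidation, mirroring django-cache-memoize's .invalidate().
        wrapper.invalidate = lambda *a, **kw: cache.delete(make_key(a, kw))
        return wrapper

    return decorator
```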
Recommendation: Option 1
The simplest solution is to remove the library without replacement. Download requests will be slightly slower, but we could remove this code, all the Redis infrastructure code, any cache monitoring panels/alerts, and any further maintenance burden.
[1]: This decorator is used only for the HEAD request to S3 that checks an object exists for symbols download and upload, which is one of multiple requests made as part of a download or upload request. The 30 ms estimate follows from the mean SymbolDownloader timing of 40 ms and the fact that 75% of these S3 HEAD requests are currently cache hits (see [2]): the requests that used to skip the HEAD call will now pay for it, for an expected average increase of roughly 0.75 × 40 ms = 30 ms.
[2]: Based on the SymbolDownloader Cache Hits panel in the Tecken App Metrics dashboard in Grafana
[3]: Based on its use in the get_last_modified function; the value of SYMBOLDOWNLOAD_EXISTS_TTL_SECONDS is set in settings.py. Based on get_file_md5_hash, we invalidate a key on upload.
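For context, the cached call is essentially an S3 existence check. A minimal sketch with boto3 (bucket/key names are hypothetical; Tecken's actual code differs):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def object_exists(bucket: str, key: str) -> bool:
    """Return True if the object exists, via a single HEAD request (~tens of ms)."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as exc:
        # head_object reports a missing key as HTTP 404.
        if exc.response["Error"]["Code"] == "404":
            return False
        raise
```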
Comment 4 (Reporter) • 2 years ago
Interesting! When I wrote up this bug, I was thinking I'd remove the cache-memoize code and replace it with caching in the storage backend, because that seemed straightforward to implement and maintain and the download API is heavily used.
Based on this analysis, option 1 seems fine. Eliot and the Socorro processor both use the Tecken download API and would be affected. My gut feeling is that Socorro processor throughput won't change much because (if I recall correctly) the stackwalker parallelizes symbol file downloads. Eliot symbolication API might be a different story.
We should build a list of things to keep an eye on so we know if removing the cache degraded something more than we would like.
Comment 5 (Assignee) • 2 years ago
Thank you Willkg for your thoughts. FYI, I added some references to my previous comment to help explain what I'm basing the data/estimates on.
> We should build a list of things to keep an eye on so we know if removing the cache degraded something more than we would like.
For Eliot, we could monitor the existing SYM download timings panel in the Eliot App Metrics dashboard. For the last 90 days, I estimate the mean download timing to be around 250 ms. However, IIUC Eliot can download SYM files from sources other than Tecken, and I don't see that this metric can currently be broken down by source. Is there another way to determine symbols download timings with Tecken as the source? If not, should this dimension be added to the existing metric?
For the Socorro processor, I'm less certain that there's an existing metric we could watch, though from reading the processor docs (and looking at the code and the stackwalker's code), the stackwalker does cache fully downloaded SYM files. I saw that there's a Stackwalker timings panel in the Socorro Prod App Metrics dashboard, though I imagine those timings measure the symbolication process in its entirety, and it would be difficult to suss out the component related to downloading a SYM file.
> My gut feeling is that Socorro processor throughput won't change much because (if I recall correctly) the stackwalker parallelizes symbol file downloads. Eliot symbolication API might be a different story.
I can leave this ticket open for a month or two after deploying the patch in prod to monitor Eliot as described above (similar to Sven's approach for Bug 1865415). Would that be acceptable?
Comment 6 (Reporter) • 2 years ago
We don't have granular metrics for the stackwalker--all we have is "how long did it take to run?". I made some changes to how it runs last year that make it easier to generate granular metrics, but we'd need to do some additional work to get there. The stackwalker has two sources for symbols: the symbols cache on disk and Tecken. It doesn't pull symbols from anywhere else. I don't know what the hit/miss rate for the symbols cache on disk is.
Eliot can use other symbols providers, but it's currently set up like the stackwalker: a symbols cache on disk plus Tecken. Using the SYM download timings panel works for Eliot. If there was a change, we'd probably see it there.
Those sound like good things to look at. There might be other things.
Sometimes, I throw together an analysis dashboard in Grafana to pull things into one place. For example:
- socorro processor analysis: https://earthangel-b40313e5.influxcloud.net/d/FMW5L0e4k/socorro-processor-analysis-2023-08-09?orgId=1&refresh=1m
- tecken upload api metrics: https://earthangel-b40313e5.influxcloud.net/d/6gimTZ6Vz/tecken-upload-api-metrics?orgId=1
Comment 7 (Assignee) • 2 years ago
Thank you again Willkg. To summarize, my plans for this ticket are:
1. Proceed with Option 1: Remove caching entirely.
2. Prior to merging the patch for 1), create a temporary dashboard pulling together the following existing panels across our Crash Ingestion dashboards (and link to it in this bug):
   a. Socorro processor: stackwalker timings
   b. Tecken: SymbolDownloader Timings
   c. Tecken: Download requests (mean, 95, 99) (ms)
   d. Eliot: SYM download timings
3. Monitor these metrics in stage then prod for at least a month.
4. If there are no adverse effects outside of the expected bounds (~30 ms), proceed with removing Redis config/infrastructure and close this ticket. Else discuss next steps.
Comment 8 (Reporter) • 2 years ago
My gut feeling is that we probably don't need to monitor them for a month--I suspect we'd know within a few days whether removing the cache adversely affected things. If we don't know by then, the effect is probably negligible and indistinguishable from background radiation.
Comment 9 • 2 years ago
It's probably not relevant anymore, but I'd like to add another problem with per-instance in-memory caching: cache invalidation becomes really hard, since the instance that handles an upload would have to invalidate the caches on all other instances, so we'd need some kind of broadcast mechanism. Removing the cache entirely sounds good to me.
Comment 10 (Assignee) • 2 years ago
Thanks Willkg and Sven. I'll plan to monitor for a much shorter period of time, and that's a good point about cache invalidation becoming more difficult.
I've created the temporary "Bug 1759554: remove django-cache-memoize" Grafana dashboard to monitor the impact of this change with the above panels copied over (thanks Willkg for showing me an easy way to copy over the panels). I've put the dashboard in my personal bdanforth folder.
Comment 13 (Assignee) • 2 years ago
This is in stage, but willkg suggested we run the system tests in stage before deploying this to production. Those tests first need to be updated, so I'm picking up bug 1866796 before I test this patch in prod.
Comment 14 (Reporter) • 2 years ago
This was deployed just now in bug #1880345 in v2024.02.14.
I checked the Tecken App Metrics dashboard and it looks ok. The download timings have gone up, but that's expected.
I also checked the Socorro processor timing metrics. They don't seem noticeably different, so I think we're good.
Comment 15 (Assignee) • 2 years ago
Thanks Willkg for the deploy and for checking the dashboards.
Prior to the deploy, I updated the system tests in bug 1866796 (PR in review) and ran them against all envs (local, stage, and prod) with these changes, and they all passed.
I've also reviewed the panels, and they look okay to me as well (interestingly, the bug-specific dashboard I created is missing two panels and a few sections I had created...). I see no increases in latency, or only minor ones within the expected range (~25 ms). Marking this ticket as resolved.