Closed Bug 1480559 Opened 7 years ago Closed 7 years ago

Symbolication frequently takes more than 30s

Categories

(Tecken :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mstange, Unassigned)

Details

Attachments

(4 files)

Attached file curl command
I am seeing a problem where symbolication takes a long time if I haven't captured a profile in a while. I am seeing the following effect:

- I grab a profile and symbolication takes a long time, 30s+.
- I grab another profile and symbolication is fast.
- I don't grab a profile for 5-10 minutes or more.
- The next profile I grab has very slow symbolication again.

I am attaching an example curl command that seems to reproduce this behavior.
The blue line is the Datadog avg of the timings. The purple line is a moving average of 5. The yellow line is a moving average of 20. The Y-axis is in milliseconds.
Component: Symbols → Tecken Integration
The description smells like the whole system is struggling with overload.

Our Redis Store cache is (most likely) storing things for a very long time so we should have very little slowdown just due to S3 downloads.

We have roughly 10 symbolication requests per minute going on. The perf.html team estimates that they send roughly 1,500 requests per month, which means that 99.9% of the symbolication traffic comes from something else, most likely Telemetry symbolication of crashes.

There are many things we can do: measure, measure, measure. Perhaps we just need to beef up the infra, or perhaps there are some bottlenecks in the Python/boto code that can be solved.
I ran Markus's curl command through a little Python script 100 times, with a sleep of 5 seconds in between (from my home broadband). This time it was steady: https://gist.github.com/peterbe/e01ded90d183c955c81c3fdd99fad78c
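For reference, a rough sketch of what that measurement loop looks like (the endpoint path and the payload filename below are placeholders, not the exact script from the gist):

# Rough sketch of the repeated-request timing loop; the endpoint URL and
# the payload filename are placeholders, not the exact script from the gist.
import json
import time

import requests

URL = "https://symbols.mozilla.org/symbolicate/v5"  # placeholder endpoint
with open("symbolication-request.json") as f:  # body taken from the attached curl command
    payload = json.load(f)

timings_ms = []
for i in range(100):
    t0 = time.time()
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    elapsed_ms = (time.time() - t0) * 1000
    timings_ms.append(elapsed_ms)
    print(f"{i:3d}  {elapsed_ms:8.1f} ms")
    time.sleep(5)

timings_ms.sort()
print("median:", timings_ms[len(timings_ms) // 2], "ms")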
> Our Redis Store cache is (most likely) storing things for a very long time so we should have very little slowdown just due to S3 downloads.

Do we track the cache hit rate? I'd assume that not hitting the cache would lead to the significantly slower requests.
The dark-blue bars mean we have to go out to S3 to get the symbol file (raw), parse it, and stick it into the Redis Store. S3 can be slow sometimes, but that's very rare and doesn't explain these terrible numbers. Sure, the cache hits (getting stuff from ElastiCache) can be slow sometimes too, but that's also rare.
There are many things we can do, but first: measure, measure, measure.

For example, I'd like to expand the timer on the Symbolication Django view to see which parts within it are slow. The sum of those individual timers should roughly match the "Symbolication Django view timings" graph in Datadog (see the first screenshot attachment).

Another thing we could measure is the health of the EC2 nodes, and of the ElastiCache clusters (we have two). I'm going to investigate how we're doing on memory usage there.
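To sketch what "expanding the timer" could look like: per-section sub-timers inside the view, using a plain statsd-style client. The metric names and the placeholder bodies below are made up for illustration, not Tecken's actual code.

# Illustrative only: splitting the symbolication view timing into per-section
# sub-timers. Metric names and the placeholder logic are hypothetical.
from statsd import StatsClient

statsd = StatsClient()  # defaults to localhost:8125

def symbolicate(request_body):
    """Placeholder standing in for the real symbolication view."""
    with statsd.timer("tecken.symbolicate.parse"):
        jobs = request_body.get("jobs", [])  # stand-in for parsing the request
    with statsd.timer("tecken.symbolicate.cache_lookup"):
        cached = [job for job in jobs if job.get("cached")]  # stand-in for Redis Store lookups
    with statsd.timer("tecken.symbolicate.s3_download"):
        missing = [job for job in jobs if not job.get("cached")]  # stand-in for S3 fetch + parse
    with statsd.timer("tecken.symbolicate.respond"):
        return {"cached": len(cached), "downloaded": len(missing)}

The point is just that the sum of the sub-timers should account for the overall view timing, so whichever section dominates stands out in Datadog.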
Ben,
Can you share some rough stats about how many symbolication requests (and how many jobs within each) you do?
Flags: needinfo?(bewu)
Miles,
The Tecken Operations dashboard has a graph for ElastiCache evictions: https://app.datadoghq.com/dash/286319/tecken-operations?live=true&page=0&is_auto=false&from_ts=1532639565832&to_ts=1533244365832&tile_size=m&fullscreen=209697669

It seems we do a LOT of evictions. There's also this dashboard, https://app.datadoghq.com/screen/190509/tecken-redis?page=0&is_auto=false&from_ts=1533241140000&to_ts=1533244740000&live=true, but it's hard to decipher.

Can you take a look at how we're doing with memory on the Redis Store ElastiCache cluster? (A quick manual check is sketched below, after the needinfo.) All that red makes me wonder if we're filling up the cluster so much that our LRU eviction policy can't keep enough data. If that's the case we might need to increase the size of that cluster to one with a lot more memory.
Flags: needinfo?(miles)
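For a quick manual check outside the dashboards, something like this against the Redis Store endpoint shows memory, evictions, and hit rate directly (the host below is a placeholder for the ElastiCache endpoint):

# Quick manual check of memory pressure and evictions on the Redis Store.
# The host below is a placeholder for the real ElastiCache endpoint.
import redis

r = redis.Redis(host="tecken-redis-store.example.cache.amazonaws.com", port=6379)
info = r.info()

print("maxmemory_policy: ", info.get("maxmemory_policy"))
print("used_memory_human:", info.get("used_memory_human"))
print("maxmemory_human:  ", info.get("maxmemory_human"))
print("evicted_keys:     ", info.get("evicted_keys"))

hits = info.get("keyspace_hits", 0)
misses = info.get("keyspace_misses", 0)
if hits + misses:
    print("hit rate: %.1f%%" % (100.0 * hits / (hits + misses)))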
(In reply to Peter Bengtsson [:peterbe] from comment #7)
> Ben,
> Can you share some rough stats about how many symbolication requests (and
> how many jobs within each) you do?

Right now I'm making about 70,000 requests per day. At peak hours it gets to around 4,700 per hour. Every request is one job: since the crashes are being streamed, it's much easier to do one at a time.
Flags: needinfo?(bewu)
It looks like we are indeed filling up the store cache. I think the appropriate action is to increase the size of the store cache node, increasing the memory available to the cache. The rolling spikes of evictions also correspond to elevated CPU usage on the cache, another indication that we're underprovisioned.

It looks like it's possible to perform an instance upgrade without downtime [0]! So that's exciting. We're currently on one of the smaller node types available, a cache.m3.medium, which has 3.75GiB of RAM. I think making the jump to a memory-optimized cache.r4.large with 15.25GiB of RAM would help alleviate our issues. Here's some more info about available cache node types: [1] and [2]. (A sketch of the API call for the node-type change is below, after the needinfo.)

Brian: Does scaling the store node from cache.m3.medium => cache.r4.large sound ok to you?

[0]: https://aws.amazon.com/blogs/aws/elasticache-for-redis-update-upgrade-engines-and-scale-up/
[1]: https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/CacheNodes.SupportedTypes.html
[2]: https://ec2instances.info/?filter=r4.large
Flags: needinfo?(miles) → needinfo?(bpitts)
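For reference, the node-type change itself is a single ElastiCache API call. A minimal boto3 sketch follows; the cluster id and region are placeholders, and in practice we'd do this through our normal infra tooling rather than an ad-hoc script.

# Minimal sketch of scaling the Redis Store node type via the ElastiCache API.
# The cluster id and region are placeholders; this would normally go through
# our infra tooling, not a hand-run script.
import boto3

elasticache = boto3.client("elasticache", region_name="us-west-2")

elasticache.modify_cache_cluster(
    CacheClusterId="tecken-redis-store",  # placeholder cluster id
    CacheNodeType="cache.r4.large",       # target node type
    ApplyImmediately=True,                # apply now rather than at the next maintenance window
)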
I opened an issue a while back about making Redis better: bug 1474702. Based on my analysis at the time, I wasn't sure that evictions leading to cache misses were the issue. Looking at it again, I'm now more inclined to agree that Redis needs more memory.

Miles, I think we do need downtime, but we won't lose data because of the SAVE mentioned in your AWS blog post. If we're doing this, we should look into converting to multi-AZ so that maintenance in the future will require less downtime.
Flags: needinfo?(bpitts)
Oh, I should have explicitly mentioned that cache.m3.medium => cache.r4.large is fine with me.

And the dashboard I have for Redis is https://app.datadoghq.com/dash/857077?live=true&page=0&is_auto=false&from_ts=1533162434690&to_ts=1533248834690&tile_size=m
OK. Let's redirect our attention to https://bugzilla.mozilla.org/show_bug.cgi?id=1474702 and upgrade the Redis Store. That is a strong candidate for solving a large chunk of this.

Once that's done, we'll keep an eye on symbolication using this bug. I might also add some new Datadog timers to get better insight, going forward, into which parts of the whole Symbolication Django view request are taking up the time (prediction: cache misses on symbols, which trigger S3 GETs).
The upgrade in bug 1474702 happened, and symbolication seems to be consistently fast now.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED

Moving from Socorro product to Tecken product.

Component: Tecken → General
Product: Socorro → Tecken