Closed Bug 1578504 Opened 6 years ago Closed 6 years ago

processor memory usage high after upgrading sentry-sdk to 0.10.2

Categories

(Socorro :: Processor, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: jwhitlock)

References

Details

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1571432 +++

In bug #1571432, memory usage grew on the processor nodes over time until they failed. The webapp nodes were fine. The crontabber process doesn't run long enough to tell whether it had the same problem.

When we downgraded the sentry-sdk library from 0.10.2 back to 0.9.5, the problem went away.

I spent a bunch of time trying to reproduce the issue with the assumption that the new tracing code had something that was growing unbounded--maybe tracing log entries or something like that. I was never able to reproduce the issue outside of running in production.

This issue covers figuring that out and/or upgrading sentry-sdk to a version that doesn't have problems.

I should have written this up a month ago--that's my bad. Here's everything I noted:

I looked into this a bit between August 5th and August 7th. I wrote a test script to see if I could isolate the scenario that caused the issue, but was never able to reproduce it.

I diffed the code between 0.9.5 and 0.10.2 and the most significant change was the addition of the tracing code.

I thought maybe it was related to how we use Sentry in the webapp vs. the processor. In the webapp, it's installed as WSGI middleware; in the processor, we call .capture_error() in specific places.

Maybe the hub that we create accretes some data over time that never gets freed up or truncated? Maybe breadcrumbs or tracing entries? Maybe something threading-related? John wondered if it was the new subprocess stuff in the default integration.

We are able to see the problem manifest in -stage, but it takes like 6 hours to be sure one way or the other. I was never able to reproduce it locally.

Making this a P3. If someone has time/inclination, then it'd be a good thing to figure out. Otherwise, it can wait.

Priority: -- → P3

I recall there were few Sentry events during the high-memory period, so I doubt the issue is in our exception collection code. However, Sentry now collects breadcrumbs (and, with 0.10, some tracing data) during execution, just in case there is a log entry or crash to report, so there is a way for memory to leak or accumulate even without any exceptions being raised.
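
The breadcrumb hypothesis can be illustrated with a small stdlib-only sketch (the variable names here are hypothetical, not sentry-sdk's actual internals): an unbounded store grows with every logged event even when nothing is ever sent to Sentry, while a `collections.deque` with `maxlen` stays capped.

```python
from collections import deque

# Hypothetical breadcrumb stores. If the SDK's per-scope store were
# unbounded, memory would grow with every log call even when no
# exception is ever reported.
unbounded = []               # grows forever
bounded = deque(maxlen=100)  # what a fixed breadcrumb cap looks like

for i in range(10_000):
    crumb = {"message": f"processed crash {i}", "category": "processor"}
    unbounded.append(crumb)
    bounded.append(crumb)  # oldest crumbs are evicted automatically

print(len(unbounded))  # 10000 -- accumulates without bound
print(len(bounded))    # 100  -- capped
```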

My subprocess guess was due to commits around https://github.com/getsentry/sentry-python/pull/449, "Fix a bunch of bugs in subprocess integration", which landed after 0.10.2 (in 0.11.0). We use subprocess to call the tools that symbolicate crash dumps. It is possible that the new logging code was broken around subprocess and this change fixed it. Or: we send and receive a lot of data through a subprocess, and now there is tracing code holding on to it. I think it's mostly wishful thinking on my part that this subprocess fix addressed our issue as well.
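
For reference, the pattern in question looks roughly like this (the actual symbolication tools and arguments are not shown; this stand-in just pipes data through a child process, which is the call path the SDK's subprocess integration hooks):

```python
import subprocess
import sys

# Hypothetical stand-in for the symbolication tool the processor shells
# out to; here the child is just the Python interpreter echoing stdin.
proc = subprocess.run(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"],
    input="fake minidump bytes",
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)
print(proc.stdout)
```

If tracing instrumentation wraps each such call and retains references to the data flowing through it, memory use would scale with crash volume, which matches the slow growth seen on the processor nodes.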

I might try the tracemalloc trick from https://github.com/getsentry/sentry-python/issues/430, see if I can reproduce the memory increase from 0.9.5 to 0.10.2, and if so, check whether it remains in 0.11.0. If not, we may just need to run in prod and be ready to revert.

We can see the problem in stage by watching a graph of memory usage of the processor nodes over time--we don't need to test this in prod.

I ran the processor app with tracemalloc under several versions of sentry-sdk. A single tracemalloc snapshot only shows allocations that are still live at that moment; without comparing snapshots over a long run, it can't confirm a leak on its own. For each run I processed 20 crashes (a different 20 each time) and closed the processor app once I started seeing "QueuingThread - there is nothing to do. Sleeping for 7 seconds." in the logs. I didn't monitor closely while running. All of this is to say these numbers aren't good science, but they show that memory behaviour changes in 0.10.2, and that 0.11.2 seems closer to 0.9.5.
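
The harness for this kind of measurement looks roughly like the following (a minimal stdlib sketch, not the actual test script): start tracing, do the work, take a snapshot, and print the top allocation sites grouped by source line.

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation traceback

# Stand-in workload; the real runs processed 20 crashes per version.
data = [b"x" * 10_000 for _ in range(100)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")[:10]
for stat in top:
    frame = stat.traceback[0]
    print(f"{frame.filename}:{frame.lineno} "
          f"{stat.size / 1024:.0f} KiB count={stat.count}")
```

Comparing two snapshots with `snapshot.compare_to(earlier, "lineno")` shows growth between two points in time, which is the closest tracemalloc gets to leak detection.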

Here's the top 10 allocations for sentry-sdk 0.9.5:

Code Size Count Average
<frozen importlib._bootstrap_external>:487 8920 KiB 88547 103 B
<frozen importlib._bootstrap>:219 1710 KiB 18544 94 B
/usr/local/lib/python3.6/linecache.py:137 512 KiB 5033 104 B
/usr/local/lib/python3.6/json/decoder.py:355 496 KiB 7303 70 B
/usr/local/lib/python3.6/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 408 KiB 2566 163 B
/usr/local/lib/python3.6/abc.py:133 403 KiB 1952 211 B
/app/socorro/lib/util.py:18 334 KiB 1852 185 B
/app/socorro/external/boto/crashstorage.py:274 208 KiB 1 208 KiB
/usr/local/lib/python3.6/_weakrefset.py:37 193 KiB 1639 121 B
/usr/local/lib/python3.6/collections/__init__.py:429 168 KiB 2232 77 B

With sentry-sdk 0.10.2, the orderedset from configman rises to become a top allocator:

Code Size Count Average
<frozen importlib._bootstrap_external>:487 7643 KiB 88573 88 B
/usr/local/lib/python3.6/site-packages/configman/orderedset.py:44 3978 KiB 93056 44 B
<frozen importlib._bootstrap>:219 1704 KiB 18476 94 B
/usr/local/lib/python3.6/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 1688 KiB 2566 674 B
/usr/local/lib/python3.6/json/decoder.py:355 868 KiB 12729 70 B
/usr/local/lib/python3.6/site-packages/configman/orderedset.py:29 709 KiB 13012 56 B
/usr/local/lib/python3.6/site-packages/boto/s3/key.py:1799 640 KiB 2 320 KiB
/app/socorro/lib/util.py:18 621 KiB 3488 182 B
/usr/local/lib/python3.6/linecache.py:137 512 KiB 5033 104 B
/app/socorro/external/boto/crashstorage.py:274 415 KiB 2 207 KiB

With sentry-sdk 0.11.2, configman's orderedset is no longer in the top 10, and the profile looks more like sentry-sdk 0.9.5 to me:

Code Size Count Average
<frozen importlib._bootstrap_external>:487 7643 KiB 88575 88 B
<frozen importlib._bootstrap>:219 1709 KiB 18519 94 B
/usr/local/lib/python3.6/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 1688 KiB 2566 674 B
/usr/local/lib/python3.6/json/decoder.py:355 764 KiB 10413 75 B
/usr/local/lib/python3.6/linecache.py:137 512 KiB 5033 104 B
/app/socorro/lib/util.py:18 504 KiB 2891 179 B
/usr/local/lib/python3.6/abc.py:133 404 KiB 1951 212 B
/usr/local/lib/python3.6/_weakrefset.py:37 195 KiB 1650 121 B
/usr/local/lib/python3.6/collections/__init__.py:429 172 KiB 2290 77 B
/usr/local/lib/python3.6/_weakrefset.py:38 160 KiB 1677 98 B

Again, this isn't proof, but I have some confidence that 0.11.2 could fix the memory issue we saw in 0.10.2, and we should try in stage. And maybe get rid of configman.

Assignee: nobody → jwhitlock
Status: NEW → ASSIGNED

This has been deployed to production, and memory use appears normal.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED