processor memory usage high after upgrading sentry-sdk to 0.10.2
Categories
(Socorro :: Processor, defect, P3)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: jwhitlock)
References
Details
Attachments
(1 file: 81 bytes, text/plain)
+++ This bug was initially created as a clone of Bug #1571432 +++
In bug #1571432, memory usage in the processor nodes grew over time until they failed. The webapp nodes were fine. The crontabber process doesn't run long enough for us to tell whether it had the same problem.
When we downgraded the sentry-sdk library from 0.10.2 back to 0.9.5, the problem went away.
I spent a bunch of time trying to reproduce the issue with the assumption that the new tracing code had something that was growing unbounded--maybe tracing log entries or something like that. I was never able to reproduce the issue outside of running in production.
This issue covers figuring that out and/or upgrading sentry-sdk to a version that doesn't have problems.
Reporter
Comment 1 • 6 years ago
I should have written this up a month ago--that's my bad. Here's everything I noted:
I looked into this a bit between August 5th and August 7th. I wrote a test script to see if I could isolate the scenario that caused the issue, but was never able to reproduce it.
I diffed the code between 0.9.5 and 0.10.2 and the most significant change was the addition of the tracing code.
I thought maybe it was related to how we use sentry in the webapp vs. the processor. In the webapp, it's a wsgi middleware. In the processor, we call .capture_error() in specific places.
Maybe the hub that we create accretes some data over time that never gets freed up or truncated? Maybe breadcrumbs or tracing entries? Maybe something threading-related? John wondered if it was the new subprocess stuff in the default integration.
We are able to see the problem manifest in -stage, but it takes like 6 hours to be sure one way or the other. I was never able to reproduce it locally.
Reporter
Comment 2 • 6 years ago
Making this a P3. If someone has time/inclination, then it'd be a good thing to figure out. Otherwise, it can wait.
Assignee
Comment 3 • 6 years ago
I recall there were few sentry events during the high-memory period, so I doubt the issue is in our exception collection code. However, Sentry now collects breadcrumbs (and some tracing stuff with 0.10?) during execution, just in case there is a log or crash, so there is a way to leak or accumulate memory even without raising exceptions.
My subprocess guess was due to commits around https://github.com/getsentry/sentry-python/pull/449, "Fix a bunch of bugs in subprocess integration", which landed after 0.10.2 (in 0.11.0). We use subprocess to call the tools to symbolicate crash dumps. It is possible that the new logging code was busted around subprocess, and this change fixed it. Or, we send and receive a lot of data to a subprocess, and now there is tracing code that is holding on to it. I think this is mostly wishful thinking on my part that this subprocess fix may have fixed our issue as well.
I might try the tracemalloc trick from https://github.com/getsentry/sentry-python/issues/430, and see if I can reproduce the memory increase from 0.9.5 to 0.10.2, and if so, if it remains in 0.11.0. If not, we may just need to run in prod and be ready to revert.
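The tracemalloc approach from that issue can be sketched roughly like this; the list comprehension is just a stand-in for the real crash-processing workload:

```python
import tracemalloc

# Keep up to 25 stack frames per allocation traceback
tracemalloc.start(25)

# Stand-in workload; in practice this would be processing crashes
data = [str(i) * 100 for i in range(1000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")

# Print the ten largest allocation sites, grouped by file and line
for stat in top_stats[:10]:
    print(stat)

tracemalloc.stop()
```

Running this once per sentry-sdk version and comparing the top allocation sites is the kind of comparison described below.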
Reporter
Comment 4 • 6 years ago
We can see the problem in stage by watching a graph of memory usage of the processor nodes over time--we don't need to test this in prod.
Assignee
Comment 5 • 6 years ago
I ran the processor app under tracemalloc with various versions of sentry-sdk. tracemalloc keeps track of allocations, but not deallocations, so it can't be used to detect a leak directly. I processed 20 crashes each run (a different 20 each time), and closed the processor app once the logs started showing `QueuingThread - there is nothing to do. Sleeping for 7 seconds.` I didn't monitor closely while running. All of this is to say these numbers aren't good science, but they do show that memory behaviour changes in 0.10.2, and that 0.11.2 seems closer to 0.9.5.
Here's the top 10 allocations for sentry-sdk 0.9.5:
| Code | Size | Count | Average |
|---|---|---|---|
| `<frozen importlib._bootstrap_external>:487` | 8920 KiB | 88547 | 103 B |
| `<frozen importlib._bootstrap>:219` | 1710 KiB | 18544 | 94 B |
| /usr/local/lib/python3.6/linecache.py:137 | 512 KiB | 5033 | 104 B |
| /usr/local/lib/python3.6/json/decoder.py:355 | 496 KiB | 7303 | 70 B |
| /usr/local/lib/python3.6/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 | 408 KiB | 2566 | 163 B |
| /usr/local/lib/python3.6/abc.py:133 | 403 KiB | 1952 | 211 B |
| /app/socorro/lib/util.py:18 | 334 KiB | 1852 | 185 B |
| /app/socorro/external/boto/crashstorage.py:274 | 208 KiB | 1 | 208 KiB |
| /usr/local/lib/python3.6/_weakrefset.py:37 | 193 KiB | 1639 | 121 B |
| /usr/local/lib/python3.6/collections/__init__.py:429 | 168 KiB | 2232 | 77 B |
With sentry-sdk 0.10.2, configman's orderedset rises to become a top allocator:
| Code | Size | Count | Average |
|---|---|---|---|
| `<frozen importlib._bootstrap_external>:487` | 7643 KiB | 88573 | 88 B |
| /usr/local/lib/python3.6/site-packages/configman/orderedset.py:44 | 3978 KiB | 93056 | 44 B |
| `<frozen importlib._bootstrap>:219` | 1704 KiB | 18476 | 94 B |
| /usr/local/lib/python3.6/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 | 1688 KiB | 2566 | 674 B |
| /usr/local/lib/python3.6/json/decoder.py:355 | 868 KiB | 12729 | 70 B |
| /usr/local/lib/python3.6/site-packages/configman/orderedset.py:29 | 709 KiB | 13012 | 56 B |
| /usr/local/lib/python3.6/site-packages/boto/s3/key.py:1799 | 640 KiB | 2 | 320 KiB |
| /app/socorro/lib/util.py:18 | 621 KiB | 3488 | 182 B |
| /usr/local/lib/python3.6/linecache.py:137 | 512 KiB | 5033 | 104 B |
| /app/socorro/external/boto/crashstorage.py:274 | 415 KiB | 2 | 207 KiB |
With sentry-sdk 0.11.2, configman's orderedset is no longer in the top 10, and the profile looks more like sentry-sdk 0.9.5 to me:
| Code | Size | Count | Average |
|---|---|---|---|
| `<frozen importlib._bootstrap_external>:487` | 7643 KiB | 88575 | 88 B |
| `<frozen importlib._bootstrap>:219` | 1709 KiB | 18519 | 94 B |
| /usr/local/lib/python3.6/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 | 1688 KiB | 2566 | 674 B |
| /usr/local/lib/python3.6/json/decoder.py:355 | 764 KiB | 10413 | 75 B |
| /usr/local/lib/python3.6/linecache.py:137 | 512 KiB | 5033 | 104 B |
| /app/socorro/lib/util.py:18 | 504 KiB | 2891 | 179 B |
| /usr/local/lib/python3.6/abc.py:133 | 404 KiB | 1951 | 212 B |
| /usr/local/lib/python3.6/_weakrefset.py:37 | 195 KiB | 1650 | 121 B |
| /usr/local/lib/python3.6/collections/__init__.py:429 | 172 KiB | 2290 | 77 B |
| /usr/local/lib/python3.6/_weakrefset.py:38 | 160 KiB | 1677 | 98 B |
Again, this isn't proof, but I have some confidence that 0.11.2 could fix the memory issue we saw in 0.10.2, and we should try in stage. And maybe get rid of configman.
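For what it's worth, tracemalloc can also diff two snapshots directly, which would make a version-to-version comparison like the one above less manual. A minimal sketch, with a toy allocation standing in for the suspected growth between snapshots:

```python
import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()
# Stand-in for suspected growth (e.g. processing a batch of crashes)
grown = ["x" * 1000 for _ in range(1000)]
after = tracemalloc.take_snapshot()

# Entries with a positive size_diff are allocation sites that grew
# between the two snapshots
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)

tracemalloc.stop()
```

Taking a snapshot after each batch of crashes and diffing against the previous one would show which allocation sites keep growing.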
Assignee
Comment 6 • 6 years ago
Assignee
Comment 7 • 6 years ago
Assignee
Comment 8 • 6 years ago
This has been deployed to production, and memory use appears normal.