Closed
Bug 1426225
Opened 6 years ago
Closed 6 years ago
[ops infra socorro] no space on device for processor nodes
Categories
(Socorro :: Infra, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: willkg, Unassigned)
Attachments
(1 file)
We're seeing these on the processor nodes:

IOError: [Errno 28] No space left on device: '...'

Two things:

1. It looks like the processors are running out of space pretty fast after launch. Are they cleaning up after themselves? Is there a second process that should be running and isn't?
2. These errors are not getting to Sentry, so we would have no idea this was happening except by accident.

This bug covers figuring this out and spinning off other bugs, fixing things, or whatever.
Comment 1 • 6 years ago (Reporter)
We need to fix this before -stage-new is viable, so I'm making this a P1.
Priority: -- → P1
Comment 2 • 6 years ago (Reporter)
Miles noted the following (roughly paraphrased):
> There's 22G in /data. 931 files--100 of them are *.dump files. /data/symbols/cache is 9.6G, /data/symbols/tmp is 12G.
Looks like -stage has an 8 GB root node, but has a 148 GB volume attached.
Symbols files get cleaned up by the companion process, but .dump files get cleaned up by the processor after processing the crash.
We're going to tackle this on a few fronts:
1. We should be capturing crash fetching errors and sending them to Sentry. I'm going to look into fixing that.
2. We probably need more space for symbols caching. Miles is working on that.
3. 100 dump files seems fishy. Maybe the processor isn't cleaning up .dump files in some circumstances? Will will look into that.
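For item 3, one way to check whether .dump files are being left behind is to list the ones that are older than any crash should take to process. This is only a rough sketch: `find_stale_dumps` and the age threshold are hypothetical names chosen for illustration, not part of the Socorro processor.

```python
import os
import time


def find_stale_dumps(root, max_age_seconds, now=None):
    """Return sorted (path, size_bytes) pairs for *.dump files under root
    whose mtime is older than max_age_seconds.

    Hypothetical helper for illustration; not part of Socorro.
    """
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".dump"):
                continue
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # The file vanished between listing and stat, for example
                # because the processor cleaned it up in the meantime.
                continue
            if now - st.st_mtime > max_age_seconds:
                stale.append((path, st.st_size))
    return sorted(stale)
```

Running something like `find_stale_dumps("/data", 3600)` on a processor node would show whether the .dump files in /data are in-flight crashes or leftovers that never got cleaned up.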
Comment 3 • 6 years ago (Reporter)
Comment 4 • 6 years ago
On my end, I've added a 150 GB EBS volume to the processor nodes, mounted at /data (which is where the processor saves files). This mirrors how the -stage processor is configured.
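To notice a filling volume before it hits Errno 28, a free-space check on the mount could look roughly like the sketch below. The function names and the 10% threshold are assumptions made for illustration; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def free_space_fraction(path):
    """Return the fraction (0.0 to 1.0) of free space on the filesystem
    that contains path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a total of 0; treat them as full.
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.10):
    """Return True if less than min_free_fraction of the filesystem is free.

    The 10% default is an arbitrary illustrative threshold.
    """
    return free_space_fraction(path) < min_free_fraction
```

For example, `is_low_on_space("/data")` could be polled by whatever monitoring is already in place on the nodes.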
Comment 5 • 6 years ago
Commits pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/8ff73900b78b70df41743ea5aaac3184761272f0
bug 1426225 - capture fetch raw crash errors

The part of the processor that fetches raw crash metadata and dumps and then stores bits on disk could fail and when it did, it'd log something, but unless someone was watching the logs at the time, no one would ever know. This fixes that in the same way we capture other processing errors--it sends any errors to sentry assuming sentry is configured.

https://github.com/mozilla-services/socorro/commit/8b009eb5ad07f3cd1956b3d858a3388a708b5814
Merge pull request #4267 from willkg/1426225-no-space

bug 1426225 - capture fetch raw crash errors
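The general shape of the fix described in that commit message can be sketched as follows. This is not the code from the commit: `fetch_with_reporting`, `fetch`, and `report_error` are hypothetical names, with `report_error` standing in for whatever Sentry client call is configured, and re-raising after reporting is an assumption about the desired behavior.

```python
import logging

logger = logging.getLogger(__name__)


def fetch_with_reporting(fetch, crash_id, report_error):
    """Call fetch(crash_id); if it fails, log and report the exception,
    then re-raise it.

    fetch and report_error are hypothetical callables for illustration.
    report_error(exc, extra) stands in for a call to an error-reporting
    service such as Sentry.
    """
    try:
        return fetch(crash_id)
    except Exception as exc:
        # Keep the existing behavior of logging the failure.
        logger.exception("error fetching raw crash %s", crash_id)
        try:
            report_error(exc, {"crash_id": crash_id})
        except Exception:
            # A failure in the reporter must not hide the original error.
            logger.exception("failed to report error for %s", crash_id)
        raise
```

With this shape, an `IOError: [Errno 28]` raised while saving dumps to disk would reach the reporter as well as the logs, instead of only the logs.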
Updated • 6 years ago (Reporter)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED