Closed Bug 1426225 Opened 6 years ago Closed 6 years ago

[ops infra socorro] no space on device for processor nodes

Categories

(Socorro :: Infra, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Unassigned)

Attachments

(1 file)

We're seeing these on the processor nodes:

IOError: [Errno 28] No space left on device: '...'

Two things:

1. Looks like the processors are running out of space pretty fast after launch. Are they cleaning up after themselves? Is there a second process that should be running and isn't?

2. These errors are not getting to Sentry, so we would have no idea this was happening except by accident.

This bug covers figuring this out and spinning off other bugs, fixing things, or whatever.
We need to fix this before -stage-new is viable, so I'm making this a P1.
Priority: -- → P1
Miles noted the following (roughly paraphrased):

> There's 22G in /data. 931 files--100 of them are *.dump files. /data/symbols/cache is 9.6G, /data/symbols/tmp is 12G.

Looks like -stage has an 8GB root node, but has a 148GB volume attached.

Symbols files get cleaned up by the companion process, but .dump files get cleaned up by the processor after processing the crash.
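
For anyone wanting to re-check numbers like the ones Miles reported, here is a minimal sketch that totals file sizes and counts *.dump files under a directory. The function name `summarize_tree` is made up for illustration and is not part of Socorro. It sums apparent file sizes, so the totals can differ a bit from what `du` reports.

```python
import os


def summarize_tree(root):
    """Return (total_bytes, file_count, dump_count) for everything under root."""
    total_bytes = 0
    file_count = 0
    dump_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so symlinks are counted as links, not as their targets
                total_bytes += os.lstat(path).st_size
            except OSError:
                # File vanished between listing and stat; skip it.
                continue
            file_count += 1
            if name.endswith(".dump"):
                dump_count += 1
    return total_bytes, file_count, dump_count


if __name__ == "__main__":
    # Paths taken from the comment above; a missing path just reports zeros.
    for path in ("/data", "/data/symbols/cache", "/data/symbols/tmp"):
        size, files, dumps = summarize_tree(path)
        print(f"{path}: {size / 2**30:.1f}G in {files} files ({dumps} *.dump)")
```

Running this periodically on a processor node would show whether the .dump count keeps growing or stays roughly flat.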


We're going to tackle this on a few fronts:

1. We should be capturing crash fetching errors and sending them to Sentry. I'm going to look into fixing that.

2. We probably need more space for symbols caching. Miles is working on that.

3. 100 dump files seems fishy. Maybe the processor isn't cleaning up .dump files in some circumstances? Will will look into that.
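
One way item 3 could happen is if cleanup only runs on the success path. The sketch below is not the processor's actual code. It only illustrates the general pattern of removing a temporary .dump file in a `finally` block so it gets deleted whether or not processing fails. The names `process_with_cleanup`, `process`, and `tmp_dir` are hypothetical.

```python
import os
import tempfile


def process_with_cleanup(dump_bytes, process, tmp_dir=None):
    """Write a dump to a temporary .dump file, process it, and always remove it.

    `process` is a hypothetical callable that takes the file path.
    """
    fd, path = tempfile.mkstemp(suffix=".dump", dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as fp:
            fp.write(dump_bytes)
        return process(path)
    finally:
        # Runs on success and on failure, so a crash that fails processing
        # does not leave its .dump file behind.
        try:
            os.remove(path)
        except FileNotFoundError:
            pass
```

If the real cleanup sits after the processing call rather than in a `finally`, any exception in between would leave the file on disk, which would match the leftover files seen here.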
On my end, I've added a 150GB EBS volume to the processor nodes mounted at /data (which the processor saves to). This mirrors how the -stage processor is configured.
Commits pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/8ff73900b78b70df41743ea5aaac3184761272f0
bug 1426225 - capture fetch raw crash errors

The part of the processor that fetches raw crash metadata and dumps and
then stores bits on disk could fail and when it did, it'd log something, but
unless someone was watching the logs at the time, no one would ever know.

This fixes that in the same way we capture other processing errors--it
sends any errors to sentry assuming sentry is configured.

https://github.com/mozilla-services/socorro/commit/8b009eb5ad07f3cd1956b3d858a3388a708b5814
Merge pull request #4267 from willkg/1426225-no-space

bug 1426225 - capture fetch raw crash errors
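
For readers who don't want to dig through the diff, the idea in the commit message can be sketched roughly like this. This is not the code from the commit. `fetch_raw_crash`, `fetch`, and `capture_error` are hypothetical names, and re-raising after reporting is a choice made for this sketch. In a real deployment `capture_error` might be something like `sentry_sdk.capture_exception`, and passing `None` stands in for "Sentry is not configured".

```python
import logging

logger = logging.getLogger(__name__)


def fetch_raw_crash(crash_id, fetch, capture_error=None):
    """Fetch a raw crash; log and report any failure, then re-raise.

    `fetch` and `capture_error` are hypothetical callables supplied by the
    caller; `capture_error=None` means no error reporting is configured.
    """
    try:
        return fetch(crash_id)
    except Exception as exc:
        # Still log, as before, so the error shows up in the node's logs.
        logger.exception("error fetching raw crash %s", crash_id)
        if capture_error is not None:
            try:
                capture_error(exc)
            except Exception:
                # A reporting failure should not hide the original error.
                logger.exception("error reporting fetch failure")
        raise
```

With something like this in place, an `IOError: [Errno 28] No space left on device` raised while storing dumps would reach the error reporter instead of only the logs.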
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED