stage submitter ran out of disk and is sadmad



(Reporter: willkg, Assigned: willkg)




About 20 minutes ago, the stage-submitter stopped submitting crashes to the -stage collector. I logged into the node and looked around and discovered it had run out of disk.

I looked for "big files" and "things that aren't right" and didn't see anything wild. I suspect that the node just isn't sized right disk-wise. Looks like it's got 8gb of disk.
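
To be concrete about what "looking around" means here, it was roughly this kind of thing (the paths are illustrative, not exactly what I ran):

df -h
# biggest directories near the root of the filesystem
sudo du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -n 20
# biggest files in the log directory
ls -lhS /var/log/ | head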

This bug covers figuring out what to do and doing it.
Some observations:

1. socorro-submitter.log files are about 150mb -> 180mb or so.
2. Log rotation is set up to keep 14 days of them. 14 days * 180mb = 2.5gb or so.
3. The submitter downloads dump files from s3 and puts them in /tmp, so it's using disk for temporary storage of the things it's submitting.

Some thoughts:

1. We could change log rotation to only keep 1 week of socorro-submitter.log files. We usually look into things quickly. We haven't had a situation where we cared about old log files.
2. We could rebuild this node with one with a bigger disk. Looks like it dates back to 2014.
3. We could change this from a long-running node to one that's deployed with everything else. When we switch to Antenna, we're going to have to rethink stage submission, so maybe this isn't worth the work?

It's mfbt on a Friday, so I'm not doing anything big now. I deleted some old unused log files and that cleared up some space.
Also, I noticed a bunch of submitter processes still running and causing a mess of things. :(
I killed a bunch of orphaned processes and suddenly we're back down to 71% disk.

Adding to my thoughts:

4. We could fix the cron script to kill off any submitters that are still running before starting a new one (rough sketch below).
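
The sketch for thought 4 would be something like this. The process pattern and script path are placeholders, not the real ones:

#!/bin/bash
# run from cron: kill any submitter processes left over from a previous run
# before starting a new one, so they don't pile up and chew through disk.
# 'socorro-submitter' and the script path below are placeholders.
pkill -f 'socorro-submitter' || true
exec /path/to/start_submitter.sh
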
Moving this to infra since this is an infra issue.
Component: General → Infra
The stage submitter pinged Miles and said, "hey--i'm feeling sadmad again. want to go out for a drink?" Miles mentioned it casually in passing. So I looked at the node.

It's using 77% disk. Last week it was at 70%.

First, we've been having higher crash volume on Socorro -prod. Higher crash volume -> more crashes siphoned to -stage -> more stage submitter work -> larger stage submitter logs for those days -> more disk usage. I think that's part of the issue.

Second, I saw a bunch of orphaned processes again. I killed them all off and pondered the ethics of adding a "kill all orphaned processes ever" cron job to run at the beginning of the day. I looked at the logs for the days we had orphaned processes and noticed lots of lines like these:

2017/01/27 03:36:53 [ERR] (view) "storeKeyPrefix(socorro/common)" Unexpected response code: 500
2017/01/27 03:36:53 [ERR] (runner) watcher reported error: Unexpected response code: 500
2017/01/27 03:36:54 [ERR] (view) "storeKeyPrefix(socorro/submitter)" Unexpected response code: 500
2017/01/27 03:36:54 [ERR] (runner) watcher reported error: Unexpected response code: 500
2017/01/27 03:36:54 [ERR] (view) "storeKeyPrefix(socorro/common)" Unexpected response code: 500
2017/01/27 03:36:54 [ERR] (runner) watcher reported error: Unexpected response code: 500
2017/01/27 03:43:56 [ERR] (view) "storeKeyPrefix(socorro/submitter)" Unexpected response code: 500
2017/01/27 03:43:56 [ERR] (runner) watcher reported error: Unexpected response code: 500

Doing a search on the Internet suggested this:

That got me suspecting envconsul and wondering whether we're using it correctly in the context of a script. Turns out it's got a -once arg you can pass that tells it not to do demonic things.
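
The change is along these lines. The prefixes are guessed from the log lines above, the wrapped command is a placeholder, and the exact flag spelling can vary by envconsul version:

# without -once, envconsul stays resident watching consul and re-runs the
# command on changes/errors, which looks like where the orphans come from;
# with -once, it fetches the config a single time, runs the command, and exits
envconsul -once -prefix socorro/common -prefix socorro/submitter /path/to/submitter-cron.sh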

I added that to the script. I watched a few cron runs and it seems fine. I'll let it run for a while and check the node on Friday and see if it has orphaned processes again.
I looked at the submitter just now and there are no orphaned processes. I think the -once change either fixed that issue or helped, or whatever was causing the issue has gone away and -once didn't hurt. Anyhow, I'm going to conclude that aspect of this issue is fixed.

Regarding disk space on that node, it's using 78% of disk. The socorro submitter log files account for a large chunk of that (3.3gb out of 8gb). I thought we had logrotate set up to save 14 days, but it turns out it's 20. We don't need that much. I'm dropping that down to 7.
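
For reference, the logrotate stanza ends up along these lines. The log path and the other directives here are illustrative; the part that changed is rotate 7:

/var/log/socorro-submitter.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}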

logrotate should clean up the logs next time it runs. I'll check the number of logs and the disk space tomorrow.
Given that this bug consists of a back and forth with myself, I'm going to assign it to me and continue the soliloquy.
Assignee: nobody → willkg
Logrotate nixed a bunch of logs and seems fine. Disk space usage is down to 55%. There are no orphaned processes.

I'm going to say this is FIXED now. All our theories have been addressed and the symptoms are gone.
Resolution: --- → FIXED