Closed Bug 1229020 Opened 9 years ago Closed 9 years ago

ETL nodes on stage running low on disk space

Categories: Tree Management :: Treeherder: Infrastructure (defect, P1)

Tracking: Not tracked

Status: RESOLVED FIXED

People: (Reporter: emorley, Assigned: fubar)

"time to full 8 days"

https://rpm.newrelic.com/accounts/677903/servers/5753307/disks?tw%5Bend%5D=1448885356&tw%5Bstart%5D=1448280556#id=5b2253797374656d2f46696c6573797374656d2f5e2f557365642f6279746573222c2253797374656d2f4469736b2f5e6465765e73646133225d

Of the ~20GB, 17GB is in the /var/log/celery directory:

[emorley@treeherder-etl2.stage.private.scl3 celery]$ ll -h
total 17G
-rw-r--r-- 1 treeherder treeherder 1.1M Nov 30 15:01 celery_worker_buildapi.log
-rw-r--r-- 1 treeherder treeherder 150M Nov  8 03:16 celery_worker_buildapi.log-20151108
-rw-r--r-- 1 treeherder treeherder 6.8M Nov 15 03:43 celery_worker_buildapi.log-20151115
-rw-r--r-- 1 treeherder treeherder 6.3M Nov 22 03:26 celery_worker_buildapi.log-20151122
-rw-r--r-- 1 treeherder treeherder 6.6M Nov 29 03:15 celery_worker_buildapi.log-20151129
-rw-r--r-- 1 treeherder treeherder 5.7G Nov 30 15:01 celery_worker_pushlog.log
-rw-r--r-- 1 treeherder treeherder  73M Nov  8 03:16 celery_worker_pushlog.log-20151108
-rw-r--r-- 1 treeherder treeherder  67M Nov 15 03:43 celery_worker_pushlog.log-20151115
-rw-r--r-- 1 treeherder treeherder  57M Nov 22 03:26 celery_worker_pushlog.log-20151122
-rw-r--r-- 1 treeherder treeherder  11G Nov 29 03:15 celery_worker_pushlog.log-20151129

[emorley@treeherder-etl2.stage.private.scl3 celery]$ head -n 10 celery_worker_pushlog.log
[2015-11-28 19:15:03,226: INFO/Worker-129] Starting new HTTPS connection (1): treeherder.allizom.org
[2015-11-28 19:15:03,283: ERROR/Worker-129] Error submitting data to https://treeherder.allizom.org/api/project/try/resultset/
[2015-11-28 19:15:03,285: ERROR/Worker-129] Request headers: {'Content-Length': '2341', 'Accept-Encoding': 'gzip, deflate', 'Accept': u'application/json; version=1.0', 'User-Agent': u'treeherder-pyclient/1.8.0', 'Connection': 'keep-alive', 'Content-Type': 'application/json', 'Authorization': u'Hawk mac="REDACTED", hash="REDACTED", id="treeherder-etl", ts="1448766903", nonce="REDACTED"'}
[2015-11-28 19:15:03,285: ERROR/Worker-129] Request body: [{"revision_hash": "ec3d327291daa266aecb9de8d56993704a3e9261", "push_timestamp": 1448612966, "author": "cpeterson@mozilla.com", "active_status": "active", "revisions": [{"comment": "Bug NUMBER - Fix MSVC C4273 \"inconsistent dll linkage\" warnings in js. r?\n\njs/src/jsmath.cpp(1121) : warning C4273: 'log2' : inconsistent dll linkage\njs/src/jsmath.cpp(1146) : warning C4273: 'log1p' : inconsistent dll linkage\njs/src/jsmath.cpp(1192) : warning C4273: 'expm1' : inconsistent dll linkage\njs/src/jsmath.cpp(1296) : warning C4273: 'acosh' : inconsistent dll linkage\njs/src/jsmath.cpp(1406) : warning C4273: 'atanh' : inconsistent dll linkage\njs/src/jsmath.cpp(1606) : warning C4273: 'cbrt' : inconsistent dll linkage", "author": "Chris Peterson <cpeterson@mozilla.com>", 
...
[2015-11-28 19:15:03,286: ERROR/Worker-129] Response headers: {'content-encoding': 'gzip', 'transfer-encoding': 'chunked', 'x-backend-server': 'treeherder1.stage.webapp.scl3.mozilla.com', 'vary': 'Accept,Cookie, Accept-Encoding', 'server-authorization': 'Hawk mac="REDACTED", hash="REDACTED"', 'keep-alive': 'timeout=5, max=1000', 'retry-after': '1', 'connection': 'Keep-Alive', 'allow': 'GET, POST, HEAD, OPTIONS', 'x-newrelic-app-data': 'REDACTED', 'date': 'Sun, 29 Nov 2015 03:15:03 GMT', 'server': 'gunicorn/19.3.0', 'content-type': 'application/json; version=1.0'}
[2015-11-28 19:15:03,286: ERROR/Worker-129] Response body: {"detail":"Request was throttled. Expected available in 1.0 second."}
[2015-11-28 19:15:03,290: INFO/Worker-129] Starting new HTTPS connection (1): treeherder.allizom.org
[2015-11-28 19:15:03,337: ERROR/Worker-129] Error submitting data to https://treeherder.allizom.org/api/project/try/resultset/
[2015-11-28 19:15:03,339: ERROR/Worker-129] Request headers: {'Content-Length': '658', 'Accept-Encoding': 'gzip, deflate', 'Accept': u'application/json; version=1.0', 'User-Agent': u'treeherder-pyclient/1.8.0', 'Connection': 'keep-alive', 'Content-Type': 'application/json', 'Authorization': u'Hawk mac="REDACTED", hash="REDACTED", id="treeherder-etl", ts="1448766903", nonce="REDACTED"'}
[2015-11-28 19:15:03,339: ERROR/Worker-129] Request body: [{"revision_hash": "735dbdbded5dc0987d26626dd6687a511ca91609", "push_timestamp": 1448495934, "author": "eisaacson@mozilla.com", "active_status": "active", "revisions": [{"comment": "Bug 1226015 - Have child send __delete__ in speech synth request protocol, fixes race. r=smaug", "author": "Eitan Isaacson <eitan@monotonous.org>", "repository": "try", "branch": "default", "revision": "890ea0115643d338f4a251604457a4b4711f24a0"}, {"comment": "try: -b do -p all -u all -t all --post-to-bugzilla Bug 1226015", "author": "Eitan Isaacson <eitan@monotonous.org>", "repository": "try", "branch": "default", "revision": "0659b8e884485f6804bf3733f258e2f3354bfb79"}]}]


A few issues:
1) We dump the body and headers of both the request and the response - I think we should log only the response body at error level (and move the headers and other details to log.debug())
2) We hit our own rate limit for resultset submission (likely due to fetch-missing-pushlog spam)
Depends on: 1229022
We should just get rid of fetch-missing-pushlogs once and for all; it was really the cause of this. (I imagine once we hit the rate limit, the normal ingestion was not able to submit resultsets, causing more "missing pushes" and so even more requests hammering the API.)
Depends on: 1191934
Why do we need celery logs for more than a couple of days?
I think it's useful to keep them for up to 7 days (we don't always have time to look into an issue straight away) - and capping at 7 days still wouldn't have helped us in this case, since 16.5GB of the 17GB was written in the last 1.5 days alone.
The logrotate jobs are, for some unfathomable reason, NOT compressing. I think the issue is that logrotate doesn't have a great way to tell the celery jobs that their log has rolled, but it seems likely that we'd get a push within a week, so using delaycompress is sufficient.
Changed the logrotate config to use delaycompress and copytruncate. Are the celery jobs using WatchedFileHandler? That would help, and we could avoid copytruncate and possibly losing a tiny bit of logging.
This is becoming a problem on prod too, a new relic alert was just opened:

https://rpm.newrelic.com/accounts/677903/incidents/18721745

Do we need to do something similar there? It looks like we're waiting a week to compress on prod:

[wlachance@... celery]$ ls -alh
total 17G
drwxr-xr-x   2 treeherder treeherder 4.0K Nov 30 20:27 .
drwxr-xr-x. 15 root       root       4.0K Dec  2 22:52 ..
-rw-r--r--   1 treeherder treeherder 5.2M Dec  2 22:52 celery_worker_buildapi.log
-rw-r--r--   1 treeherder treeherder 1.1M Nov  8 03:31 celery_worker_buildapi.log-20151108.gz
-rw-r--r--   1 treeherder treeherder  24M Nov 15 03:41 celery_worker_buildapi.log-20151115.gz
-rw-r--r--   1 treeherder treeherder 1.1M Nov 22 03:41 celery_worker_buildapi.log-20151122.gz
-rw-r--r--   1 treeherder treeherder  13M Nov 29 03:17 celery_worker_buildapi.log-20151129
-rw-r--r--   1 treeherder treeherder  11G Dec  2 22:53 celery_worker_pushlog.log
-rw-r--r--   1 treeherder treeherder  11M Nov  8 03:31 celery_worker_pushlog.log-20151108.gz
-rw-r--r--   1 treeherder treeherder  12M Nov 15 03:41 celery_worker_pushlog.log-20151115.gz
-rw-r--r--   1 treeherder treeherder 9.1M Nov 22 03:42 celery_worker_pushlog.log-20151122.gz
-rw-r--r--   1 treeherder treeherder 5.8G Nov 29 03:17 celery_worker_pushlog.log-20151129
Flags: needinfo?(klibby)
Yes, delaycompress plus weekly rotation means last week's logs will not be gzipped. I'll zip this one, but you're still logging over 3GB *daily* to celery_worker_pushlog.log.
Flags: needinfo?(klibby)
treeherder-etl1.private ran out of disk space this morning, as logrotate didn't have enough free space to rotate/compress the logs yesterday. I wiped out the 28GB log file, and log rotation will now happen daily, keeping 14 days' worth of logs.
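For reference, a logrotate stanza along these lines would give that behaviour - a sketch only; the actual paths and options in the deployed config may differ:

```
# /etc/logrotate.d/celery (sketch; not the deployed config)
/var/log/celery/*.log {
    daily
    rotate 14
    compress
    delaycompress
    copytruncate
    missingok
    notifempty
}
```

delaycompress leaves the most recent rotated log uncompressed so anything still writing to the old file descriptor isn't truncated mid-write, and copytruncate avoids needing to signal the celery workers at all.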
Many thanks! :-)
Depends on: 1232776
Assignee: emorley → klibby
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED