Closed
Bug 1483289
Opened 7 years ago
Closed 6 years ago
fetch_bugs tasks sometimes ignore the soft time limit and hit TimeLimitExceeded
Categories
(Tree Management :: Treeherder: Data Ingestion, defect, P1)
Tree Management
Treeherder: Data Ingestion
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: emorley)
References
Details
We have soft and hard celery task time limits set up, such that a task is first given a SoftTimeLimitExceeded exception nudge, and then later the whole worker process killed and a TimeLimitExceeded generated by the master process.
Ideally we should never reach TimeLimitExceeded, since when that happens:
(a) the New Relic agent of the child process wouldn't have had a chance to report the exception
(b) the whole worker process has to restart, wasting time
However looking at the Papertrail logs, I can see it happening for some fetch_bugs tasks:
https://papertrailapp.com/systems/treeherder-prod/events?q="TimeLimitExceeded%3A TimeLimitExceeded"
Variant 1:
"""
Aug 13 04:10:00 treeherder-prod app/worker_default.2: [2018-08-13 03:10:00,065: WARNING/MainProcess] Soft time limit (600s) exceeded for fetch-bugs[204346fd-2723-4749-b1c3-629b00060f4f]
Aug 13 04:10:00 treeherder-prod app/worker_default.2: [2018-08-13 03:10:00,067] ERROR [treeherder.etl.bugzilla:72] error inserting bug '{u'keywords': [u'crash', u'intermittent-failure'], u'status': u'RESOLVED', u'resolution': u'INCOMPLETE', u'whiteboard': u'', u'id': 1466351, u'summary': u'Intermittent PROCESS-CRASH | Main app process exited normally | application crashed [@ libvDSP.dylib + 0x49010]', u'cf_crash_signature': u'[@ libvDSP.dylib + 0x49010]', u'last_change_time': u'2018-06-25T06:41:52Z', u'op_sys': u'Unspecified'}' into db: SoftTimeLimitExceeded()
"""
Variant 2:
"""
Aug 12 13:10:00 treeherder-prod app/worker_default.2: [2018-08-12 12:10:00,073: WARNING/MainProcess] Soft time limit (600s) exceeded for fetch-bugs[c835d956-d571-4d69-8167-7c2f23bb37b5]
Aug 12 13:15:30 treeherder-prod app/worker_default.2: [2018-08-12 12:15:30,078: ERROR/MainProcess] Task fetch-bugs[c835d956-d571-4d69-8167-7c2f23bb37b5] raised unexpected: TimeLimitExceeded(930,)
Aug 12 13:15:30 treeherder-prod app/worker_default.2: Traceback (most recent call last):
Aug 12 13:15:30 treeherder-prod app/worker_default.2: File "/app/.heroku/python/lib/python2.7/site-packages/billiard/pool.py", line 645, in on_hard_timeout
Aug 12 13:15:30 treeherder-prod app/worker_default.2: raise TimeLimitExceeded(job._timeout)
Aug 12 13:15:30 treeherder-prod app/worker_default.2: TimeLimitExceeded: TimeLimitExceeded(930,)
Aug 12 13:15:30 treeherder-prod app/worker_default.2: [2018-08-12 12:15:30,079: ERROR/MainProcess] Hard time limit (930s) exceeded for fetch-bugs[c835d956-d571-4d69-8167-7c2f23bb37b5]
Aug 12 13:15:35 treeherder-prod app/worker_default.2: [2018-08-12 12:15:34,766: ERROR/MainProcess] Process 'Worker-3' pid:41 exited with 'signal 9 (SIGKILL)'
"""
For variant 1, we're catching the `SoftTimeLimitExceeded` due to a generic `except Exception` (added in bug 1289574), and so ignoring it:
https://github.com/mozilla/treeherder/blob/1fc7f31b37b56eaae92f9be5706ab0c85a609251/treeherder/etl/bugzilla.py#L55-L72
We should catch on a more specific exception type instead.
For variant 2, I'm not entirely sure as to the cause - but given that there often seems to be a lot of DB warnings spew from the fetch_bugs tasks - perhaps the New Relic agent is taking longer than 30 seconds (the gap between the soft and hard time limits) to finish processing the transaction?
Assignee | ||
Comment 1•7 years ago
|
||
Actually for variant 2 there was a five minute gap, so no idea what happened there.
Assignee | ||
Comment 2•6 years ago
|
||
There have been no recent occurrences in the logs.
This was fixed by a combination of bug 1176492 and bug 1483301.
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•