Closed Bug 1483289 Opened 7 years ago Closed 6 years ago

fetch_bugs tasks sometimes ignore the soft time limit and hit TimeLimitExceeded

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

We have soft and hard celery task time limits set up, such that a task is first given a SoftTimeLimitExceeded exception nudge, and then later the whole worker process killed and a TimeLimitExceeded generated by the master process. Ideally we should never reach TimeLimitExceeded, since when that happens: (a) the New Relic agent of the child process wouldn't have had a chance to report the exception (b) the whole worker process has to restart, wasting time However looking at the Papertrail logs, I can see it happening for some fetch_bugs tasks: https://papertrailapp.com/systems/treeherder-prod/events?q="TimeLimitExceeded%3A TimeLimitExceeded" Variant 1: """ Aug 13 04:10:00 treeherder-prod app/worker_default.2: [2018-08-13 03:10:00,065: WARNING/MainProcess] Soft time limit (600s) exceeded for fetch-bugs[204346fd-2723-4749-b1c3-629b00060f4f] Aug 13 04:10:00 treeherder-prod app/worker_default.2: [2018-08-13 03:10:00,067] ERROR [treeherder.etl.bugzilla:72] error inserting bug '{u'keywords': [u'crash', u'intermittent-failure'], u'status': u'RESOLVED', u'resolution': u'INCOMPLETE', u'whiteboard': u'', u'id': 1466351, u'summary': u'Intermittent PROCESS-CRASH | Main app process exited normally | application crashed [@ libvDSP.dylib + 0x49010]', u'cf_crash_signature': u'[@ libvDSP.dylib + 0x49010]', u'last_change_time': u'2018-06-25T06:41:52Z', u'op_sys': u'Unspecified'}' into db: SoftTimeLimitExceeded() """ Variant 2: """ Aug 12 13:10:00 treeherder-prod app/worker_default.2: [2018-08-12 12:10:00,073: WARNING/MainProcess] Soft time limit (600s) exceeded for fetch-bugs[c835d956-d571-4d69-8167-7c2f23bb37b5] Aug 12 13:15:30 treeherder-prod app/worker_default.2: [2018-08-12 12:15:30,078: ERROR/MainProcess] Task fetch-bugs[c835d956-d571-4d69-8167-7c2f23bb37b5] raised unexpected: TimeLimitExceeded(930,) Aug 12 13:15:30 treeherder-prod app/worker_default.2: Traceback (most recent call last): Aug 12 13:15:30 treeherder-prod app/worker_default.2: File "/app/.heroku/python/lib/python2.7/site-packages/billiard/pool.py", line 645, in on_hard_timeout Aug 12 13:15:30 treeherder-prod app/worker_default.2: raise TimeLimitExceeded(job._timeout) Aug 12 13:15:30 treeherder-prod app/worker_default.2: TimeLimitExceeded: TimeLimitExceeded(930,) Aug 12 13:15:30 treeherder-prod app/worker_default.2: [2018-08-12 12:15:30,079: ERROR/MainProcess] Hard time limit (930s) exceeded for fetch-bugs[c835d956-d571-4d69-8167-7c2f23bb37b5] Aug 12 13:15:35 treeherder-prod app/worker_default.2: [2018-08-12 12:15:34,766: ERROR/MainProcess] Process 'Worker-3' pid:41 exited with 'signal 9 (SIGKILL)' """ For variant 1, we're catching the `SoftTimeLimitExceeded` due to a generic `except Exception` (added in bug 1289574), and so ignoring it: https://github.com/mozilla/treeherder/blob/1fc7f31b37b56eaae92f9be5706ab0c85a609251/treeherder/etl/bugzilla.py#L55-L72 We should catch on a more specific exception type instead. For variant 2, I'm not entirely sure as to the cause - but given that there often seems to be a lot of DB warnings spew from the fetch_bugs tasks - perhaps the New Relic agent is taking longer than 30 seconds (the gap between the soft and hard time limits) to finish processing the transaction?
Actually for variant 2 there was a five minute gap, so no idea what happened there.

There have been no recent occurrences in the logs.

This was fixed by a combination of bug 1176492 and bug 1483301.

Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.