Closed Bug 1483267 Opened 7 years ago Closed 6 years ago

Log parser tasks sometimes ignore the soft time limit and hit TimeLimitExceeded

Categories

(Tree Management :: Treeherder: Data Ingestion, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: ghickman)

References

Details

Attachments

(1 file)

We have soft and hard celery task time limits set up, such that a task is first given a SoftTimeLimitExceeded exception nudge, and then later the whole worker process is killed and a TimeLimitExceeded generated by the master process.

Ideally we should never reach TimeLimitExceeded, since when that happens:
(a) the New Relic agent of the child process won't have had a chance to report the exception
(b) the whole worker process has to restart, wasting time

However, looking at the Papertrail logs, I can see it happening for some log parser tasks:
https://papertrailapp.com/systems/treeherder-prod/events?q="TimeLimitExceeded%3A TimeLimitExceeded"

"""
Aug 14 08:56:22 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:56:22,455] WARNING [treeherder.etl.perf:75] Performance framework job_resource_usage does not exist, skipping load of performance artifacts
Aug 14 08:56:22 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:56:22,455: WARNING/Worker-248] Performance framework job_resource_usage does not exist, skipping load of performance artifacts
Aug 14 08:57:25 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:57:25,094: WARNING/MainProcess] Soft time limit (900s) exceeded for log-parser[6187db73-f095-4716-a9aa-dbd0f217708c]
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:57:55,101: ERROR/MainProcess] Task log-parser[6187db73-f095-4716-a9aa-dbd0f217708c] raised unexpected: TimeLimitExceeded(930,)
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: Traceback (most recent call last):
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4:   File "/app/.heroku/python/lib/python2.7/site-packages/billiard/pool.py", line 645, in on_hard_timeout
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4:     raise TimeLimitExceeded(job._timeout)
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: TimeLimitExceeded: TimeLimitExceeded(930,)
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:57:55,103: ERROR/MainProcess] Hard time limit (930s) exceeded for log-parser[6187db73-f095-4716-a9aa-dbd0f217708c]
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:57:55,223: ERROR/MainProcess] Process 'Worker-241' pid:755 exited with 'signal 9 (SIGKILL)'
"""

The only reason I can see why this would happen is if we are inadvertently catching the `SoftTimeLimitExceeded` exception and ignoring it during log parsing.

From a quick glance, it seems like the error batching here is at fault:
https://github.com/mozilla/treeherder/blob/1fc7f31b37b56eaae92f9be5706ab0c85a609251/treeherder/log_parser/tasks.py#L55-L66

...in that:
(a) when a SoftTimeLimitExceeded is received, the log parser stops the current parsing step, but continues on with the next, which may take longer than the difference between the soft and hard time limits
(b) only the first exception is raised, so if an earlier log parsing step failed, it would hide a later SoftTimeLimitExceeded

We should probably just remove that batching.
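For illustration, here is a minimal sketch of the batching pattern being described. This is not the actual Treeherder code; the function and variable names are hypothetical stand-ins:

```python
from celery.exceptions import SoftTimeLimitExceeded

def parse_all_logs(job_logs):
    # Hypothetical approximation of the batching in log_parser/tasks.py:
    # every parsing step is attempted even after a failure, and only the
    # first collected exception is re-raised at the end.
    exceptions = []
    for job_log in job_logs:
        try:
            parse_log(job_log)  # hypothetical parse step; may raise SoftTimeLimitExceeded
        except Exception as e:
            # Problem (a): SoftTimeLimitExceeded is caught here too, so the
            # loop keeps going and can run past the hard time limit.
            # Problem (b): later exceptions are collected but never surfaced.
            exceptions.append(e)
    if exceptions:
        raise exceptions[0]
```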
George, what are your thoughts on removing the batching linked above? Much of the structure of log_parser/tasks.py is left over from the days when we ran the individual steps (for each type of log) as separate tasks (plus there's cruft for handling legacy log names which may or may not still be used by people submitting data), so the whole thing could probably do with a refactor anyway.
Flags: needinfo?(ghickman)
It could definitely do with some simplification from a quick look through. Are either of the builds-4h/buildbot_text log types going away any time soon? I have some vague recollection of buildbot ingestion being wrapped up, but I'm not sure if it's related to this.

It looks like this task does two things: parse logs (3 types) and cross-reference logs. Should we split those up into two different tasks? Do we know when a log has been parsed? I'm imagining a system where logs of each type are queued for parsing, and at the end of the parser step we queue up a job to cross-reference. The cross-reference job makes sure all log types have been parsed before running and fast-exits if not (see the sketch below).
Flags: needinfo?(ghickman)
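A rough sketch of the split being proposed might look like the following. This is purely illustrative and assumes Celery; the task names and bookkeeping helpers are hypothetical, not code that exists in Treeherder:

```python
from celery import shared_task

@shared_task(name="parse-single-log")
def parse_single_log(job_id, log_name):
    # Hypothetical: parse one log type, record that it's done, then ask the
    # cross-reference step to check whether it can run yet.
    parse_log(job_id, log_name)          # assumed existing parsing logic
    mark_log_parsed(job_id, log_name)    # hypothetical bookkeeping helper
    crossreference_logs.delay(job_id)

@shared_task(name="crossreference-logs")
def crossreference_logs(job_id):
    # Fast-exit until every log type for the job has been parsed.
    if not all_logs_parsed(job_id):      # hypothetical completeness check
        return
    crossreference(job_id)               # assumed existing cross-reference logic
```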
So the tasks were previously split up; however we had a number of issues with broken synchronisation between them all (I'm not against trying again, but it's worth bearing in mind).

Re builds-4h/buildbot_text: the "builds-4h" name is legacy and ideally would be removed (and to be honest "buildbot_text" would also be better named something like "raw_text_log"). The names come from a few sources:
* our own buildbot ETL layer, which we'll be able to entirely remove in the next 3-4 weeks (https://github.com/mozilla/treeherder/blob/e5e0a9faaecdcf92dd4b5b1158b08836c653812e/treeherder/etl/buildapi.py#L193-L198)
* pulse messages from taskcluster (https://github.com/taskcluster/taskcluster-treeherder/blob/fedc8272f26f41de9e47f04d92355e98aaa0f174/src/handler.js#L59-L60)
* pulse messages from any non-taskcluster submitters (currently only one: https://github.com/mozilla/fxtest-jenkins-pipeline/blob/c1e22572071172418cf5bff389ba3ffea0cd8119/vars/submitToTreeherder.groovy#L115)
* REST API job submissions (we want to remove support for this too, but are waiting on the last people to migrate to Pulse, see bug 1349182)
I've just come across this again via some errors in NewRelic:
https://rpm.newrelic.com/accounts/677903/applications/14179733/traced_errors/d391aa70-ab7e-11e8-b58d-0242ac110006_0_4622

After taking a second look at the exception batching specifically, I think we have some options here. Correct me if I'm wrong, but the way I'm interpreting this block (https://github.com/mozilla/treeherder/blob/934bad9187450732f9814a2eb0050be66c26aa6b/treeherder/log_parser/tasks.py#L54-L66) is:
* loop over the logs
* get a parser
* attempt parsing
* if exceptions occurred, raise the first one.

This means we exit the task if any log parsing failed, but only _after_ parsing all the other logs. I see two main problems here:
1. we've broken the contract SoftTimeLimitExceeded provides (tidy up what you're doing and exit)
2. we're hiding N-1 exceptions, skewing our metrics and potentially hiding exceptions

We could handle this by simply not catching exceptions, but then we lose further log parsing for that job. Instead we should make use of manual exception tracking with NewRelic. We can catch all exceptions, re-raise if it's SoftTimeLimitExceeded, and manually send it to NewRelic (https://docs.newrelic.com/docs/agents/python-agent/python-agent-api/record_exception) if it's not. This lets us regain the correct usage of SoftTimeLimitExceeded, track all exceptions raised during parsing, and continue to parse all possible logs.

This will feed more exception reports to NewRelic, but I see this as a good thing since we currently hide some. What do you think about this flow, Ed?
Flags: needinfo?(emorley)
Sounds good to me :-)
Flags: needinfo?(emorley)
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/236f06d07c6c9dff4a0e28758332771a2d3c71d1

Bug 1483267 - Log parsing exceptions to NewRelic and continue parsing

This allows us to respect Celery's SoftTimeLimitExceeded to shut down the current task while also continuing to parse further logs when other exceptions are raised.
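A minimal sketch of the pattern the commit message describes, assuming the Python New Relic agent's record_exception API. This is not the exact committed code; the parse helper is a hypothetical stand-in:

```python
import newrelic.agent
from celery.exceptions import SoftTimeLimitExceeded

def parse_logs(job, job_logs):
    # Keep parsing the remaining logs when one log fails, but never swallow
    # Celery's soft time limit signal.
    for job_log in job_logs:
        try:
            parse_log(job, job_log)  # hypothetical stand-in for the real parse step
        except SoftTimeLimitExceeded:
            # Respect the soft limit: stop immediately so the task can exit
            # before the hard limit SIGKILLs the worker process.
            raise
        except Exception:
            # Record the current exception in New Relic instead of hiding it,
            # then carry on with the next log.
            newrelic.agent.record_exception()
            continue
```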
Assignee: nobody → ghickman
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED