Closed
Bug 1483267
Opened 7 years ago
Closed 6 years ago
Log parser tasks sometimes ignore the soft time limit and hit TimeLimitExceeded
Categories
(Tree Management :: Treeherder: Data Ingestion, enhancement, P1)
Tree Management
Treeherder: Data Ingestion
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: ghickman)
References
Details
Attachments
(1 file)
We have soft and hard Celery task time limits set up, such that a task is first given a SoftTimeLimitExceeded exception as a nudge, and then later the whole worker process is killed and a TimeLimitExceeded is generated by the master process.
Ideally we should never reach TimeLimitExceeded, since when that happens:
(a) the New Relic agent of the child process wouldn't have had a chance to report the exception
(b) the whole worker process has to restart, wasting time
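For reference, this is the pattern we're relying on, as a minimal sketch (the limits match the log excerpt below, but the task and helper names are hypothetical):

    from celery import shared_task
    from celery.exceptions import SoftTimeLimitExceeded

    @shared_task(soft_time_limit=900, time_limit=930)
    def example_task(job_id):
        try:
            do_work(job_id)  # hypothetical work function
        except SoftTimeLimitExceeded:
            # We get ~30s of grace to clean up and exit before the
            # master process SIGKILLs the worker and raises
            # TimeLimitExceeded on its behalf.
            clean_up(job_id)  # hypothetical cleanup
            raise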
However, looking at the Papertrail logs, I can see it happening for some log parser tasks:
https://papertrailapp.com/systems/treeherder-prod/events?q="TimeLimitExceeded%3A TimeLimitExceeded"
"""
Aug 14 08:56:22 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:56:22,455] WARNING [treeherder.etl.perf:75] Performance framework job_resource_usage does not exist, skipping load of performance artifacts
Aug 14 08:56:22 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:56:22,455: WARNING/Worker-248] Performance framework job_resource_usage does not exist, skipping load of performance artifacts
Aug 14 08:57:25 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:57:25,094: WARNING/MainProcess] Soft time limit (900s) exceeded for log-parser[6187db73-f095-4716-a9aa-dbd0f217708c]
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:57:55,101: ERROR/MainProcess] Task log-parser[6187db73-f095-4716-a9aa-dbd0f217708c] raised unexpected: TimeLimitExceeded(930,)
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: Traceback (most recent call last):
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: File "/app/.heroku/python/lib/python2.7/site-packages/billiard/pool.py", line 645, in on_hard_timeout
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: raise TimeLimitExceeded(job._timeout)
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: TimeLimitExceeded: TimeLimitExceeded(930,)
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:57:55,103: ERROR/MainProcess] Hard time limit (930s) exceeded for log-parser[6187db73-f095-4716-a9aa-dbd0f217708c]
Aug 14 08:57:55 treeherder-prod app/worker_log_parser.4: [2018-08-14 07:57:55,223: ERROR/MainProcess] Process 'Worker-241' pid:755 exited with 'signal 9 (SIGKILL)'
"""
The only reason I can see for this happening is that we are inadvertently catching the `SoftTimeLimitExceeded` exception during log parsing and ignoring it.
From a quick glance, it seems like the error batching here is at fault:
https://github.com/mozilla/treeherder/blob/1fc7f31b37b56eaae92f9be5706ab0c85a609251/treeherder/log_parser/tasks.py#L55-L66
...in that:
(a) when a SoftTimeLimitExceeded is received, the log parser stops the current parsing step, but continues on with the next, which may take longer than the difference between the soft and hard time limits
(b) only the first Exception is raised, so if an earlier log parsing step failed, it would hide a later SoftTimeLimitExceeded
We should probably just remove that batching.
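For context, the batching at that link looks roughly like this (paraphrased, not the exact code):

    first_exception = None
    for job_log in job_logs:
        parser = get_parser(job_log)  # hypothetical lookup
        try:
            parser.parse()
        except Exception as e:
            # (a) SoftTimeLimitExceeded is swallowed here too, so parsing
            # carries on with the next log and can overrun the hard limit.
            # (b) Only the first exception survives; any later ones
            # (including SoftTimeLimitExceeded) are dropped.
            if first_exception is None:
                first_exception = e
    if first_exception is not None:
        raise first_exception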
Reporter
Comment 1•7 years ago
George, what are your thoughts on removing the batching linked above?
Much of the structure of log_parser/tasks.py is left over from the days when we ran the individual steps (for each type of log) as separate tasks (plus there's cruft for handling legacy log names, which may or may not still be in use by people submitting data), so the whole thing could probably do with a refactor anyway.
Flags: needinfo?(ghickman)
Assignee
Comment 2•7 years ago
From a quick look through, it could definitely do with some simplification. Are either of the builds-4h/buildbot_text log types going away any time soon? I have some vague recollection of buildbot ingestion being wrapped up, but I'm not sure it's related to this?
It looks like this task does two things: parse logs (3 types) and cross-reference them. Should we split those into two different tasks? Do we know when a log has been parsed? I'm imagining a system where logs of each type are queued for parsing, and at the end of the parsing step we queue up a job to cross-reference. The cross-reference job makes sure all log types have been parsed before running and fast-exits if not.
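Something like this, as a very rough sketch (all names hypothetical):

    from celery import shared_task

    @shared_task
    def parse_log(job_id, log_name):
        store_parsed_log(job_id, log_name)  # hypothetical parse + store
        # Each parse queues a cross-reference attempt; that task decides
        # for itself whether it's ready to run yet.
        crossreference_logs.delay(job_id)

    @shared_task
    def crossreference_logs(job_id):
        if not all_logs_parsed(job_id):  # hypothetical readiness check
            return  # fast exit; a later parse task will queue us again
        do_crossreference(job_id)  # hypothetical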
Flags: needinfo?(ghickman)
Reporter
Comment 3•7 years ago
So the tasks were previously split up; however, we had a number of issues with broken synchronisation between them all (I'm not against trying again, but it's worth bearing in mind).
Re builds-4h/buildbot_text, the "builds-4h" name is legacy and ideally would be removed (and to be honest the "buildbot_text" would also be better named something like "raw_text_log" etc). The names come from a few sources:
* our own buildbot ETL layer, which we'll be able to entirely remove in the next 3-4 weeks (https://github.com/mozilla/treeherder/blob/e5e0a9faaecdcf92dd4b5b1158b08836c653812e/treeherder/etl/buildapi.py#L193-L198)
* pulse messages from taskcluster (https://github.com/taskcluster/taskcluster-treeherder/blob/fedc8272f26f41de9e47f04d92355e98aaa0f174/src/handler.js#L59-L60)
* pulse messages from any non-taskcluster submitters (currently only one: https://github.com/mozilla/fxtest-jenkins-pipeline/blob/c1e22572071172418cf5bff389ba3ffea0cd8119/vars/submitToTreeherder.groovy#L115)
* REST API job submissions (we want to remove support for this too, but waiting on last people to migrate to Pulse, see bug 1349182)
Assignee
Comment 4•6 years ago
I've just come across this again via some errors in NewRelic: https://rpm.newrelic.com/accounts/677903/applications/14179733/traced_errors/d391aa70-ab7e-11e8-b58d-0242ac110006_0_4622
After taking a second look at the exception batching specifically, I think we have some options here. Correct me if I'm wrong, but I'm interpreting this block (https://github.com/mozilla/treeherder/blob/934bad9187450732f9814a2eb0050be66c26aa6b/treeherder/log_parser/tasks.py#L54-L66) as:
* loop log lines
* get a parser
* attempt parsing
* if exceptions occurred, raise the first one.
This means we exit the task if any log parsing failed but only _after_ parsing all the other logs. I see two main problems here:
1. we've broken the contract SoftTimeLimitExceeded provides (tidy up what you're doing and exit)
2. we're hiding N-1 exceptions, skewing our metrics and masking real failures
We could handle this by simply not catching exceptions, but then we'd lose further log parsing for that job. Instead we should make use of manual exception tracking with NewRelic: catch all exceptions, re-raise if it's SoftTimeLimitExceeded, and manually send to NewRelic (https://docs.newrelic.com/docs/agents/python-agent/python-agent-api/record_exception) if it's not. This lets us regain the correct usage of SoftTimeLimitExceeded, track all parsing exceptions, and continue to parse all possible logs.
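Roughly this, as a sketch (record_exception per the docs above; the per-log parse call is hypothetical):

    import newrelic.agent
    from celery.exceptions import SoftTimeLimitExceeded

    for job_log in job_logs:
        try:
            parse_log(job_log)  # hypothetical per-log parse step
        except SoftTimeLimitExceeded:
            # Honour the soft limit's contract: stop work and exit now,
            # before the hard limit SIGKILLs the whole worker.
            raise
        except Exception:
            # Record the failure in New Relic, but keep parsing the
            # remaining logs for this job.
            newrelic.agent.record_exception()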
This will feed more exception reports to NewRelic, but I see this as a good thing since we currently hide some.
What do you think about this flow Ed?
Flags: needinfo?(emorley)
Comment 7•6 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/236f06d07c6c9dff4a0e28758332771a2d3c71d1
Bug 1483267 - Log parsing exceptions to NewRelic and continue parsing
This allows us to respect Celery's SoftTimeLimitExceeded to shut down
the current task while also continuing to parse further logs when other
exceptions are raised.
Reporter
Updated•6 years ago
Assignee: nobody → ghickman
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED