Log parsing is not working, shows as log parsing incomplete
Categories
(Tree Management :: Treeherder, defect)
Tracking
(Not tracked)
People
(Reporter: nataliaCs, Assigned: mozilla)
References
(Blocks 1 open bug)
Details
We're seeing a large number of failures for which only the raw logs are available.
Failures appear as having log parsing incomplete: e.g. push https://treeherder.mozilla.org/jobs?repo=mozilla-central&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=0e1c1a720ca740cce683878abb6836de6520c9ce&selectedTaskRun=LS1UWMaBQLWd3_NJ9p2WaQ.0
Encountered ^ on the beta simulation https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=K9j3NgcCQdW5vgwVpJQ8Ag.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunning%2Cpending%2Crunnable&revision=bb19434b4f9edc62865afb0dfc734e5ec97ed18f and on another recent Try push: https://treeherder.mozilla.org/jobs?repo=try&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunning%2Cpending%2Crunnable&revision=e18865eb0673361e2f4d381c6c8382ea25c1221c&selectedTaskRun=UEJTCNi5S_aDi2DHLwagSQ.0
Aryx mentioned that something on Try may have caused the parsing delay, but as far as I checked I did not see anything obvious.
Closed autoland and raised the issue on Element in #Treeherder and #FirefoxCI, and on Slack as well.
Comment 1•4 years ago
The situation is unchanged:
- Logs for failed tasks are not getting parsed. The oldest unparsed tasks completed at 10:18 UTC, the latest parsed one at 10:37 UTC: https://sql.telemetry.mozilla.org/queries/76527/source
- There are memory alerts for the log parser for failed tasks in the #treeherder-ops channel: "Memory R15 (not web dyno)" search found 1 match in the past minute - https://papertrailapp.com/searches/87645472 - Nov 24 16:05:59 treeherder-prototype heroku/worker_log_parser_fail.2: Error R15 (Memory quota vastly exceeded)
- All Treeherder instances are affected:
  - production
  - staging
  - prototype
  - prototype2

Is parsing done per TH instance? If so, it must be the data in the logs which trips all of them.
Comment 2•4 years ago
Jon Buckley increased the instance size to the maximum, and the backlog of unparsed logs decreased from ~1k to ~280, but then the parser stopped making progress. He also noticed these error messages:
2020-11-24T17:09:35.537321+00:00 app[worker_log_parser_fail.1]: [2020-11-24 17:09:35,535: WARNING/ForkPoolWorker-1] Got OperationalError inserting failure_line
2020-11-24T17:09:40.082494+00:00 app[worker_log_parser_fail.1]: [2020-11-24 17:09:40,082] INFO [treeherder.log_parser.tasks:51] Skipping parsing for job 519884494 since log already processed. Log Status: 1
Trees remain closed, except for Try, which is being reopened.
Comment 3•4 years ago
With the maximum instance size, all queued messages have been read and the logs processed where possible. Logs for some autoland tasks (likely all containing check-in related failures) could not be processed, e.g. https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&selectedTaskRun=bTot5GulRTmR1DKRcF2KBQ.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=f6f29c9cc744bff2c94903bed8b625c8cbac9118
Comment 4•4 years ago
Trees have been open for an hour without hitting the issue again.
Comment 5•4 years ago
Autoland is closed due to the log parsing issue.
Comment 6•4 years ago
Marco pointed out that some logs have gotten huge, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/b8TA_bTSRWWa6U5qn1Ykgg
Comment 7•4 years ago
We should try to figure out when this started.
Comment 8•4 years ago
I opened Autoland ~50 minutes ago. Log parsing works as usual.
Comment hidden (Intermittent Failures Robot)
Comment 10•4 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #6)
> Marco pointed out that some logs have gotten huge, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/b8TA_bTSRWWa6U5qn1Ykgg

It does show in Treeherder that the log was skipped, though: https://treeherder.mozilla.org/jobs?repo=autoland&revision=f6f29c9cc744bff2c94903bed8b625c8cbac9118&selectedTaskRun=b8TA_bTSRWWa6U5qn1Ykgg.0
It would help to know when exactly this was first noticed - i.e. the first affected job - but that might not be the smoking gun.
Comment 11•4 years ago
We have a post-mortem scheduled for tomorrow. Here's the doc: https://docs.google.com/document/d/1umtnnuXci2X8aOdN6cPGH3DQrWJiRqyfgPFsG7CF5uo/edit#
Comment 12•4 years ago
In the post-mortem, we discussed two tasks to mitigate the impact of memory growth due to large log lines. We already have two queues, `log_parser` and `log_parser_fail`, which split according to the `result` of a task. But we want to split `log_parser_fail` down further.

- Separate the processing of the raw log and the `_errorsummary.log` into separate queues and workers.
- Separate log processing by repository. One queue for the following, one queue for everything else:
  - autoland
  - mozilla-central
  - mozilla-beta
  - mozilla-release
  - mozilla-esr78

Obviously, this split will mean 4 separate queues altogether. We will leave the `log_parser` queue alone; it doesn't appear to be hitting any issues.

There is some technical debt in this area, but I will outline here the steps that would be taken to implement this.

- The main technical debt, most likely just code that can be ripped out, is https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L80-L95, along with all the functions that this code calls that are not used elsewhere. We would also remove the two queues `log_autoclassify` and `log_autoclassify_fail`.
- This is where we decide whether to handle the logs for a task in the `fail` queue or the normal one: https://github.com/mozilla/treeherder/blob/d251842d30b61dc41070eb2b5f14bd8f18fb2bc1/treeherder/etl/jobs.py#L337-L344
- We should add logic to the `!= 'success'` section to split it between the raw log and the `errorsummary.log`. I propose creating two new queues, `log_parser_fail_raw` and `log_parser_fail_json` (but other names would work).
- We need to modify `parse_logs` here: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L24 . My thought is to have `parse_logs` no longer be an async task. The logic in there decides which `parser` function to use for the raw or the json (errorsummary) log. I think we should change this so that the `parser`s themselves are async tasks that get called, placing the task in the appropriate queue (see the sketch below).
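To make that last step concrete, here is a minimal sketch, assuming hypothetical task names and a simplified `JobLog` shape; the real `parse_logs` signature and log names in Treeherder may differ:

```python
from celery import shared_task

@shared_task
def parse_raw_log(job_id, job_log_id):
    """Async task: run the raw (live backing) log parser for one log."""
    ...

@shared_task
def store_errorsummary(job_id, job_log_id):
    """Async task: parse the *_errorsummary.log artifact and store failure lines."""
    ...

def parse_logs(job, job_logs, queue_suffix=""):
    """No longer a Celery task itself: just pick the parser task for each log
    and place it on the appropriate queue."""
    for job_log in job_logs:
        if job_log.name == "errorsummary_json":          # assumed log name
            task, queue = store_errorsummary, "log_parser_fail_json"
        else:
            task, queue = parse_raw_log, "log_parser_fail_raw"
        task.apply_async(args=[job.id, job_log.id], queue=queue + queue_suffix)
```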
Comment 13•4 years ago
Another thing that was mentioned is either putting a cap on error summary logs - so skipping them if they are too large - and/or limiting the size of the lines we store in the tables. We do that now for the raw summary/live backing logs (in `post_log_artifacts`, and I think also in the parser itself) but not for the error summary logs.
Edit: but otherwise, that proposal sounds good :)
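As a rough illustration of the skip idea only (the constant and helper below are made up, not existing Treeherder code), the artifact's reported size could be checked before it is ever downloaded:

```python
import requests

MAX_ERRORSUMMARY_BYTES = 5 * 1024 * 1024   # illustrative cap, not a real setting

def errorsummary_too_large(url):
    """Return True if the errorsummary artifact's reported size exceeds the cap,
    so the parser can mark it skipped instead of loading it into memory."""
    resp = requests.head(url, allow_redirects=True, timeout=30)
    length = resp.headers.get("Content-Length")
    return length is not None and int(length) > MAX_ERRORSUMMARY_BYTES
```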
Comment 14•4 years ago
Following on comment 12, I forgot to add a few more steps:

- This file is where we declare the worker (dyno) allocation for the new queues: https://github.com/mozilla/treeherder/blob/7587fb224e800fa7ecab7dd39ac5db43bab33ea0/Procfile#L46-L47 . So lines will need to be added here to enable handling of the new queues.
  a. Once a PR has been deployed, the worker will be created, but not yet provisioned. You'll need to go into Heroku (https://dashboard.heroku.com/apps/treeherder-prototype/resources) and set the type and number of dynos.
  b. Also here: https://github.com/mozilla/treeherder/blob/22d854be68a3ef5884aacf1424b27a52824c2fbd/app.json#L78
- The new queues also need to be specified here: https://github.com/mozilla/treeherder/blob/1d8db11414501096572fb55e1f5f7b261a067b2a/treeherder/config/settings.py#L328-L329 (see the sketch below).
- Update documentation line 62 here: https://github.com/mozilla/treeherder/blob/c9b86aa40305b7f9e607129fd833c27cbac36b15/docs/pulseload.md#L62
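As an illustration of the settings step, this is roughly what the queue-list change might look like, assuming the setting is a list of kombu `Queue` objects (the setting name and queue names follow the proposal here and may differ in the final PR):

```python
from kombu import Queue

CELERY_TASK_QUEUES = [
    # ...the existing queues (default, log_parser, etc.) stay as they are...
    Queue("log_parser", routing_key="log_parser"),
    # New queues replacing the single log_parser_fail queue:
    Queue("log_parser_fail_raw", routing_key="log_parser_fail_raw"),
    Queue("log_parser_fail_json", routing_key="log_parser_fail_json"),
]
```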
Comment 15•4 years ago
Jira issue is here: https://jira.mozilla.com/browse/RELENG-158
Assignee
Updated•4 years ago
Assignee
Comment 16•4 years ago
WIP: https://github.com/mozilla/treeherder/pull/6891
(In reply to Cameron Dawson [:camd] from comment #12)
> We already have two queues, `log_parser` and `log_parser_fail`, which split according to the `result` of a task. But we want to split `log_parser_fail` down further.
>
> - Separate the processing of the raw log and the `_errorsummary.log` into separate queues and workers.

Your instructions seem to cover this very well.

> - Separate log processing by repository. One queue for the following, one queue for everything else:
>   - autoland
>   - mozilla-central
>   - mozilla-beta
>   - mozilla-release
>   - mozilla-esr78

Your instructions don't seem to cover this. Is there an easy way to tell which project/repo we're on? That means instead of 3 queues we'll have 6, yes? Or are we not doing this?

> Obviously, this split will mean 4 separate queues altogether. We will leave the `log_parser` queue alone; it doesn't appear to be hitting any issues.

4 separate fail queues? Or 4 total queues? I count 3 or 6.

> There is some technical debt in this area, but I will outline here the steps that would be taken to implement this.
>
> - The main technical debt, most likely just code that can be ripped out, is https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L80-L95, along with all the functions that this code calls that are not used elsewhere. We would also remove the two queues `log_autoclassify` and `log_autoclassify_fail`.

Done.

> - This is where we decide whether to handle the logs for a task in the `fail` queue or the normal one: https://github.com/mozilla/treeherder/blob/d251842d30b61dc41070eb2b5f14bd8f18fb2bc1/treeherder/etl/jobs.py#L337-L344
> - We should add logic to the `!= 'success'` section to split it between the raw log and the `errorsummary.log`. I propose creating two new queues, `log_parser_fail_raw` and `log_parser_fail_json` (but other names would work).

Done. At this point we have 3 queues, yes?

> - We need to modify `parse_logs` here: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L24 . My thought is to have `parse_logs` no longer be an async task. The logic in there decides which `parser` function to use for the raw or the json (errorsummary) log. I think we should change this so that the `parser`s themselves are async tasks that get called, placing the task in the appropriate queue.

I'm not understanding this 100%, but I added some guesses to the PR.

(In reply to Sarah Clements [:sclements] from comment #13)
> Another thing that was mentioned is either putting a cap on error summary logs - so skipping them if they are too large - and/or limiting the size of the lines we store in the tables. We do that now for the raw summary/live backing logs (in `post_log_artifacts`, and I think also in the parser itself) but not for the error summary logs.

I'm not finding where I'd hook this in... in `parse_logs`?

(In reply to Cameron Dawson [:camd] from comment #14)
> Following on comment 12, I forgot to add a few more steps:
>
> - This file is where we declare the worker (dyno) allocation for the new queues: https://github.com/mozilla/treeherder/blob/7587fb224e800fa7ecab7dd39ac5db43bab33ea0/Procfile#L46-L47 . So lines will need to be added here to enable handling of the new queues.

Done.

>   a. Once a PR has been deployed, the worker will be created, but not yet provisioned. You'll need to go into Heroku (https://dashboard.heroku.com/apps/treeherder-prototype/resources) and set the type and number of dynos.
>   b. Also here: https://github.com/mozilla/treeherder/blob/22d854be68a3ef5884aacf1424b27a52824c2fbd/app.json#L78

I did (b). Not (a) yet.

> - The new queues also need to be specified here: https://github.com/mozilla/treeherder/blob/1d8db11414501096572fb55e1f5f7b261a067b2a/treeherder/config/settings.py#L328-L329

Done. 3 queues?

> - Update documentation line 62 here: https://github.com/mozilla/treeherder/blob/c9b86aa40305b7f9e607129fd833c27cbac36b15/docs/pulseload.md#L62

Waiting until I'm closer to done.
Comment 17•4 years ago
(In reply to Aki Sasaki [:aki] (he/him) from comment #16)
> Your instructions don't seem to cover this. Is there an easy way to tell which project/repo we're on? That means instead of 3 queues we'll have 6, yes? Or are we not doing this?

You can get the repository from the `job` object, via the ORM. But that might require adding some `select_related` fields when getting the `job` object. I think a better way would be to pass in the `repository` from the `jobs._load_job` function into the call to `_schedule_log_parsing` at line 303 of that file.
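To show the idea with a hypothetical helper (the name and signature below are invented for illustration, not the actual code): `_load_job` already knows the repository, so it can hand the name to `_schedule_log_parsing`, which can then pick a queue without any extra ORM lookup.

```python
SHERIFFED_REPOS = {"autoland", "mozilla-central", "mozilla-beta",
                   "mozilla-release", "mozilla-esr78"}

def queue_name_for(result, repository_name, log_kind):
    """Pick the Celery queue for one log, given data _load_job already has.

    log_kind is "raw" for the live backing log and "json" for the
    *_errorsummary.log artifact (naming here is illustrative only).
    """
    if result == "success":
        return "log_parser"  # the existing queue, left untouched
    tier = "sheriffed" if repository_name in SHERIFFED_REPOS else "unsheriffed"
    return f"log_parser_fail_{log_kind}_{tier}"
```

For example, `queue_name_for("testfailed", "autoland", "json")` would yield `log_parser_fail_json_sheriffed`, while a Try push would land on the corresponding unsheriffed queue.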
> > Obviously, this split will mean 4 separate queues altogether. We will leave the `log_parser` queue alone; it doesn't appear to be hitting any issues.
>
> 4 separate fail queues? Or 4 total queues? I count 3 or 6.

I was thinking just about the fail queues:

- log_parser_fail_raw
- log_parser_fail_json

or

- log_parser_fail_raw_sheriffed
- log_parser_fail_raw_unsheriffed
- log_parser_fail_json_sheriffed
- log_parser_fail_json_unsheriffed

> > We should add logic to the `!= 'success'` section to split it between the raw log and the `errorsummary.log`. I propose creating two new queues, `log_parser_fail_raw` and `log_parser_fail_json` (but other names would work).
>
> Done. At this point we have 3 queues, yes?

Including the non-fail `log_parser`, 3, yes.

> > We need to modify `parse_logs` here: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L24 . My thought is to have `parse_logs` no longer be an async task. The logic in there decides which `parser` function to use for the raw or the json (errorsummary) log. I think we should change this so that the `parser`s themselves are async tasks that get called, placing the task in the appropriate queue.
>
> I'm not understanding this 100%, but I added some guesses to the PR.

Yeah, my apologies, I was having a hard time figuring out how to explain this in the bug without getting overly verbose. :)
Comment 18•4 years ago
(In reply to Aki Sasaki [:aki] (he/him) from comment #16)
> (In reply to Sarah Clements [:sclements] from comment #13)
> > Another thing that was mentioned is either putting a cap on error summary logs - so skipping them if they are too large - and/or limiting the size of the lines we store in the tables. We do that now for the raw summary/live backing logs (in `post_log_artifacts`, and I think also in the parser itself) but not for the error summary logs.
>
> I'm not finding where I'd hook this in... in `parse_logs`?

I think this would need to be in two places (a rough sketch follows below):

1. Check for the size in `store_failure_lines` here: https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/tasks.py#L34 (so doing the same as what `post_log_artifacts` is doing).
2. Truncate the size stored in `write_failure_lines` - this could do the same as what the perfherder parser is doing (for the raw logs): https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/artifactbuilders.py#L46

Although, with 1, if we start skipping more logs - raw or error summaries - because they have suddenly grown too big, then we'll be missing a lot more data. I think a question that hasn't been asked is why some of these logs have suddenly gotten bigger and whether it's necessary for them to be that large. Not sure who we could ask about that.
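A minimal sketch of those two hooks, under assumed limits and simplified signatures (the real `store_failure_lines` and `write_failure_lines` take different arguments; this only shows where a cap and a truncation could sit):

```python
MAX_ERRORSUMMARY_LINES = 100_000   # illustrative cap, not an existing setting
MAX_FAILURE_LINE_LENGTH = 4096     # illustrative cap, not an existing setting

def store_failure_lines(log_lines):
    """Hook 1: skip an errorsummary log that is implausibly large instead of
    parsing it, mirroring what post_log_artifacts does for oversized raw logs."""
    if len(log_lines) > MAX_ERRORSUMMARY_LINES:
        return []
    return write_failure_lines(log_lines)

def write_failure_lines(log_lines):
    """Hook 2: truncate each line before it is stored in the failure_line table."""
    return [line[:MAX_FAILURE_LINE_LENGTH] for line in log_lines]
```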
Comment 19•4 years ago
Here are some questions asked by Sarah Clements and answered by Sebastian Hengst (Aryx):

When the log parsing issue hits, what is the level of urgency?
- Try should be taking tasks all the time if possible: some changes require iterative testing on Try.
- Outside working hours (Fri 6pm PT - Sun 11am PT): we can live with autoland closed during these hours. The issue is the Monday on which a closed tree prevents landing patches for APAC and Europe. They can request the landing, but there might be conflicts between patches, and a big queue of patches can cause second-level effects (e.g. bug 1678836: hours of waiting for tasks to start are possible if there are many pushes in a short time).
- When Firefox needs to do an emergency release ("chemspill") for a security vulnerability exploited in the wild, the plan is to keep the time from the report to shipping the fix under 24h. (The "addongeddon" and the Firefox 82 dot release last month happened on a weekend.) RelMan knows if such a release is being worked on, and since the problem has to be analyzed and the patch developed first, there would be some time to fix a Treeherder issue (provided it has been determined beforehand what the TH issue is and we can estimate how long a fix would take).

What does it mean for the log parser issue?
- Outside working hours (Fri 6pm PT - Sun 11am PT): we can live with it, but it needs to be fixed by the time New Zealand starts to work.

Is there a workaround (looking at raw logs when failure lines aren't available, for example)?
- No, there are too many failing tasks and too many rules to do this manually.

What trees are the highest priority (especially if it happens during a week when all of the MoCo staff are not working and people shouldn't be pushing code)?
- mozilla-release and mozilla-esr have the highest priority; beta should also be regarded as high, followed by autoland and central. Having Try closed causes a productivity loss across the Firefox org; log parsing delays of up to 2 hours for it are acceptable.
- There are some groups which work during MoCo off-time: development contractors (e.g. Igalia - web platform features, SoftVision - Firefox for Android), performance sheriffs (in some cases they need Try to check whether a change caused a regression), and people working on their goals. For comparison, last week had ~30% of the push volume of usual weeks.
Assignee
Comment 20•4 years ago
Initial stab at max error size in https://github.com/mozilla/treeherder/pull/6895 . I'm not sure what limits we should set to avoid dropping valid errors, while protecting the service from bloated error logs.
Assignee
Comment 21•4 years ago
We rolled out the queue split today. I filed bug 1681172 for following up on the max error size, if we decide to do that.
I also have privs to resize dynos.
Updated•3 years ago