Closed Bug 1679162 Opened 4 years ago Closed 4 years ago

Log parsing is not working, shows as log parsing incomplete

Categories

(Tree Management :: Treeherder, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nataliaCs, Assigned: mozilla)

References

(Blocks 1 open bug)

Details

The situation is unchanged:

  • Logs for failed tasks are not getting parsed. The oldest unparsed tasks completed at 10:18 UTC, the latest parsed one from 10:37 UTC: https://sql.telemetry.mozilla.org/queries/76527/source
  • There are memory alerts for the log parser for failed tasks in the #treeherder-ops channel: "Memory R15 (not web dyno)" search found 1 match in the past minute — https://papertrailapp.com/searches/87645472 - Nov 24 16:05:59 treeherder-prototype heroku/worker_log_parser_fail.2: Error R15 (Memory quota vastly exceeded)
  • All Treeherder instances are affected:
    ** production
    ** staging
    ** prototype
    ** prototype2
    Is parsing done per TH instance? If so, it must be the data in the logs that trips all of them.
Severity: -- → S1
Flags: needinfo?(sclements)
Flags: needinfo?(mtabara)
Flags: needinfo?(cdawson)

Jon Buckley increased the instance size to the maximum and the backlog of unparsed logs decreased to ~280 (from ~1k) but the parser stopped making progress. He also noticed these error messages:

    2020-11-24T17:09:35.537321+00:00 app[worker_log_parser_fail.1]: [2020-11-24 17:09:35,535: WARNING/ForkPoolWorker-1] Got OperationalError inserting failure_line
    2020-11-24T17:09:40.082494+00:00 app[worker_log_parser_fail.1]: [2020-11-24 17:09:40,082] INFO [treeherder.log_parser.tasks:51] Skipping parsing for job 519884494 since log already processed.  Log Status: 1

Trees remain closed except Try, which is getting reopened.

With the maximum instance size, the parser has read all the messages and processed the logs where possible. Logs for some autoland tasks (likely all containing check-in related failures) could not be processed, e.g. https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&selectedTaskRun=bTot5GulRTmR1DKRcF2KBQ.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=f6f29c9cc744bff2c94903bed8b625c8cbac9118

Trees have been open for an hour without hitting the issue again.

Summary: [CLOSED TREES] Log parsing is not working, shows as log parsing incomplete → Log parsing is not working, shows as log parsing incomplete

Autoland is closed for the log parsing issue.

We should try to figure out when this started.

I opened Autoland ~50 minutes ago. Log parsing works as usual.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #6)

Marco pointed out some logs got huge, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/b8TA_bTSRWWa6U5qn1Ykgg

It does show in Treeherder that the log was skipped though: https://treeherder.mozilla.org/jobs?repo=autoland&revision=f6f29c9cc744bff2c94903bed8b625c8cbac9118&selectedTaskRun=b8TA_bTSRWWa6U5qn1Ykgg.0

It would help to know exactly when this was first noticed - the first affected job - but that might not be the smoking gun.

We have a post-mortem scheduled for tomorrow. Here's the doc: https://docs.google.com/document/d/1umtnnuXci2X8aOdN6cPGH3DQrWJiRqyfgPFsG7CF5uo/edit#

Flags: needinfo?(cdawson)

In the post-mortem, we discussed two tasks to mitigate the impact of memory growth due to large log lines. We already have two queues, log_parser and log_parser_fail, which are split according to the result of a task. But we want to split log_parser_fail down further.

  1. Separate the processing of the raw log and the _errorsummary.log to separate queues and workers
  2. Separate log processing by repository. One queue for the following repositories, and one queue for everything else:
    autoland
    mozilla-central
    mozilla-beta
    mozilla-release
    mozilla-esr78

Obviously, this split will mean 4 separate queues altogether. We will leave the log_parser queue alone. It doesn't appear to be hitting any issues.

There is some technical debt in this area, but I will outline here the steps that would be taken to implement this.

  1. The main technical debt is most likely just code that can be ripped out: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L80-L95, plus all the functions that code calls which are not used elsewhere. We would also remove the two queues log_autoclassify and log_autoclassify_fail.
  2. This is where we decide whether to handle the logs for a task in the fail queue or the normal one: https://github.com/mozilla/treeherder/blob/d251842d30b61dc41070eb2b5f14bd8f18fb2bc1/treeherder/etl/jobs.py#L337-L344
  3. We should add logic to the != 'success' section to split it between the raw log and the errorsummary.log. I propose creating two new queues, log_parser_fail_raw and log_parser_fail_json (but other names would work).
  4. We need to modify parse_logs here: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L24. My thought is to have parse_logs no longer be an async task. The logic in there decides which parser function to use for the raw or the json (errorsummary) log. I think we should change this so that the parsers themselves become async tasks that get called, with each task placed in the appropriate queue (see the sketch below).
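
For illustration, a minimal sketch of the routing described in steps 2-4, assuming hypothetical task names (parse_raw_log, parse_json_log) and a simple check on the log name; this is not the actual Treeherder code:

    # Hypothetical sketch only: task names, queue names, and the job_log.name
    # check are assumptions, not the real implementation.
    def _schedule_log_parsing(job, job_logs, result):
        for job_log in job_logs:
            if result == 'success':
                parse_log.apply_async(args=[job.id, job_log.id], queue='log_parser')
            elif job_log.name == 'errorsummary_json':
                # errorsummary (JSON) logs of failed tasks get their own queue/workers
                parse_json_log.apply_async(args=[job.id, job_log.id], queue='log_parser_fail_json')
            else:
                # raw/live backing logs of failed tasks
                parse_raw_log.apply_async(args=[job.id, job_log.id], queue='log_parser_fail_raw')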

Another thing that was mentioned is putting a cap on error summary logs: skipping them if they are too large and/or limiting the size of the lines we store in the tables. We already do that for the raw summary/live backing logs (in post_log_artifacts and, I think, also in the parser itself) but not for the error summary logs.
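
A minimal sketch of the per-line cap being suggested, with an assumed constant and helper name (not existing Treeherder settings):

    MAX_LINE_LENGTH = 4096  # assumed cap; the real limit would need tuning

    def truncate_line(line):
        # Trim a single failure line before it is written to the database.
        if len(line) > MAX_LINE_LENGTH:
            return line[:MAX_LINE_LENGTH] + ' [truncated]'
        return line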

Edit: but otherwise, that proposal sounds good :)

Following on comment 12, I forgot to add a few more steps:

  1. This file is where we declare the worker (dyno) allocation for the new queues: https://github.com/mozilla/treeherder/blob/7587fb224e800fa7ecab7dd39ac5db43bab33ea0/Procfile#L46-L47 . Lines will need to be added there to enable handling of the new queues (see the sketch after this list).
    a. Once a PR has been deployed, the worker will be created, but not yet provisioned. You'll need to go into Heroku (https://dashboard.heroku.com/apps/treeherder-prototype/resources) and set the type and number of dynos.
    b. Also here: https://github.com/mozilla/treeherder/blob/22d854be68a3ef5884aacf1424b27a52824c2fbd/app.json#L78
  2. The new queues also need to be specified here: https://github.com/mozilla/treeherder/blob/1d8db11414501096572fb55e1f5f7b261a067b2a/treeherder/config/settings.py#L328-L329
  3. Update documentation line 62 here: https://github.com/mozilla/treeherder/blob/c9b86aa40305b7f9e607129fd833c27cbac36b15/docs/pulseload.md#L62
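
For step 1, the new Procfile entries might look roughly like the lines below; the process names, celery invocation, and concurrency flags are assumptions for illustration, not the actual Procfile contents:

    worker_log_parser_fail_raw: celery -A treeherder worker -Q log_parser_fail_raw --concurrency=1
    worker_log_parser_fail_json: celery -A treeherder worker -Q log_parser_fail_json --concurrency=1
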
Assignee: nobody → aki

WIP: https://github.com/mozilla/treeherder/pull/6891

(In reply to Jesse Schwartzentruber (:truber) from comment #17)

(In reply to Aki Sasaki [:aki] (he/him) from comment #16)

How do we currently schedule fuzzing tasks?
(In reply to Cameron Dawson [:camd] from comment #12)
> In the post-mortem, we discussed two tasks to mitigate the impact of memory growth due to large log lines. We already have two queues: log_parser and log_parser_fail which split according to the result of a task. But we want to split the log_parser_fail down further.

>   1. Separate the processing of the raw log and the _errorsummary.log to separate queues and workers

Your instructions seem to cover this very well.

>   2. Separate log processing by the repository. One queue for the following. One queue for everything else:
>     autoland
>     mozilla-central
>     mozilla-beta
>     mozilla-release
>     mozilla-esr78

Your instructions don't seem to cover this. Is there an easy way to tell which project/repo we're on? That means instead of 3 queues we'll have 6, yes? Or are we not doing this?

> Obviously, this split will mean 4 separate queues altogether. We will leave the log_parser queue alone. It doesn't appear to be hitting any issues.

4 separate fail queues? or 4 total queues? I count 3 or 6.

> There is some technical debt in this area. But I will outline here the steps that would be taken to implement this.
>
>   1. The main technical debt that is most likely just code that can be ripped out is https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L80-L95. And all the functions that this code calls that are not used elsewhere. We would also remove the two queues of log_autoclassify and log_autoclassify_fail.

Done.

>   2. This is where we decide whether to handle the logs for a task in the fail queue, or the normal one: https://github.com/mozilla/treeherder/blob/d251842d30b61dc41070eb2b5f14bd8f18fb2bc1/treeherder/etl/jobs.py#L337-L344
>   3. We should add logic to the != 'success' section to split it between the raw log and the errorsummary.log. I propose creating two new queues of log_parser_fail_raw and log_parser_fail_json (but other names would work).

Done. At this point we have 3 queues, yes?

>   4. We need to modify parse_logs here: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L24 . My thought is to have parse_logs no longer be an async task. The logic in there decides what parser function to use for the raw or the json (errorsummary) log. I think changing this so that the parsers are now async tasks that get called, placing the task in the appropriate queue.

I'm not understanding this 100% but I added some guesses to the PR.

(In reply to Sarah Clements [:sclements] from comment #13)

> Another thing that was mentioned is either putting a cap on error summary logs - so skipping them if they are too large and/or limiting the size of the line we store in the tables. We do that now for the raw summary/live backing logs (in post_log_artifacts and I think also in the parser itself) but not for the error summary logs.

I'm not finding where I'd hook this in... in parse_logs ?

(In reply to Cameron Dawson [:camd] from comment #14)

> Following on comment 12, I forgot to add a few more steps:
>
>   1. This file is where we declare the worker (dyno) allocation for the new queues: https://github.com/mozilla/treeherder/blob/7587fb224e800fa7ecab7dd39ac5db43bab33ea0/Procfile#L46-L47 . So lines will need to be added here to enable handling of the new queues.

Done.

>     a. Once a PR has been deployed, the worker will be created, but not yet provisioned. You'll need to go into Heroku (https://dashboard.heroku.com/apps/treeherder-prototype/resources) and set the type and number of dynos.
>     b. Also here: https://github.com/mozilla/treeherder/blob/22d854be68a3ef5884aacf1424b27a52824c2fbd/app.json#L78

I did (b). Not (a) yet.

>   2. The new queues also need to be specified here: https://github.com/mozilla/treeherder/blob/1d8db11414501096572fb55e1f5f7b261a067b2a/treeherder/config/settings.py#L328-L329

Done. 3 queues?

>   3. Update documentation line 62 here: https://github.com/mozilla/treeherder/blob/c9b86aa40305b7f9e607129fd833c27cbac36b15/docs/pulseload.md#L62

Waiting til I'm closer to done.

Flags: needinfo?(sclements)
Flags: needinfo?(cdawson)

(In reply to Aki Sasaki [:aki] (he/him) from comment #16)

> Your instructions don't seem to cover this. Is there an easy way to tell which project/repo we're on? That means instead of 3 queues we'll have 6, yes? Or are we not doing this?

You can get the repository from the job object, via the ORM. But that might require adding some select_related fields when getting the job object. I think a better way would be to pass in the repository from the jobs._load_job function into the call to _schedule_log_parsing at line 303 of that file.
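
A hypothetical sketch of that approach; the signature change, the repository set, and the queue-name suffixes are illustrative only, not the actual code:

    SHERIFFED_REPOS = {'autoland', 'mozilla-central', 'mozilla-beta',
                       'mozilla-release', 'mozilla-esr78'}

    # repository would be handed down from _load_job so no extra ORM lookup is needed
    def _schedule_log_parsing(job, job_logs, result, repository):
        suffix = 'sheriffed' if repository.name in SHERIFFED_REPOS else 'unsheriffed'
        fail_raw_queue = f'log_parser_fail_raw_{suffix}'
        fail_json_queue = f'log_parser_fail_json_{suffix}'
        ...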

> > Obviously, this split will mean 4 separate queues altogether. We will leave the log_parser queue alone. It doesn't appear to be hitting any issues.

> 4 separate fail queues? or 4 total queues? I count 3 or 6.

I was thinking just about the fail queues:

  1. log_parser_fail_raw
  2. log_parser_fail_json

or

  1. log_parser_fail_raw_sheriffed
  2. log_parser_fail_raw_unsheriffed
  3. log_parser_fail_json_sheriffed
  4. log_parser_fail_json_unsheriffed

> >   3. We should add logic to the != 'success' section to split it between the raw log and the errorsummary.log. I propose creating two new queues of log_parser_fail_raw and log_parser_fail_json (but other names would work).

> Done. At this point we have 3 queues, yes?
Including the non-fail log_parser, 3 yes.

> >   4. We need to modify parse_logs here: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L24 . My thought is to have parse_logs no longer be an async task. The logic in there decides what parser function to use for the raw or the json (errorsummary) log. I think changing this so that the parsers are now async tasks that get called, placing the task in the appropriate queue.

> I'm not understanding this 100% but I added some guesses to the PR.

Yeah, my apologies, I was having a hard time figuring out how to explain this in the bug without getting overly verbose. :).

Flags: needinfo?(cdawson)

(In reply to Aki Sasaki [:aki] (he/him) from comment #16)

> (In reply to Sarah Clements [:sclements] from comment #13)

> > Another thing that was mentioned is either putting a cap on error summary logs - so skipping them if they are too large and/or limiting the size of the line we store in the tables. We do that now for the raw summary/live backing logs (in post_log_artifacts and I think also in the parser itself) but not for the error summary logs.

> I'm not finding where I'd hook this in... in parse_logs ?

I think this would need to be in two places:

  1. check for the size in store_failure_lines here: https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/tasks.py#L34 (so doing the same as what post_log_artifacts is doing)
  2. Truncating the size stored in write_failure_lines - could do the same as what the perfherder parser is doing (for the raw logs): https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/artifactbuilders.py#L46
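
A rough sketch of the check in (1), with an assumed threshold and a Content-Length lookup standing in for however the size would actually be read:

    import logging
    import requests

    logger = logging.getLogger(__name__)

    MAX_ERROR_SUMMARY_BYTES = 5 * 1024 * 1024  # assumed threshold, would need tuning

    def store_failure_lines(job_log):
        # Skip oversized errorsummary logs instead of parsing them.
        response = requests.head(job_log.url)
        size = int(response.headers.get('Content-Length', 0))
        if size > MAX_ERROR_SUMMARY_BYTES:
            logger.warning('Skipping failure lines for %s: %d bytes exceeds cap',
                           job_log.url, size)
            return
        ...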

Although, with (1), if we start skipping more logs - raw or error summaries - because they have suddenly grown too big, then we'll be missing a lot more data. A question that hasn't been asked yet is why some of these logs have suddenly gotten bigger and whether it's necessary for them to be that large. Not sure who we could ask about that.

Flags: needinfo?(sclements)

Here are some questions asked by Sarah Clements and answered by Sebastian Hengst (Aryx):

When the log parsing issue hits, what is the level of urgency?

  • Try should be taking tasks all the time if possible: some changes require iterative testing on Try
  • outside working hours (Fri 6pm PT - Sun 11am PT): We can live with autoland closed during these hours. The issue is Monday, when a closed tree prevents APAC and Europe from landing patches. They can request the landing, but there might be conflicts between patches, and a big queue of patches can cause second-level effects (e.g. bug 1678836: hours of waiting for tasks to start if there are many pushes in a short time).
  • When Firefox needs to do an emergency release ("chemspill") for a security vulnerability exploited in the wild, the plan is to get from the report to shipping the fix in less than 24h. (The "addongeddon" and the Firefox 82 dot release last month happened on a weekend.) RelMan knows if such a release is being worked on, and since the problem has to be analyzed and the patch developed first, there would be some time to fix a Treeherder issue (provided it has already been determined what the TH issue is and we can estimate how long a fix would take).

What does this mean for the log parser issue?

  • outside working hours (Fri 6pm PT - Sun 11am PT): We can live with it but need it fixed when New Zealand starts to work.

Is there a workaround (looking at raw logs when failure lines aren't available, for example)?

  • No, there are too many failing tasks and too many rules to do this manually.

Which trees are the highest priority (especially if it happens during a week when all of the MoCo staff are not working and people shouldn't be pushing code)?

  • mozilla-release and mozilla-esr have the highest priority; beta should also be regarded as high priority, followed by autoland and central. Having Try closed causes a productivity loss across the Firefox org; log parsing delays of up to 2 hours on Try are acceptable.
  • There are some groups which work during MoCo time off: development contractors (e.g. Igalia - web platform features, SoftVision - Firefox for Android), performance sheriffs (in some cases they need Try to check whether a change caused a regression), and people working on their goals. For comparison, last week had ~30% of the push volume of usual weeks.
See Also: → 1594822

Initial stab at max error size in https://github.com/mozilla/treeherder/pull/6895 . I'm not sure what limits we should set to avoid dropping valid errors, while protecting the service from bloated error logs.

Blocks: 1681172

We rolled out the queue split today. I filed bug 1681172 for following up on the max error size, if we decide to do that.
I also have privs to resize dynos.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(mtabara)
Resolution: --- → FIXED
No longer regressions: 1707635
Component: Treeherder: Log Parsing & Classification → TreeHerder