Closed Bug 1679162 Opened 4 years ago Closed 4 years ago

Log parsing is not working, shows as log parsing incomplete

Categories

(Tree Management :: Treeherder, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nataliaCs, Assigned: mozilla)

References

(Blocks 1 open bug)

Details

The situation is unchanged:

  • Logs for failed tasks are not getting parsed. The oldest unparsed tasks completed at 10:18 UTC, the latest parsed one from 10:37 UTC: https://sql.telemetry.mozilla.org/queries/76527/source
  • There are memory alerts for the log parser for failed tasks in the #treeherder-ops channel: "Memory R15 (not web dyno)" search found 1 match in the past minute — https://papertrailapp.com/searches/87645472 - Nov 24 16:05:59 treeherder-prototype heroku/worker_log_parser_fail.2: Error R15 (Memory quota vastly exceeded)
  • All Treeherder instances are affected:
    ** production
    ** staging
    ** prototype
    ** prototype2
    Is parsing done per TH instance? If so, it must be the data in the logs that trips all of them.
Severity: -- → S1
Flags: needinfo?(sclements)
Flags: needinfo?(mtabara)
Flags: needinfo?(cdawson)

Jon Buckley increased the instance size to the maximum and the backlog of unparsed logs decreased to ~280 (from ~1k) but the parser stopped making progress. He also noticed these error messages:

    2020-11-24T17:09:35.537321+00:00 app[worker_log_parser_fail.1]: [2020-11-24 17:09:35,535: WARNING/ForkPoolWorker-1] Got OperationalError inserting failure_line
    2020-11-24T17:09:40.082494+00:00 app[worker_log_parser_fail.1]: [2020-11-24 17:09:40,082] INFO [treeherder.log_parser.tasks:51] Skipping parsing for job 519884494 since log already processed.  Log Status: 1

Trees remain closed except Try, which is getting reopened.

With the maximum instance size, the parser has read all the messages and processed the logs where possible. Logs for some autoland tasks (likely all containing check-in related failures) could not be processed, e.g. https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&selectedTaskRun=bTot5GulRTmR1DKRcF2KBQ.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=f6f29c9cc744bff2c94903bed8b625c8cbac9118

Trees have been open for an hour without hitting the issue again.

Summary: [CLOSED TREES] Log parsing is not working, shows as log parsing incomplete → Log parsing is not working, shows as log parsing incomplete

Autoland is closed for the log parsing issue.

We should try to figure out when this started.

I opened Autoland ~50 minutes ago. Log parsing works as usual.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #6)

Marco pointed out some logs got huge, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/b8TA_bTSRWWa6U5qn1Ykgg

It does show in Treeherder that the log was skipped though: https://treeherder.mozilla.org/jobs?repo=autoland&revision=f6f29c9cc744bff2c94903bed8b625c8cbac9118&selectedTaskRun=b8TA_bTSRWWa6U5qn1Ykgg.0

It would help to know exactly when this was first noticed - the first affected job - but that might not be the smoking gun.

We have a post-mortem scheduled for tomorrow. Here's the doc: https://docs.google.com/document/d/1umtnnuXci2X8aOdN6cPGH3DQrWJiRqyfgPFsG7CF5uo/edit#

Flags: needinfo?(cdawson)

In the post-mortem, we discussed two tasks to mitigate the impact of memory growth due to large log lines. We already have two queues, log_parser and log_parser_fail, which are split according to the result of a task. But we want to split log_parser_fail down further.

  1. Separate the processing of the raw log and the _errorsummary.log to separate queues and workers
  2. Separate log processing by repository. One queue for the following repositories, and one queue for everything else:
    autoland
    mozilla-central
    mozilla-beta
    mozilla-release
    mozilla-esr78

Obviously, this split will mean 4 separate queues altogether. We will leave the log_parser queue alone. It doesn't appear to be hitting any issues.

There is some technical debt in this area, but I will outline here the steps that would be taken to implement this.

  1. The main technical debt is most likely just code that can be ripped out: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L80-L95, plus all the functions that code calls which are not used elsewhere. We would also remove the two queues log_autoclassify and log_autoclassify_fail.
  2. This is where we decide whether to handle the logs for a task in the fail queue or the normal one: https://github.com/mozilla/treeherder/blob/d251842d30b61dc41070eb2b5f14bd8f18fb2bc1/treeherder/etl/jobs.py#L337-L344
  3. We should add logic to the != 'success' section to split it between the raw log and the errorsummary.log. I propose creating two new queues, log_parser_fail_raw and log_parser_fail_json (but other names would work).
  4. We need to modify parse_logs here: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L24. My thought is to have parse_logs no longer be an async task. The logic in there decides which parser function to use for the raw or the json (errorsummary) log. I think we should change this so that the parsers themselves become async tasks that get called, with each task placed in the appropriate queue (see the sketch below).
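
For illustration, a minimal sketch of the routing described in steps 2-4, assuming hypothetical task names (parse_raw_log, parse_json_log) and a simple check on the log name; this is not the actual Treeherder code:

    # Hypothetical sketch only: task names, queue names, and the job_log.name
    # check are assumptions, not the real implementation.
    def _schedule_log_parsing(job, job_logs, result):
        for job_log in job_logs:
            if result == 'success':
                parse_log.apply_async(args=[job.id, job_log.id], queue='log_parser')
            elif job_log.name == 'errorsummary_json':
                # errorsummary (JSON) logs of failed tasks get their own queue/workers
                parse_json_log.apply_async(args=[job.id, job_log.id], queue='log_parser_fail_json')
            else:
                # raw/live backing logs of failed tasks
                parse_raw_log.apply_async(args=[job.id, job_log.id], queue='log_parser_fail_raw')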

Another thing that was mentioned is putting a cap on error summary logs: skipping them if they are too large and/or limiting the size of the lines we store in the tables. We already do that for the raw summary/live backing logs (in post_log_artifacts and, I think, also in the parser itself) but not for the error summary logs.
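
A minimal sketch of the per-line cap being suggested, with an assumed constant and helper name (not existing Treeherder settings):

    MAX_LINE_LENGTH = 4096  # assumed cap; the real limit would need tuning

    def truncate_line(line):
        # Trim a single failure line before it is written to the database.
        if len(line) > MAX_LINE_LENGTH:
            return line[:MAX_LINE_LENGTH] + ' [truncated]'
        return line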

Edit: but otherwise, that proposal sounds good :)

Following on comment 12, I forgot to add a few more steps:

  1. This file is where we declare the worker (dyno) allocation for the new queues: https://github.com/mozilla/treeherder/blob/7587fb224e800fa7ecab7dd39ac5db43bab33ea0/Procfile#L46-L47 . Lines will need to be added there to enable handling of the new queues (see the sketch after this list).
    a. Once a PR has been deployed, the worker will be created, but not yet provisioned. You'll need to go into Heroku (https://dashboard.heroku.com/apps/treeherder-prototype/resources) and set the type and number of dynos.
    b. Also here: https://github.com/mozilla/treeherder/blob/22d854be68a3ef5884aacf1424b27a52824c2fbd/app.json#L78
  2. The new queues also need to be specified here: https://github.com/mozilla/treeherder/blob/1d8db11414501096572fb55e1f5f7b261a067b2a/treeherder/config/settings.py#L328-L329
  3. Update documentation line 62 here: https://github.com/mozilla/treeherder/blob/c9b86aa40305b7f9e607129fd833c27cbac36b15/docs/pulseload.md#L62
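
For step 1, the new Procfile entries might look roughly like the lines below; the process names, celery invocation, and concurrency flags are assumptions for illustration, not the actual Procfile contents:

    worker_log_parser_fail_raw: celery -A treeherder worker -Q log_parser_fail_raw --concurrency=1
    worker_log_parser_fail_json: celery -A treeherder worker -Q log_parser_fail_json --concurrency=1
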
Assignee: nobody → aki

WIP: https://github.com/mozilla/treeherder/pull/6891

(In reply to Jesse Schwartzentruber (:truber) from comment #17)

(In reply to Aki Sasaki [:aki] (he/him) from comment #16)

How do we currently schedule fuzzing tasks?
(In reply to Cameron Dawson [:camd] from comment #12)
> In the post-mortem, we discussed two tasks to mitigate the impact of memory growth due to large log lines. We already have two queues: log_parser and log_parser_fail which split according to the result of a task. But we want to split the log_parser_fail down further.

>   1. Separate the processing of the raw log and the _errorsummary.log to separate queues and workers

Your instructions seem to cover this very well.

>   2. Separate log processing by the repository. One queue for the following. One queue for everything else:
>     autoland
>     mozilla-central
>     mozilla-beta
>     mozilla-release
>     mozilla-esr78

Your instructions don't seem to cover this. Is there an easy way to tell which project/repo we're on? That means instead of 3 queues we'll have 6, yes? Or are we not doing this?

> Obviously, this split will mean 4 separate queues altogether. We will leave the log_parser queue alone. It doesn't appear to be hitting any issues.

4 separate fail queues? or 4 total queues? I count 3 or 6.

> There is some technical debt in this area. But I will outline here the steps that would be taken to implement this.
>
>   1. The main technical debt that is most likely just code that can be ripped out is https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L80-L95. And all the functions that this code calls that are not used elsewhere. We would also remove the two queues of log_autoclassify and log_autoclassify_fail.

Done.

>   2. This is where we decide whether to handle the logs for a task in the fail queue, or the normal one: https://github.com/mozilla/treeherder/blob/d251842d30b61dc41070eb2b5f14bd8f18fb2bc1/treeherder/etl/jobs.py#L337-L344
>   3. We should add logic to the != 'success' section to split it between the raw log and the errorsummary.log. I propose creating two new queues of log_parser_fail_raw and log_parser_fail_json (but other names would work).

Done. At this point we have 3 queues, yes?

>   4. We need to modify parse_logs here: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L24 . My thought is to have parse_logs no longer be an async task. The logic in there decides what parser function to use for the raw or the json (errorsummary) log. I think changing this so that the parsers are now async tasks that get called, placing the task in the appropriate queue.

I'm not understanding this 100% but I added some guesses to the PR.

(In reply to Sarah Clements [:sclements] from comment #13)

> Another thing that was mentioned is either putting a cap on error summary logs - so skipping them if they are too large and/or limiting the size of the line we store in the tables. We do that now for the raw summary/live backing logs (in post_log_artifacts and I think also in the parser itself) but not for the error summary logs.

I'm not finding where I'd hook this in... in parse_logs ?

(In reply to Cameron Dawson [:camd] from comment #14)

> Following on comment 12, I forgot to add a few more steps:
>
>   1. This file is where we declare the worker (dyno) allocation for the new queues: https://github.com/mozilla/treeherder/blob/7587fb224e800fa7ecab7dd39ac5db43bab33ea0/Procfile#L46-L47 . So lines will need to be added here to enable handling of the new queues.

Done.

>     a. Once a PR has been deployed, the worker will be created, but not yet provisioned. You'll need to go into Heroku (https://dashboard.heroku.com/apps/treeherder-prototype/resources) and set the type and number of dynos.
>     b. Also here: https://github.com/mozilla/treeherder/blob/22d854be68a3ef5884aacf1424b27a52824c2fbd/app.json#L78

I did (b). Not (a) yet.

>   2. The new queues also need to be specified here: https://github.com/mozilla/treeherder/blob/1d8db11414501096572fb55e1f5f7b261a067b2a/treeherder/config/settings.py#L328-L329

Done. 3 queues?

>   3. Update documentation line 62 here: https://github.com/mozilla/treeherder/blob/c9b86aa40305b7f9e607129fd833c27cbac36b15/docs/pulseload.md#L62

Waiting til I'm closer to done.

Flags: needinfo?(sclements)
Flags: needinfo?(cdawson)

(In reply to Aki Sasaki [:aki] (he/him) from comment #16)

> Your instructions don't seem to cover this. Is there an easy way to tell which project/repo we're on? That means instead of 3 queues we'll have 6, yes? Or are we not doing this?

You can get the repository from the job object, via the ORM. But that might require adding some select_related fields when getting the job object. I think a better way would be to pass in the repository from the jobs._load_job function into the call to _schedule_log_parsing at line 303 of that file.
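
A hypothetical sketch of that approach; the signature change, the repository set, and the queue-name suffixes are illustrative only, not the actual code:

    SHERIFFED_REPOS = {'autoland', 'mozilla-central', 'mozilla-beta',
                       'mozilla-release', 'mozilla-esr78'}

    # repository would be handed down from _load_job so no extra ORM lookup is needed
    def _schedule_log_parsing(job, job_logs, result, repository):
        suffix = 'sheriffed' if repository.name in SHERIFFED_REPOS else 'unsheriffed'
        fail_raw_queue = f'log_parser_fail_raw_{suffix}'
        fail_json_queue = f'log_parser_fail_json_{suffix}'
        ...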

> > Obviously, this split will mean 4 separate queues altogether. We will leave the log_parser queue alone. It doesn't appear to be hitting any issues.

> 4 separate fail queues? or 4 total queues? I count 3 or 6.

I was thinking just about the fail queues:

  1. log_parser_fail_raw
  2. log_parser_fail_json

or

  1. log_parser_fail_raw_sheriffed
  2. log_parser_fail_raw_unsheriffed
  3. log_parser_fail_json_sheriffed
  4. log_parser_fail_json_unsheriffed

> >   3. We should add logic to the != 'success' section to split it between the raw log and the errorsummary.log. I propose creating two new queues of log_parser_fail_raw and log_parser_fail_json (but other names would work).

> Done. At this point we have 3 queues, yes?
Including the non-fail log_parser, 3 yes.

> >   4. We need to modify parse_logs here: https://github.com/mozilla/treeherder/blob/507881a7b8b1062ac03ac733fde43cc04d3fccfc/treeherder/log_parser/tasks.py#L24 . My thought is to have parse_logs no longer be an async task. The logic in there decides what parser function to use for the raw or the json (errorsummary) log. I think changing this so that the parsers are now async tasks that get called, placing the task in the appropriate queue.

> I'm not understanding this 100% but I added some guesses to the PR.

Yeah, my apologies, I was having a hard time figuring out how to explain this in the bug without getting overly verbose. :).

Flags: needinfo?(cdawson)

(In reply to Aki Sasaki [:aki] (he/him) from comment #16)

> (In reply to Sarah Clements [:sclements] from comment #13)

> > Another thing that was mentioned is either putting a cap on error summary logs - so skipping them if they are too large and/or limiting the size of the line we store in the tables. We do that now for the raw summary/live backing logs (in post_log_artifacts and I think also in the parser itself) but not for the error summary logs.

> I'm not finding where I'd hook this in... in parse_logs ?

I think this would need to be in two places:

  1. check for the size in store_failure_lines here: https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/tasks.py#L34 (so doing the same as what post_log_artifacts is doing)
  2. Truncating the size stored in write_failure_lines - could do the same as what the perfherder parser is doing (for the raw logs): https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/artifactbuilders.py#L46
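
A rough sketch of the check in (1), with an assumed threshold and a Content-Length lookup standing in for however the size would actually be read:

    import logging
    import requests

    logger = logging.getLogger(__name__)

    MAX_ERROR_SUMMARY_BYTES = 5 * 1024 * 1024  # assumed threshold, would need tuning

    def store_failure_lines(job_log):
        # Skip oversized errorsummary logs instead of parsing them.
        response = requests.head(job_log.url)
        size = int(response.headers.get('Content-Length', 0))
        if size > MAX_ERROR_SUMMARY_BYTES:
            logger.warning('Skipping failure lines for %s: %d bytes exceeds cap',
                           job_log.url, size)
            return
        ...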

Although, with (1), if we start skipping more logs - raw or error summaries - because they have suddenly grown too big, then we'll be missing a lot more data. A question that hasn't been asked yet is why some of these logs have suddenly gotten bigger and whether it's necessary for them to be that large. Not sure who we could ask about that.

Flags: needinfo?(sclements)

Here are some questions asked by Sarah Clements and answered by Sebastian Hengst (Aryx):

When the log parsing issue hits, what is the level of urgency?

  • Try should be taking tasks all the time if possible: some changes require iterative testing on Try
  • outside working hours (Fri 6pm PT - Sun 11am PT): We can live with autoland closed during these hours. The issue is Monday, when a closed tree prevents APAC and Europe from landing patches. They can request the landing, but there might be conflicts between patches, and a big queue of patches can cause second-level effects (e.g. bug 1678836: hours of waiting for tasks to start if there are many pushes in a short time).
  • When Firefox needs to do an emergency release ("chemspill") for a security vulnerability exploited in the wild, the plan is to get from the report to shipping the fix in less than 24h. (The "addongeddon" and the Firefox 82 dot release last month happened on a weekend.) RelMan knows if such a release is being worked on, and since the problem has to be analyzed and the patch developed first, there would be some time to fix a Treeherder issue (provided it has already been determined what the TH issue is and we can estimate how long a fix would take).

What does this mean for the log parser issue?

  • outside working hours (Fri 6pm PT - Sun 11am PT): We can live with it but need it fixed when New Zealand starts to work.

Is there a workaround (looking at raw logs when failure lines aren't available, for example)?

  • No, there are too many failing tasks and too many rules to do this manually.

Which trees are the highest priority (especially if it happens during a week when all of the MoCo staff are not working and people shouldn't be pushing code)?

  • mozilla-release and mozilla-esr have the highest priority; beta should also be regarded as high priority, followed by autoland and central. Having Try closed causes a productivity loss across the Firefox org; log parsing delays of up to 2 hours on Try are acceptable.
  • There are some groups which work during MoCo time off: development contractors (e.g. Igalia - web platform features, SoftVision - Firefox for Android), performance sheriffs (in some cases they need Try to check whether a change caused a regression), and people working on their goals. For comparison, last week had ~30% of the push volume of usual weeks.
See Also: → 1594822

Initial stab at max error size in https://github.com/mozilla/treeherder/pull/6895 . I'm not sure what limits we should set to avoid dropping valid errors, while protecting the service from bloated error logs.

Blocks: 1681172

We rolled out the queue split today. I filed bug 1681172 for following up on the max error size, if we decide to do that.
I also have privs to resize dynos.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(mtabara)
Resolution: --- → FIXED
No longer regressions: 1707635
Component: Treeherder: Log Parsing & Classification → TreeHerder