Open Bug 1763795 Opened 2 years ago Updated 2 years ago

Log parsing backlogged, logs waiting to get parsed

Categories

(Tree Management :: Treeherder, defect)


Tracking

(Not tracked)

REOPENED

People

(Reporter: NarcisB, Unassigned)

Details

Attachments

(2 files)

We're seeing a large number of failures for which only the raw logs are available.

Failures appear as having log parsing incomplete:
e.g. push -> https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=YxoXv0xUT5y-dOC7noK80g.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=778d89a3d0f04691dd19616ab3288eda7b4f41be

Closed autoland and raised the issue on Slack -> #Treeherder and #FirefoxCI

Flags: needinfo?(kkleemola)
Flags: needinfo?(bwells)

Log parser has processed the backlog.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(kkleemola)
Flags: needinfo?(bwells)
Resolution: --- → INCOMPLETE

We have repeatedly seen log parsing get backlogged, most recently last Friday for 90 minutes. The root cause is unknown.

SQL queries have also been generally slow these days:

The root cause is still unknown.

Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Summary: Log parsing is not working, shows as Log parsing not complete → Log parsing backlogged, logs waiting to get parsed

We hit the issue again and GCP shows a MySQL storage increase of 50% (choose 'Storage usage' from the dropdown). Katie, could you investigate what is happening here, e.g. compare the sizes of the tables with the ones in a backup? Thank you in advance.

Flags: needinfo?(kkleemola)

We are again encountering the issue with logs for unsheriffed repositories (e.g. 'try'). This query shows the database knows about the latest tasks and the logs are marked as parsed in the database. The db increased its storage usage by ~500GB in the last 2 hours. Two more increases like that and we hit the limit for the database size as defined in GCP.

I killed about 300 queries that had been running for over an hour, which stabilized the db.

The query in question is:

SELECT `job`.`submit_time`, `job`.`start_time`, `job`.`end_time`,
       `job`.`failure_classification_id`, `job`.`id`,
       `job_group`.`name`, `job_group`.`symbol`,
       `job_type`.`name`, `job_type`.`symbol`,
       `job`.`last_modified`, `job`.`option_collection_hash`,
       `machine_platform`.`platform`, `job`.`option_collection_hash`,
       `job`.`push_id`, `push`.`revision`, `job`.`result`,
       `reference_data_signatures`.`signature`, `job`.`state`, `job`.`tier`,
       `taskcluster_metadata`.`task_id`, `taskcluster_metadata`.`retry_id`
FROM `job`
INNER JOIN `job_group` ON (`job`.`job_group_id` = `job_group`.`id`)
INNER JOIN `job_type` ON (`job`.`job_type_id` = `job_type`.`id`)
INNER JOIN `machine_platform` ON (`job`.`machine_platform_id` = `machine_platform`.`id`)
INNER JOIN `push` ON (`job`.`push_id` = `push`.`id`)
INNER JOIN `reference_data_signatures` ON (`job`.`signature_id` = `reference_data_signatures`.`id`)
LEFT OUTER JOIN `taskcluster_metadata` ON (`job`.`id` = `taskcluster_metadata`.`job_id`)
ORDER BY `job`.`id` ASC

Results from explain are: Using index; Using temporary; Using filesort

These queries are likely the source of the storage increase, as the increase is due to temp data. This is visible in cloud monitoring if you use the following metric: 'cloudsql.googleapis.com/database/disk/bytes_used_by_data_type'

The database storage limit is set to grow dynamically, so hitting it isn't a failure scenario. The database itself is likely so large because the temp data occasionally grows by large amounts.

DB CPU has been spiking more frequently since early April. I'm not sure if any changes happened around then.
We do keep some metrics in InfluxDB for longer than the retention period GCP gives us in Cloud Monitoring.
https://earthangel-b40313e5.influxcloud.net/d/rBBO6ajMz/treeherder?orgId=1&viewPanel=17&from=now-6M&to=now

:aryx noted on Slack that the above query should be a web query: https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L122 and that there should be a 30-second limit.
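
For reference, a per-query alternative to a global limit is MySQL 5.7+'s MAX_EXECUTION_TIME optimizer hint; a minimal sketch using the Django connection (the helper below is hypothetical, not Treeherder's actual code):

from django.db import connection

def run_select_with_limit(sql, params=None, limit_ms=30000):
    # Inject MySQL's MAX_EXECUTION_TIME optimizer hint so this SELECT is
    # killed server-side after limit_ms milliseconds instead of piling up.
    hinted = sql.replace("SELECT", "SELECT /*+ MAX_EXECUTION_TIME(%d) */" % limit_ms, 1)
    with connection.cursor() as cursor:
        cursor.execute(hinted, params)
        return cursor.fetchall()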

As a band-aid solution, I set the db flag max_execution_time to automatically kill queries that run longer than 10 minutes.
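
If it helps for verification, the effective value can be checked from a Django shell against the same instance (MySQL reports max_execution_time in milliseconds, so 10 minutes shows as 600000); a quick sketch:

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SHOW VARIABLES LIKE 'max_execution_time'")
    # Expect something like ('max_execution_time', '600000') once the flag applies.
    print(cursor.fetchone())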

Flags: needinfo?(kkleemola)

This is hitting autoland again; the tree is closed.
The backlog is around 20-25 minutes for now and rising.
Later edit: 1 hour and still going up.

The backlog is gone and I reopened the tree.

I reduced max_execution_time to 2 minutes around 12:30 ET today and the db CPU doesn't seem to be getting stuck at 100% anymore.

Log parsing was delayed again today and autoland was closed (2022-05-10T14:38:14.417230+00:00 from tree status; also reported to #treeherder-ops).

The database appears normal during this period and has been stable since I reduced max_execution_time yesterday, so if we're continuing to see issues it's quite possible there are multiple problems.

(By 'stabilized db' I mean CPU is not consistently at 100% and we're not seeing the high disk IO and overall query slowness described above.)

There's also bug 1746176 in which :cvalaas added additional replicas to the deployment for the unsheriffed parsers.
If we're seeing issues more often with just autoland now, I can try adding additional replicas to the sheriffed parsers similar to what was done before.

:aryx and :jmaher - I think you talked to :cvalaas when he made the previous changes (i.e. increasing replicas for the log parsers and the unsheriffed parsers); any reason it wouldn't make sense to make the same change for the sheriffed parsers?

Flags: needinfo?(jmaher)
Flags: needinfo?(aryx.bugmail)

I think it is OK to increase the number of log parsers. If there are no errors that appear to be the cause of this, +1.

Flags: needinfo?(jmaher)

We can try this. There is only an elevated volume of logs parsed between 12:00pm and 12:30pm UTC (likely from Nightly builds, which run all builds and tests); afterwards the volume drops by more than half.

The log gets marked as parsed here, after the log has been parsed. Querying for unparsed logs returns very few rows:

select repository.name as repository_name,
       job_log.id
from repository
join push
on repository.id = push.repository_id
join job
on push.id = job.push_id
join job_log
on job.id = job_log.job_id
where job.end_time > '2022-05-09' and
      job_log.status = 0 and
      repository.name in ('autoland', 'try')

Either the issue is related to storing the failure lines, or a second attempt to parse and store the log succeeds.

Flags: needinfo?(aryx.bugmail)

Hello, this is happening again on autoland and central. Link
Approx. 10-15 min now.

OK - I made the change manually at 11:20am UTC today and have a PR merged so that this change won't be reverted on the next deploy.

As mentioned in Comment 2 above, a good suspect is get_error_summary and the log insertion code itself.

Here is a list of possible changes to improve db performance (basically, make fewer queries):

  1. TextLogError insertions in store_text_log_summary_artifact should use Django's bulk_create with the ignore_conflicts option, relying on DB constraints. This would remove all the get_or_create hits, grouping all insertions into a single efficient call (we would also get a transaction out of the box). See the sketch after this list.
  2. At the end of store_text_log_summary_artifact, the get_error_summary method is called and lists all the TextLogError rows that were just created. We could provide them to the method (or create another one) to populate the cache without querying the database again.
  3. Check that TextLogError.job is correctly indexed, as it's used a lot for joins.
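
A minimal sketch of points 1 and 2, assuming Django 2.2+ and that TextLogError has a direct job FK plus line/line_number fields covered by a unique constraint (the field names, the parsed_lines shape, and the helper itself are illustrative, not Treeherder's exact API):

from treeherder.model.models import TextLogError

def store_errors_bulk(job, parsed_lines):
    # parsed_lines is assumed to be an iterable of dicts with 'line' and
    # 'line_number' keys; the real parser output may differ.
    errors = [
        TextLogError(job=job, line=item["line"], line_number=item["line_number"])
        for item in parsed_lines
    ]
    # Point 1: a single INSERT with ignore_conflicts (Django 2.2+) replaces the
    # per-row get_or_create calls; duplicate rows are skipped via the DB's
    # unique constraint. Note that with ignore_conflicts=True the returned
    # objects do not get their primary keys populated on MySQL.
    TextLogError.objects.bulk_create(errors, ignore_conflicts=True)
    # Point 2: hand these in-memory rows to the error-summary/cache step so it
    # does not have to re-query the table it just wrote to.
    return errors

get_error_summary (or a new variant of it) could then accept this list and warm its cache directly instead of issuing another SELECT.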