Open Bug 1763795 Opened 2 years ago Updated 2 years ago

Log parsing backlogged, logs waiting to get parsed

Categories

(Tree Management :: Treeherder, defect)


Tracking

(Not tracked)

REOPENED

People

(Reporter: NarcisB, Unassigned)

Details

Attachments

(2 files)

We're seeing a large number of failures for which only the raw logs are available.

Failures appear as having log parsing incomplete:
e.g. push -> https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=YxoXv0xUT5y-dOC7noK80g.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=778d89a3d0f04691dd19616ab3288eda7b4f41be

Closed autoland and raised the issue on Slack -> #Treeherder and #FirefoxCI

Flags: needinfo?(kkleemola)
Flags: needinfo?(bwells)

Log parser has processed the backlog.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(kkleemola)
Flags: needinfo?(bwells)
Resolution: --- → INCOMPLETE

We have repeatedly seen log parsing get backlogged, most recently last Friday for 90 minutes. The root cause is unknown.

SQL queries have also been generally slow these days:

The root cause is still unknown.

Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Summary: Log parsing is not working, shows as Log parsing not complete → Log parsing backlogged, logs waiting to get parsed

We hit the issue again and GCP shows a MySQL storage increase of 50% (choose 'Storage usage' from the dropdown). Katie, could you investigate what is happening here, e.g. compare the sizes of the tables with the ones in a backup? Thank you in advance.

Flags: needinfo?(kkleemola)

We are again encountering the issue with logs for unsheriffed repositories (e.g. 'try'). This query shows the database knows about the latest tasks and the logs are marked as parsed in the database. The db increased its storage usage by ~500GB in the last 2 hours. Two more increases like that and we hit the limit for the database size as defined in GCP.

I killed about 300 queries that had been running for over an hour, which stabilized the db.

The query in question is:

SELECT `job`.`submit_time`, `job`.`start_time`, `job`.`end_time`,
       `job`.`failure_classification_id`, `job`.`id`,
       `job_group`.`name`, `job_group`.`symbol`,
       `job_type`.`name`, `job_type`.`symbol`,
       `job`.`last_modified`, `job`.`option_collection_hash`,
       `machine_platform`.`platform`, `job`.`option_collection_hash`,
       `job`.`push_id`, `push`.`revision`, `job`.`result`,
       `reference_data_signatures`.`signature`, `job`.`state`, `job`.`tier`,
       `taskcluster_metadata`.`task_id`, `taskcluster_metadata`.`retry_id`
FROM `job`
INNER JOIN `job_group` ON (`job`.`job_group_id` = `job_group`.`id`)
INNER JOIN `job_type` ON (`job`.`job_type_id` = `job_type`.`id`)
INNER JOIN `machine_platform` ON (`job`.`machine_platform_id` = `machine_platform`.`id`)
INNER JOIN `push` ON (`job`.`push_id` = `push`.`id`)
INNER JOIN `reference_data_signatures` ON (`job`.`signature_id` = `reference_data_signatures`.`id`)
LEFT OUTER JOIN `taskcluster_metadata` ON (`job`.`id` = `taskcluster_metadata`.`job_id`)
ORDER BY `job`.`id` ASC

Results from explain are: Using index; Using temporary; Using filesort

These queries are likely the source of the storage increase, as the increase is due to temp data. This is visible in cloud monitoring if you use the following metric: 'cloudsql.googleapis.com/database/disk/bytes_used_by_data_type'

The database storage limit is set to grow dynamically, so hitting it isn't a failure scenario. The database itself is likely so large because the temp data occasionally grows by large amounts.

DB CPU has been spiking more frequently since early April. I'm not sure if any changes happened around then.
We do keep some metrics in InfluxDB for longer than the retention period GCP gives us in Cloud Monitoring.
https://earthangel-b40313e5.influxcloud.net/d/rBBO6ajMz/treeherder?orgId=1&viewPanel=17&from=now-6M&to=now

:aryx noted on Slack that the above query should be a web query: https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L122 and that there should be a 30-second limit.
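
For reference, a per-query alternative to a global limit is MySQL 5.7+'s MAX_EXECUTION_TIME optimizer hint; a minimal sketch using the Django connection (the helper below is hypothetical, not Treeherder's actual code):

from django.db import connection

def run_select_with_limit(sql, params=None, limit_ms=30000):
    # Inject MySQL's MAX_EXECUTION_TIME optimizer hint so this SELECT is
    # killed server-side after limit_ms milliseconds instead of piling up.
    hinted = sql.replace("SELECT", "SELECT /*+ MAX_EXECUTION_TIME(%d) */" % limit_ms, 1)
    with connection.cursor() as cursor:
        cursor.execute(hinted, params)
        return cursor.fetchall()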

As a band-aid solution, I set the db flag max_execution_time to automatically kill queries that run longer than 10 minutes.
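
If it helps for verification, the effective value can be checked from a Django shell against the same instance (MySQL reports max_execution_time in milliseconds, so 10 minutes shows as 600000); a quick sketch:

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SHOW VARIABLES LIKE 'max_execution_time'")
    # Expect something like ('max_execution_time', '600000') once the flag applies.
    print(cursor.fetchone())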

Flags: needinfo?(kkleemola)

This is hitting autoland again; the tree is closed.
The backlog is around 20-25 minutes for now and rising.
Later edit: 1 hour and still going up.

The backlog is gone and I reopened the tree.

I reduced max_execution_time to 2 minutes around 12:30 ET today and the db CPU doesn't seem to be getting stuck at 100% anymore.

Log parsing was delayed again today and autoland was closed (2022-05-10T14:38:14.417230+00:00 from tree status; also reported to #treeherder-ops).

The database appears normal during this period and has been stable since I reduced max_execution_time yesterday, so if we're continuing to see issues it's quite possible there are multiple problems.

(By 'stabilized db' I mean CPU is not consistently at 100% and we're not seeing the high disk IO and overall query slowness described above.)

There's also bug 1746176 in which :cvalaas added additional replicas to the deployment for the unsheriffed parsers.
If we're seeing issues more often with just autoland now, I can try adding additional replicas to the sheriffed parsers similar to what was done before.

:aryx and :jmaher - I think you talked to :cvalaas when he made the previous changes (i.e. increasing replicas for the log parsers and the unsheriffed parsers); any reason it wouldn't make sense to make the same change for the sheriffed parsers?

Flags: needinfo?(jmaher)
Flags: needinfo?(aryx.bugmail)

I think it is OK to increase the number of log parsers. If there are no errors that appear to be the cause of this, +1.

Flags: needinfo?(jmaher)

We can try this. There is only an elevated volume of logs parsed between 12:00pm and 12:30pm UTC (likely from Nightly builds, which run all builds and tests); afterwards the volume drops by more than half.

The log gets marked as parsed here, after the log has been parsed. Querying for unparsed logs returns very few rows:

select repository.name as repository_name,
       job_log.id
from repository
join push
on repository.id = push.repository_id
join job
on push.id = job.push_id
join job_log
on job.id = job_log.job_id
where job.end_time > '2022-05-09' and
      job_log.status = 0 and
      repository.name in ('autoland', 'try')

Either the issue is related to storing the failure lines, or a second attempt to parse and store the log succeeds.

Flags: needinfo?(aryx.bugmail)

Hello, this is happening again on autoland and central. Link
Approx. 10-15 min now.

OK - I made the change manually at 11:20am UTC today and have a PR merged so that this change won't be reverted on the next deploy.

As mentioned in Comment 2 above, a good suspect is get_error_summary and the log insertion code itself.

Here is a list of possible changes to improve db performance (basically, make fewer queries):

  1. TextLogError insertions in store_text_log_summary_artifact should use Django's bulk_create with the ignore_conflicts option, relying on DB constraints. This would remove all the get_or_create hits, grouping all insertions into a single efficient call (we would also get a transaction out of the box). See the sketch after this list.
  2. At the end of store_text_log_summary_artifact, the get_error_summary method is called and lists all the TextLogError rows that were just created. We could provide them to the method (or create another one) to populate the cache without querying the database again.
  3. Check that TextLogError.job is correctly indexed, as it's used a lot for joins.
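
A minimal sketch of points 1 and 2, assuming Django 2.2+ and that TextLogError has a direct job FK plus line/line_number fields covered by a unique constraint (the field names, the parsed_lines shape, and the helper itself are illustrative, not Treeherder's exact API):

from treeherder.model.models import TextLogError

def store_errors_bulk(job, parsed_lines):
    # parsed_lines is assumed to be an iterable of dicts with 'line' and
    # 'line_number' keys; the real parser output may differ.
    errors = [
        TextLogError(job=job, line=item["line"], line_number=item["line_number"])
        for item in parsed_lines
    ]
    # Point 1: a single INSERT with ignore_conflicts (Django 2.2+) replaces the
    # per-row get_or_create calls; duplicate rows are skipped via the DB's
    # unique constraint. Note that with ignore_conflicts=True the returned
    # objects do not get their primary keys populated on MySQL.
    TextLogError.objects.bulk_create(errors, ignore_conflicts=True)
    # Point 2: hand these in-memory rows to the error-summary/cache step so it
    # does not have to re-query the table it just wrote to.
    return errors

get_error_summary (or a new variant of it) could then accept this list and warm its cache directly instead of issuing another SELECT.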