Open Bug 1717128 Opened 2 years ago Updated 1 year ago

Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster

Categories

(Tree Management :: Treeherder: Data Ingestion, defect)

defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: aryx, Unassigned)

Details

Attachments

(3 files)

Code sheriffs have observed Treeherder lagging the display of Taskcluster task data at least 3 times recently. 2 of them got observed around 10pm UTC when the Nightlies start and tasks showed up with 8-15 minutes delay which renders it impossible to cancel cron tasks (e.g. the ones which schedule new Nightlies).

Is there any tool in GCP which allows to evaluate the state of the backlog and to which I have access?

Flags: needinfo?(oremj)
Summary: Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster → [Trees closed] Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster

Situation recovered after Jeremy restarted the database and workers and added memory.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Flags: needinfo?(oremj)

Yesterday (Monday) at ~22:30 UTC the delay in task ingestion got observed again, confusing people who pushed to Try because it seemed like no tasks were getting scheduled.

Does the data ingestion pipeline fall back with the load? If yes, could we get insight into what gets ingested? We had the Beta and Nightly builds running, the most load heavy tasks were the performance ones in the past but I didn't identify a cron task scheduling such before the delay started.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

RyanVM reports ingestting is delayed again. It will likely catch up but eventually this needs investigation.

Flags: needinfo?(cvalaas)
Summary: [Trees closed] Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster → Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster

I looked at all the things I could think of (which I'm sure is missing some pieces). Today (Monday, Aug 16):

  • CloudAMQP for treeherder-prod had a spike in messages between 18:36 UTC to a high of almost 10,000 "ready" messages by 19:08, then fell back down to zero by 20:13 UTC
  • One of the kubernetes clusters' nodes spiked up to 100% CPU from 19:31 UTC to 20:11 UTC
    • It looks like the pod in question was treeherder-log-parser
  • The MySQL database for treeherder prod didn't see any changes in CPU usage for this time period

So I guess some questions are:

  1. Did treeherder "catch-up" by 20:11 UTC this time?
  2. Does anyone know what the chain-of-systems is here? What components should I be looking at for bottlenecks?
Flags: needinfo?(cvalaas)
  1. Did treeherder "catch-up" by 20:11 UTC this time?

Based on the information: Yes.

  1. Does anyone know what the chain-of-systems is here? What components should I be looking at for bottlenecks?

Treeherder gets the pulse notification and seems busy handling the log parsing. We are likely looking for log artifacts of jobs which ended at 18:36 UTC or <5 minutes before and are big (uncompressed size) and keep the log parser busy (bottleneck reading the log). The 100% later could be the decompression of logs of normal size (much smaller than the ones causing the delay).

I can't see any obvious problems in the treeherder-prod logs, but there're thousands of log entries per minute, so it's tough to narrow down.
I'd be happy to give you access to the logs (or whatever else) if that would help in tracking this down ... ?

I added you as a "Logs Viewer" for the treeherder folder, which should let you see treeherder-prod logs, as well as treeherder-nonprod, which could also help with troubleshooting https://bugzilla.mozilla.org/show_bug.cgi?id=1429030

This affects TH production again. Switching to any measurement for SQL performance except the default "CPU utilization" at https://console.cloud.google.com/sql/instances/treeherder-prod-prod-v1/overview?authuser=0&project=moz-fx-treeherder-prod-c739&supportedpurview=project doesn't work for me (at least the pages don't require always a reload this time). Chris, does this need additional permissions?

It shouldn't require any additional perms. You have "viewer" for the entire project.

I looked at the MySQL Queries graph and it shows a spike from about 600 queries/s (baseline for the last hour) to 1,865 queries/s at 3:43pm (Pacific time), then back to baseline 1 minute later.
The other graphs don't show any bump at that time.

A backup started 5 minutes after that time (at 3:48p pacific), so that doesn't seem related.

And mysql error logs show nothing for the last hour.

Additionally, no slow queries (longer than 1 second to complete) were logged in the slow query log during that time.

The issue we are looking for started at ~3:00pm PDT or earlier based on the delay with which the tasks are showing up on Treeherder.

Only 2 slow queries were logged between 2p and 4p (pacific). Both took between 1 and 2 seconds to complete, so that's probably not it.
Between 2:53p and 3:02p there was a spike in MySQL egress traffic that seems slightly anomalous, but that doesn't seem too helpful either.
I don't see any other weird spikes over the last 6 hours.

Could the egress traffic cause a backlog in writes?

What looks unexpected is the fact the Log Explorer lists received push message from messages which are logged with logger.info while the received job message from messages are missing (source). Any idea what causes the difference?

I'm just guessing here, but the received job message from log entries are at the debug level, so probably the overall app configuration is set to not emit those. I see the received push message from entries are at the info level, and even then they're just being emitted to stderr.

This has been observed again, code sheriffs reached out at 18:09 UTC (20:09 in the screenshot) for pushes with an age of 10 minutes having had no decision task scheduled. The prototype instance was also affected which means the delay should not be from requests against the instance. The screenshot contains the count of notifications about jobs ("tasks") processed by Treeherder (search term is job message) on the non-prod instance which seems to log differently (no slow SQL logging but logs logger.debug instead?). The increase stems mostly from the CI to release candidates for Firefox 92.0, Firefox 78.14.0 and likely also mobile releases. Most log messages mentions signing related tasks (signing, beetmover).

Chris, could you log the size of the pulse queue which will tell us if the output from Taskcluster outpaces Treeherder ingestion or if it is an issue with Taskcluster's exchange with Pulse or Pulse itself?

Should be loop e.g. jwhitlock from Taskcluster in and try to get data if the count or size of logs to parse increased when this hit?

The maximum delay was ~30 minutes this time, Treeherder caught up ~19:15 UTC. Production trees were closed for an hour.

The time window for the incidence on 2021-08-23 doesn't show this peak.

Flags: needinfo?(cvalaas)
Flags: needinfo?(cvalaas)

Right around 17:16 UTC on Sept 1, I see a huge spike in messages to the treeherder-prod and treeherder-prototype queues. In the last seven days the average is around maybe 500 messages, but at that time the number of messages increases to 25,000 (prod) and 30,000 (prototype).

Seems like chewing through that queue would certainly cause delays. If those are all legit messages, then perhaps whatever is supposed to chew through those messages needs to be beefed-up? Also, looping in :jwhitlock is never a bad idea, except as it impacts his sanity.

You need to log in before you can comment on or make changes to this bug.