Open Bug 1717128 Opened 4 years ago Updated 4 years ago

Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster

Tracking

(Not tracked)

Status:

REOPENED

People

(Reporter: aryx, Unassigned)

Details

Attachments

(3 files)

screenshot job messages per minute non-prod treeherder 2021-09-01 (timestamps UTC+2) 4 years ago Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) 33.63 KB, image/png		Details
Prod messages last 7 days.png 4 years ago chris valaas [:cvalaas] 32.48 KB, image/png		Details
Prod message last 24 hrs.png 4 years ago chris valaas [:cvalaas] 39.17 KB, image/png		Details

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Description

•

4 years ago

Code sheriffs have observed Treeherder lagging the display of Taskcluster task data at least 3 times recently. 2 of them got observed around 10pm UTC when the Nightlies start and tasks showed up with 8-15 minutes delay which renders it impossible to cancel cron tasks (e.g. the ones which schedule new Nightlies).

Is there any tool in GCP which allows to evaluate the state of the backlog and to which I have access?

Flags: needinfo?(oremj)

Alexandru Michis [:malexandru]

Updated

•

4 years ago

Summary: Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster → [Trees closed] Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster

Alexandru Michis [:malexandru]

Comment 1

•

4 years ago

Situation recovered after Jeremy restarted the database and workers and added memory.

Status: NEW → RESOLVED

Closed: 4 years ago

Resolution: --- → FIXED

Jeremy Orem [:oremj]

Updated

•

4 years ago

Flags: needinfo?(oremj)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 2

•

4 years ago

Yesterday (Monday) at ~22:30 UTC the delay in task ingestion got observed again, confusing people who pushed to Try because it seemed like no tasks were getting scheduled.

Does the data ingestion pipeline fall back with the load? If yes, could we get insight into what gets ingested? We had the Beta and Nightly builds running, the most load heavy tasks were the performance ones in the past but I didn't identify a cron task scheduling such before the delay started.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 3

•

4 years ago

RyanVM reports ingestting is delayed again. It will likely catch up but eventually this needs investigation.

Flags: needinfo?(cvalaas)

Summary: [Trees closed] Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster → Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster

chris valaas [:cvalaas]

Comment 4

•

4 years ago

I looked at all the things I could think of (which I'm sure is missing some pieces). Today (Monday, Aug 16):

CloudAMQP for treeherder-prod had a spike in messages between 18:36 UTC to a high of almost 10,000 "ready" messages by 19:08, then fell back down to zero by 20:13 UTC
One of the kubernetes clusters' nodes spiked up to 100% CPU from 19:31 UTC to 20:11 UTC
- It looks like the pod in question was treeherder-log-parser
The MySQL database for treeherder prod didn't see any changes in CPU usage for this time period

So I guess some questions are:

Did treeherder "catch-up" by 20:11 UTC this time?
Does anyone know what the chain-of-systems is here? What components should I be looking at for bottlenecks?

Flags: needinfo?(cvalaas)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 5

•

4 years ago

Did treeherder "catch-up" by 20:11 UTC this time?

Based on the information: Yes.

Does anyone know what the chain-of-systems is here? What components should I be looking at for bottlenecks?

Treeherder gets the pulse notification and seems busy handling the log parsing. We are likely looking for log artifacts of jobs which ended at 18:36 UTC or <5 minutes before and are big (uncompressed size) and keep the log parser busy (bottleneck reading the log). The 100% later could be the decompression of logs of normal size (much smaller than the ones causing the delay).

chris valaas [:cvalaas]

Comment 6

•

4 years ago

I can't see any obvious problems in the treeherder-prod logs, but there're thousands of log entries per minute, so it's tough to narrow down.
I'd be happy to give you access to the logs (or whatever else) if that would help in tracking this down ... ?

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 7

•

4 years ago

Thank you, please grant the access to shengst@m.c

chris valaas [:cvalaas]

Comment 8

•

4 years ago

I added you as a "Logs Viewer" for the treeherder folder, which should let you see treeherder-prod logs, as well as treeherder-nonprod, which could also help with troubleshooting https://bugzilla.mozilla.org/show_bug.cgi?id=1429030

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 9

•

4 years ago

This affects TH production again. Switching to any measurement for SQL performance except the default "CPU utilization" at https://console.cloud.google.com/sql/instances/treeherder-prod-prod-v1/overview?authuser=0&project=moz-fx-treeherder-prod-c739&supportedpurview=project doesn't work for me (at least the pages don't require always a reload this time). Chris, does this need additional permissions?

chris valaas [:cvalaas]

Comment 10

•

4 years ago

•

Edited

It shouldn't require any additional perms. You have "viewer" for the entire project.

I looked at the MySQL Queries graph and it shows a spike from about 600 queries/s (baseline for the last hour) to 1,865 queries/s at 3:43pm (Pacific time), then back to baseline 1 minute later.
The other graphs don't show any bump at that time.

A backup started 5 minutes after that time (at 3:48p pacific), so that doesn't seem related.

And mysql error logs show nothing for the last hour.

chris valaas [:cvalaas]

Comment 11

•

4 years ago

•

Edited

Additionally, no slow queries (longer than 1 second to complete) were logged in the slow query log during that time.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 12

•

4 years ago

The issue we are looking for started at ~3:00pm PDT or earlier based on the delay with which the tasks are showing up on Treeherder.

chris valaas [:cvalaas]

Comment 13

•

4 years ago

Only 2 slow queries were logged between 2p and 4p (pacific). Both took between 1 and 2 seconds to complete, so that's probably not it.
Between 2:53p and 3:02p there was a spike in MySQL egress traffic that seems slightly anomalous, but that doesn't seem too helpful either.
I don't see any other weird spikes over the last 6 hours.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 14

•

4 years ago

Could the egress traffic cause a backlog in writes?

What looks unexpected is the fact the Log Explorer lists received push message from messages which are logged with logger.info while the received job message from messages are missing (source). Any idea what causes the difference?

chris valaas [:cvalaas]

Comment 15

•

4 years ago

I'm just guessing here, but the received job message from log entries are at the debug level, so probably the overall app configuration is set to not emit those. I see the received push message from entries are at the info level, and even then they're just being emitted to stderr.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 16

•

4 years ago

Attached image screenshot job messages per minute non-prod treeherder 2021-09-01 (timestamps UTC+2) — Details

This has been observed again, code sheriffs reached out at 18:09 UTC (20:09 in the screenshot) for pushes with an age of 10 minutes having had no decision task scheduled. The prototype instance was also affected which means the delay should not be from requests against the instance. The screenshot contains the count of notifications about jobs ("tasks") processed by Treeherder (search term is job message) on the non-prod instance which seems to log differently (no slow SQL logging but logs logger.debug instead?). The increase stems mostly from the CI to release candidates for Firefox 92.0, Firefox 78.14.0 and likely also mobile releases. Most log messages mentions signing related tasks (signing, beetmover).

Chris, could you log the size of the pulse queue which will tell us if the output from Taskcluster outpaces Treeherder ingestion or if it is an issue with Taskcluster's exchange with Pulse or Pulse itself?

Should be loop e.g. jwhitlock from Taskcluster in and try to get data if the count or size of logs to parse increased when this hit?

The maximum delay was ~30 minutes this time, Treeherder caught up ~19:15 UTC. Production trees were closed for an hour.

The time window for the incidence on 2021-08-23 doesn't show this peak.

Flags: needinfo?(cvalaas)

chris valaas [:cvalaas]

Comment 17

•

4 years ago

Attached image Prod messages last 7 days.png — Details

Flags: needinfo?(cvalaas)

chris valaas [:cvalaas]

Comment 18

•

4 years ago

Attached image Prod message last 24 hrs.png — Details

Right around 17:16 UTC on Sept 1, I see a huge spike in messages to the treeherder-prod and treeherder-prototype queues. In the last seven days the average is around maybe 500 messages, but at that time the number of messages increases to 25,000 (prod) and 30,000 (prototype).

chris valaas [:cvalaas]

Comment 19

•

4 years ago

Seems like chewing through that queue would certainly cause delays. If those are all legit messages, then perhaps whatever is supposed to chew through those messages needs to be beefed-up? Also, looping in :jwhitlock is never a bad idea, except as it impacts his sanity.

You need to log in before you can comment on or make changes to this bug.