Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster
Categories: Tree Management :: Treeherder: Data Ingestion, defect
Tracking: Not tracked
People: Reporter: aryx, Unassigned
Attachments: 3 files
Code sheriffs have observed Treeherder lagging behind in displaying Taskcluster task data at least three times recently. Two of these were observed around 10pm UTC when the Nightlies start: tasks showed up with an 8-15 minute delay, which makes it impossible to cancel cron tasks in time (e.g. the ones which schedule new Nightlies).
Is there any tool in GCP that lets me evaluate the state of the backlog, and do I have access to it?
Comment 1•3 years ago
Situation recovered after Jeremy restarted the database and workers and added memory.
Comment 2 (Reporter)•2 years ago
Yesterday (Monday) at ~22:30 UTC the delay in task ingestion was observed again, confusing people who pushed to Try because it seemed like no tasks were getting scheduled.
Does the data ingestion pipeline fall behind under load? If so, could we get insight into what is being ingested? We had the Beta and Nightly builds running; in the past the most load-heavy tasks were the performance ones, but I didn't identify a cron task scheduling any of those before the delay started.
Comment 3 (Reporter)•2 years ago
RyanVM reports that ingestion is delayed again. It will likely catch up, but eventually this needs investigation.
Comment 4•2 years ago
I looked at all the things I could think of (which I'm sure is missing some pieces). Today (Monday, Aug 16):
- CloudAMQP for treeherder-prod had a spike in messages starting at 18:36 UTC, reaching a high of almost 10,000 "ready" messages by 19:08, then falling back down to zero by 20:13 UTC
- One of the kubernetes clusters' nodes spiked up to 100% CPU from 19:31 UTC to 20:11 UTC
  - It looks like the pod in question was treeherder-log-parser
- The MySQL database for treeherder prod didn't see any changes in CPU usage for this time period
So I guess some questions are:
- Did treeherder "catch-up" by 20:11 UTC this time?
- Does anyone know what the chain-of-systems is here? What components should I be looking at for bottlenecks?
Comment 5 (Reporter)•2 years ago
- Did treeherder "catch-up" by 20:11 UTC this time?
Based on the information: Yes.
- Does anyone know what the chain-of-systems is here? What components should I be looking at for bottlenecks?
Treeherder gets the Pulse notification and then appears to be busy with log parsing. We are likely looking for log artifacts of jobs which ended at 18:36 UTC or up to 5 minutes earlier, are big (uncompressed size), and keep the log parser busy (the bottleneck being reading the log). The 100% CPU later on could be the decompression of normally sized logs (much smaller than the ones causing the delay).
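For reference, a minimal sketch of how the sizes of suspect logs could be checked without downloading them, by reading Content-Length from the Taskcluster queue's artifact endpoint. The firefox-ci-tc deployment URL and the public/logs/live_backing.log artifact name are assumptions, and the task IDs are placeholders:

```python
# Sketch only: report the size of each task's live log by reading Content-Length,
# without downloading the log body itself.
import requests

TC_ROOT = "https://firefox-ci-tc.services.mozilla.com"
TASK_IDS = ["<taskId-1>", "<taskId-2>"]  # placeholders: tasks that finished around 18:36 UTC

for task_id in TASK_IDS:
    url = f"{TC_ROOT}/api/queue/v1/task/{task_id}/runs/0/artifacts/public/logs/live_backing.log"
    # stream=True so only the response headers are fetched, not the whole log body
    resp = requests.get(url, stream=True, allow_redirects=True, timeout=30)
    print(task_id, resp.headers.get("Content-Length", "unknown"), "bytes")
    resp.close()
```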
Comment 6•2 years ago
I can't see any obvious problems in the treeherder-prod logs, but there are thousands of log entries per minute, so it's tough to narrow down.
I'd be happy to give you access to the logs (or whatever else) if that would help in tracking this down ... ?
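One way to narrow that down could be a Cloud Logging filter scoped to the suspect container and time window; a sketch using the Python client, where the container name, severity threshold and time window are assumptions and the project ID is taken from the GCP console URL referenced later in this bug:

```python
# Sketch: query treeherder-prod logs via the Cloud Logging API with a narrow filter
# instead of scrolling the Logs Explorer. Filter values below are guesses.
from google.cloud import logging as gcl

client = gcl.Client(project="moz-fx-treeherder-prod-c739")

FILTER = (
    'resource.type="k8s_container" '
    'resource.labels.container_name="treeherder-log-parser" '  # hypothetical label value
    'severity>=WARNING '
    'timestamp>="2021-08-16T18:30:00Z" timestamp<="2021-08-16T20:15:00Z"'
)

for entry in client.list_entries(filter_=FILTER, order_by=gcl.DESCENDING, page_size=100):
    print(entry.timestamp, entry.severity, entry.payload)
```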
Comment 7 (Reporter)•2 years ago
Thank you, please grant access to shengst@m.c
Comment 8•2 years ago
I added you as a "Logs Viewer" for the treeherder folder, which should let you see the treeherder-prod logs as well as treeherder-nonprod, which could also help with troubleshooting https://bugzilla.mozilla.org/show_bug.cgi?id=1429030
Comment 9 (Reporter)•2 years ago
This affects TH production again. Switching to any SQL performance measurement other than the default "CPU utilization" at https://console.cloud.google.com/sql/instances/treeherder-prod-prod-v1/overview?authuser=0&project=moz-fx-treeherder-prod-c739&supportedpurview=project doesn't work for me (at least the pages don't always require a reload this time). Chris, does this need additional permissions?
Comment 10•2 years ago
It shouldn't require any additional perms. You have "viewer" for the entire project.
I looked at the MySQL Queries graph and it shows a spike from about 600 queries/s (baseline for the last hour) to 1,865 queries/s at 3:43pm (Pacific time), then back to baseline 1 minute later.
The other graphs don't show any bump at that time.
A backup started 5 minutes after that time (at 3:48p pacific), so that doesn't seem related.
And mysql error logs show nothing for the last hour.
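For completeness, the same queries-per-second series can presumably also be pulled programmatically via the Cloud Monitoring API; a sketch, assuming the cloudsql.googleapis.com/database/mysql/queries metric type and the project ID from the console URL above:

```python
# Sketch: fetch the Cloud SQL query-count time series for the last hour.
# The metric type and project ID are assumptions; narrow the filter further to the
# treeherder-prod-prod-v1 instance if needed.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)
results = client.list_time_series(
    request={
        "name": "projects/moz-fx-treeherder-prod-c739",
        "filter": 'metric.type = "cloudsql.googleapis.com/database/mysql/queries"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)
```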
Comment 11•2 years ago
Additionally, no slow queries (longer than 1 second to complete) were logged in the slow query log during that time.
Comment 12 (Reporter)•2 years ago
The issue we are looking for started at ~3:00pm PDT or earlier, based on the delay with which tasks are showing up on Treeherder.
Comment 13•2 years ago
Only 2 slow queries were logged between 2p and 4p (pacific). Both took between 1 and 2 seconds to complete, so that's probably not it.
Between 2:53p and 3:02p there was a spike in MySQL egress traffic that seems slightly anomalous, but that doesn't seem too helpful either.
I don't see any other weird spikes over the last 6 hours.
Comment 14 (Reporter)•2 years ago
Could the egress traffic cause a backlog in writes?
What looks unexpected is that the Log Explorer lists the "received push message from" messages, which are logged with logger.info, while the "received job message from" messages are missing (source). Any idea what causes the difference?
Comment 15•2 years ago
I'm just guessing here, but the "received job message from" log entries are at the debug level, so probably the overall app configuration is set to not emit those. I see the "received push message from" entries are at the info level, and even then they're just being emitted to stderr.
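That matches standard Python logging behaviour; a tiny illustration (the logger name and message texts are stand-ins, not the exact Treeherder ones):

```python
# With the root level at INFO, logger.debug() calls are dropped while
# logger.info() calls are emitted (basicConfig writes to stderr by default).
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("treeherder.services.pulse")  # hypothetical logger name

logger.info("received push message from exchange/...")   # emitted to stderr
logger.debug("received job message from exchange/...")   # suppressed at INFO level
```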
Comment 16 (Reporter)•2 years ago
This has been observed again: code sheriffs reached out at 18:09 UTC (20:09 in the screenshot) about pushes that were already 10 minutes old without having had a decision task scheduled. The prototype instance was also affected, which means the delay should not be caused by requests against the instance. The screenshot contains the count of notifications about jobs ("tasks") processed by Treeherder (search term: job message) on the non-prod instance, which seems to log differently (no slow SQL logging, but it logs logger.debug instead?). The increase stems mostly from the CI for the release candidates of Firefox 92.0, Firefox 78.14.0 and likely also mobile releases. Most log messages mention signing-related tasks (signing, beetmover).
Chris, could you log the size of the Pulse queue? That would tell us whether the output from Taskcluster outpaces Treeherder ingestion, or whether it is an issue with Taskcluster's exchange with Pulse or with Pulse itself.
Should we loop in e.g. jwhitlock from Taskcluster and try to get data on whether the count or size of logs to parse increased when this hit?
The maximum delay was ~30 minutes this time; Treeherder caught up at ~19:15 UTC. Production trees were closed for an hour.
The time window for the incident on 2021-08-23 doesn't show this peak.
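In case it helps, a rough sketch of what logging the Pulse queue size could look like against the RabbitMQ management API that CloudAMQP exposes; the host, vhost, queue name and credentials are placeholders, not the real treeherder-prod values:

```python
# Rough sketch: poll the RabbitMQ management API once a minute and print the backlog.
import time
import requests

URL = "https://<cloudamqp-host>/api/queues/<vhost>/<queue-name>"  # placeholders
AUTH = ("user", "password")                                       # placeholders

while True:
    data = requests.get(URL, auth=AUTH, timeout=10).json()
    # messages_ready = waiting to be consumed (the backlog);
    # messages_unacknowledged = delivered to a consumer but not yet acked.
    print(
        time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "ready:", data.get("messages_ready"),
        "unacked:", data.get("messages_unacknowledged"),
    )
    time.sleep(60)
```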
Comment 17•2 years ago
Comment 18•2 years ago
Right around 17:16 UTC on Sept 1, I see a huge spike in messages to the treeherder-prod and treeherder-prototype queues. Over the last seven days the average has been around 500 messages, but at that time the number of messages jumped to 25,000 (prod) and 30,000 (prototype).
Comment 19•2 years ago
Seems like chewing through that queue would certainly cause delays. If those are all legit messages, then perhaps whatever is supposed to chew through them needs to be beefed up? Also, looping in :jwhitlock is never a bad idea, except as it impacts his sanity.
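If the consumers really just can't keep up, one option might be to bump the replica count of whatever worker drains the queue; a sketch with the official Kubernetes Python client, where the deployment and namespace names are guesses rather than the real treeherder-prod ones:

```python
# Sketch only: scale up the deployment that consumes the Pulse queue.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="treeherder-prod-pulse-listener",   # hypothetical deployment name
    namespace="treeherder-prod",             # hypothetical namespace
    body={"spec": {"replicas": 4}},
)
```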