gecko-3/decision workers not taking tasks
Categories
(Release Engineering :: Firefox-CI Administration, defect)
Tracking
(Not tracked)
People
(Reporter: aryx, Assigned: masterwayz)
References
Details
Attachments
(4 files)
The workers in the pool gecko-3/decision stopped taking tasks. https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-3/worker-types/decision?sortBy=First%20Claim&sortDirection=desc lists no running tasks, while https://treeherder.mozilla.org/jobs?repo=mozilla-central&group_state=expanded&selectedTaskRun=BG1VjjKeS4CkmGtpceizig.0&searchStr=decision shows 8 tasks that have been waiting 40-50 minutes to run.
A similar decision worker issue was observed in bug 1735411 last week, but it was not that bug's primary focus because the workers eventually started taking tasks without human intervention.
Reporter
Comment 1•4 years ago
New worker(s) started taking tasks again.
Comment 2•4 years ago
There were four pool errors about being unable to allocate capacity:
- 2021-10-18T10:43:29.690Z - No spot capacity that matches your request
- 2021-10-18T09:24:45.686Z - No spot capacity that matches your request
- 2021-10-18T08:59:22.712Z - Insufficient m5d.xlarge capacity in us-east-1c. "You can currently get m5d.xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f"
- 2021-10-18T08:53:47.191Z - No spot capacity that matches your request
This roughly lines up with when this bug was filed, but I'm assuming some spot requests were fulfilled.
I observed the pool requesting 2 additional workers, and these transitioned from "requested" to "running" in less than 5 minutes.
Pool capacity is calculated as the number of workers in the "requested", "running", and "stopping" states, times the capacity per worker. The gecko-3/decision pool is an AWS pool, so it doesn't have the "stopping" state, and each worker provides 2 capacity.
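To make the arithmetic concrete, here's a minimal sketch of that rule (illustrative Python; not worker-manager's actual code):

```python
# Minimal sketch of the capacity rule above; names are illustrative.
CAPACITY_PER_WORKER = 2  # gecko-3/decision runs 2 capacity per worker

def pool_capacity(workers_by_state):
    # Only these states count toward capacity; AWS pools never report
    # "stopping", so that term is simply 0 for gecko-3/decision.
    counted = ("requested", "running", "stopping")
    return sum(workers_by_state.get(s, 0) for s in counted) * CAPACITY_PER_WORKER

# The 5 running us-east-1 workers in the table below contribute 5 * 2 = 10:
print(pool_capacity({"running": 5}))  # -> 10
```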
Current running workers are distributed across AWS regions:
| Pool ID | Group ID | Provider ID | State | Workers | Capacity |
|---|---|---|---|---|---|
| gecko-3/decision | us-east-1 | aws | running | 5 | 10 |
| gecko-3/decision | us-east-1 | aws | stopped | 636 | 1272 |
| gecko-3/decision | us-west-1 | aws | running | 2 | 4 |
| gecko-3/decision | us-west-1 | aws | stopped | 276 | 552 |
| gecko-3/decision | us-west-2 | aws | running | 3 | 6 |
| gecko-3/decision | us-west-2 | aws | stopped | 383 | 766 |
If there was an AWS shortage, it was limited to around when this bug was opened, and is now cleared.
Assignee
Comment 3•4 years ago
Adding some info here: this is occurring again right now.
Looking at the AWS console (assuming I'm on the right account and should be able to see gecko-3/decision workers too), there are no active gecko-3/decision workers, only gecko-1. All gecko-3/decision workers show as terminated from a while back, and no new ones exist.
Assignee
Comment 4•4 years ago
I have been refreshing the AWS console and now workers are back.
The reason behind the theory below is that, a few minutes after the AWS console reported no decision workers at all (not even in a terminated state), Taskcluster's UI reported a capacity of 0, and then everything went back up and running.
Possibly something like this:
- A worker gets terminated somehow.
- worker-manager doesn't know it has been terminated.
- When checking whether the VM exists, worker-manager doesn't realize it is gone: AWS may still report the instance, but in a terminated state, so worker-manager thinks the VM still exists.
- When the terminated VMs no longer exist on the AWS side, worker-manager no longer sees them and goes "hey, the capacity is now 0 (or below 20 at least)".
- We get back up and running.
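As a minimal sketch of that hypothesis (purely illustrative Python; not the actual worker-manager code):

```python
# Illustrative only: a scanner that trusts "instance is still visible in the
# cloud API" keeps counting terminated workers toward capacity until the API
# stops listing them entirely.
def apparent_capacity(instances, capacity_per_worker=2):
    capacity = 0
    for inst in instances:
        # Hypothesis: only existence is checked, not the instance state, and
        # AWS keeps listing terminated instances for a while after shutdown.
        if inst["visible_in_api"]:
            capacity += capacity_per_worker
    return capacity

# Terminated but still listed by AWS -> still counted as 2 capacity:
print(apparent_capacity([{"visible_in_api": True, "state": "terminated"}]))
```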
Comment 5•4 years ago
Thanks! I was able to see the incident while it was going on.
From worker manager's perspective, there were 10 VMs (2 capacity per VM, so 20 capacity). All of them were started around 20:30 to 20:45 UTC, which was about 2 hours before this incident, and hadn't been checked since.
At 22:08Z (a few minutes later, but 100 minutes after the VMs were created), the 10 VMs were moved to "stopping" state, and 7 new VMs were launched and quickly started taking tasks.
The Worker Manager service's Worker Scanner is responsible for detecting self-terminated workers, as well as provisioning and de-provisioning Azure workers (AWS workers are provisioned by the Worker Provisioner service). The Worker Scanner loops through all workers; it has been slower since Azure workers were added, and this may be especially bad when many workers are in the "starting" or "stopping" states.
Based on the timestamps, it really took 100 minutes for a Worker Scanner cycle. Peeking at gecko-t/win10-64-2004, there are many workers starting and stopping:
| State | Workers | Capacity |
|---|---|---|
| requested | 228 | 228 |
| running | 148 | 148 |
| stopping | 357 | 357 |
| stopped | 16047 | 16047 |
If this theory is true, decision workers will take an hour to be replaced after self-termination, until the Azure pools have stabilized.
Comment 6•4 years ago
masterwayz looked at the worker logs, so we can add self-termination to the timeline:
- 20:30 to 20:45 - Workers became active
- 21:03 - Workers self-terminated due to empty queue for 15 minutes
- 22:08 - Worker scanner detected the workers were terminated, started new ones
The worker scanner emits a log message, monitor.periodic, which includes the duration of each loop. Here's the GCP Log Explorer query (it could likely be minimized further):
resource.type="k8s_container"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
jsonPayload.Type="monitor.periodic"
Most are logged at NOTICE. If the iteration hits the maximum of 100 minutes, it is logged at ERROR. There are about 10 ERRORs in the last 7 days. The worker-scanner does not have a mutex like the worker-provisioner does; the provisioner's mutex was associated with an incident on 2021-10-08 (https://bugzilla.mozilla.org/show_bug.cgi?id=1734959, retrospective).
I downloaded 400 of these log entries, which corresponds to about 40 hours of data.
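For reference, here's roughly how such a download can be scripted. This is a sketch assuming the google-cloud-logging Python client (v3); the payload shape (a Fields object with a duration in milliseconds) is an assumption about the log format, so adjust to the real entries:

```python
# Sketch: download recent monitor.periodic entries and summarize loop times.
from google.cloud import logging as gcp_logging

FILTER = """
resource.type="k8s_container"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
jsonPayload.Type="monitor.periodic"
"""

client = gcp_logging.Client()
durations = []  # minutes
for entry in client.list_entries(filter_=FILTER,
                                 order_by=gcp_logging.DESCENDING,
                                 max_results=400):
    # Assumed payload shape: {"Type": ..., "Fields": {"duration": <ms>, ...}}
    duration_ms = entry.payload.get("Fields", {}).get("duration")
    if duration_ms is not None:
        durations.append(duration_ms / 60000.0)

durations.sort()
if durations:
    print(f"{len(durations)} loops, median {durations[len(durations) // 2]:.1f} min, "
          f"max {durations[-1]:.1f} min")
```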
The timeline version shows periods with lots of fast loops, punctuated by sequences of large peaks. It may be useful to expose these in Grafana for monitoring:
The histogram shows that the majority of loops are under 5 minutes, but there are outliers greater than 30 minutes, out to the max of 100 minutes:
I think this supports my theory that long worker-scanner loops correspond to delays in starting new decision task workers, and explains why it looks like we have capacity but don't. My guess is that this corresponds to times when many Azure workers are starting up and/or shutting down, but more data is needed to confirm that.
Comment 7•4 years ago
I looked into the data, and there does not appear to be a correlation between the number of Azure workers and the time for a worker-scanner loop.
The worker-scanner emits a log message "scan-seen", which counts the number of workers for each provider. Here's the query to see them in Log Viewer:
resource.type="k8s_container"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
jsonPayload.Type="scan-seen"
There is a log for each "provider". For this time period, there were 6 providers:
- aws - workers seen in 402/402 loops
- azure - 0/402
- azure2 - 204/402
- fxci-level1-gcp - 4/402
- fxci-level3-gcp - 22/402
- fxci-test-gcp - 0/402
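A sketch of how such a tally can be computed from the downloaded entries; the scan-seen payload field names (Fields.providerId, Fields.seen) are assumptions:

```python
# Sketch: count, per provider, how many scanner loops reported any workers.
from collections import Counter

def loops_with_workers(payloads):
    """Tally providers across scan-seen payloads (dicts as downloaded above)."""
    seen_in = Counter()
    for payload in payloads:
        fields = payload.get("Fields", {})
        if fields.get("seen", 0) > 0:
            seen_in[fields.get("providerId", "unknown")] += 1
    return seen_in

print(loops_with_workers([{"Fields": {"providerId": "aws", "seen": 12}}]))
# -> Counter({'aws': 1}); over the full sample this yields e.g. aws: 402
```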
Here's a chart of number of workers versus the duration of that worker-scanner loop:
It looks like there is a possibly linear region when the worker-scanner takes less than 5 minutes, but when a loop takes longer than that, it looks like a bit of chaos. I tried to fit a linear trend to this, but the error was too great, which implies that count-by-provisioner is not a good predictor of how long a scan will take.
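For reference, the kind of fit I attempted looks like this (sketch with made-up stand-in data; the real inputs pair scan-seen worker counts with the matching monitor.periodic durations):

```python
import numpy as np

# Stand-in data; these numbers are made up for illustration.
workers = np.array([120, 300, 450, 800, 1500, 2200, 3100])
duration_min = np.array([2.1, 3.0, 4.8, 35.0, 6.2, 88.0, 12.0])

slope, intercept = np.polyfit(workers, duration_min, 1)
residual = duration_min - (slope * workers + intercept)
rms = np.sqrt((residual ** 2).mean())
print(f"slope={slope:.4f} min/worker, RMS error={rms:.1f} min")
# A large RMS error relative to typical durations is what "worker count is
# not a good predictor of scan time" looks like numerically.
```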
I no longer think the issue is with Azure alone. There are cases where the number of Azure workers was quite low but the scanner duration was large. Instead, it looks like sometimes (but not always) the APIs are slow, whether they are AWS or Azure. There are not enough GCP workers in the sample to tell, but I'd guess GCP has bad days as well.
A few possible next steps:
- Break up scan-seen by worker state (for Azure, "requested" and "stopping" do significant work)
- Log API calls and their duration, to get a feel for the expensive calls
- Add timeouts to each API call, to avoid a slow provider slowing down the whole system (see the sketch after this list)
- Lower the timeout on worker-scanner from 100 minutes to something much lower; 30 minutes seems safe, 5 may be possible.
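A sketch of the per-call timeout idea (asyncio-flavored Python for illustration; worker-manager itself is Node.js, and the names here are assumptions):

```python
# Bound every provider API call so one slow cloud API can't stall the
# whole scan loop.
import asyncio

API_CALL_TIMEOUT = 30  # seconds; generous for a single instance-status call

async def check_worker(provider_check, worker):
    try:
        return await asyncio.wait_for(provider_check(worker), API_CALL_TIMEOUT)
    except asyncio.TimeoutError:
        # Skip this worker rather than blocking the rest of the loop;
        # it will be rechecked on the next scan.
        return None
```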
Comment 8•4 years ago
Scatter plot of Worker Scanner duration versus the number of workers by type.
Comment 9•4 years ago
Timeline of Worker Scanner loops. Height is the time taken by the loop.
Comment 10•4 years ago
Histogram of worker-scanner durations, showing that the majority of loops are less than 5 minutes, and a few are at the maximum of 100 minutes.
Comment 11•4 years ago
Looking at the Azure provider, there is code to retry API rate limits (429) and server errors (500) with backoffs. It isn't clear to me whether these backoff delays accumulate or reset after a successful request. They also appear to be logged, but I'll need to read more code to determine the mechanism and the log message type. There are some promising message types in the docs, such as monitor.aws for reporting AWS API calls.
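To illustrate the accumulate-versus-reset question (a sketch, not worker-manager's actual code): a backoff whose delay resets on each new request stays cheap, while one that carries delay over between requests can accumulate across a long scan.

```python
import random
import time

class TransientError(Exception):
    """Stands in for an HTTP 429 or 500 from the cloud API."""

def call_with_backoff(call, max_retries=5, base=0.5):
    delay = base  # reset here: each request starts from the base delay again
    for _ in range(max_retries):
        try:
            return call()
        except TransientError:
            time.sleep(delay + random.uniform(0, base))  # jittered backoff
            delay *= 2  # doubles per retry within this one request
    raise TransientError("retries exhausted")
```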
I think the next research step is to look for handling and logging of timeouts or backoff delays in the AWS and Azure code, look for any historical instances in the logs, and see if they correlate with long scanner loops. I hope to look at this late this week or next week, but someone else with GCP log access should feel free to investigate first.
Comment 12•4 years ago
Workers once again became unresponsive at Thu, Oct 21, 11:37:32 GMT. New ones have not been spawned yet.
Later edit: workers started taking jobs at 13:15:36 GMT.
Assignee
Comment 13•4 years ago
Updated•4 years ago
Updated•4 years ago
Comment 14•4 years ago
Assignee
Comment 15•4 years ago
Push went through; worker definitions are updated. Let's hope this "workaround" holds in the short term until we can fix the real issue.
Comment 16•4 years ago
While looking for other data (the number of workers in a pool), I discovered that the worker-scanner loop time can be monitored in Grafana. The two graphs are in the "Background Tasks" section of this dashboard:
https://earthangel-b40313e5.influxcloud.net/d/kqryhOpWk/taskcluster-visuals?orgId=1
The "Successful Background Task Timings" panel shows all task timings, and can be filtered to only show "taskcluster-worker-manager-workerscanner":
The failed tasks can be seen in the "Failed Background Task Timings" panel. All of today's failed tasks are workerscanner, due to hitting the 100 minute limit.
https://earthangel-b40313e5.influxcloud.net/d/kqryhOpWk/taskcluster-visuals?orgId=1&viewPanel=16
I'm not sure why this shows 2.22 hours (134 minutes) instead of 100 minutes. Maybe there is some rounding into histogram buckets.
If the data is available in Grafana, it may be possible to create some alerts. However, I'm not sure what the action should be if the worker scanner is taking a long time to complete a loop.
Assignee
Comment 17•4 years ago
Setting severity to S3, as this hasn't happened for a while. If it occurs and causes a major issue again, then we'll have to consider bumping it up.
Assignee
Comment 18•4 years ago
Do you think we're good to close this, or does it need some extra investigation and/or documentation/patch work?
Assignee
Comment 20•4 years ago
Closing it then!