gecko-3/decision workers not taking tasks
Categories
(Release Engineering :: Firefox-CI Administration, defect)
Tracking
(Not tracked)
People
(Reporter: aryx, Assigned: masterwayz)
References
Details
Attachments
(4 files)
The workers in the pool gecko-3/decision stopped taking tasks. https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-3/worker-types/decision?sortBy=First%20Claim&sortDirection=desc lists no running tasks, while https://treeherder.mozilla.org/jobs?repo=mozilla-central&group_state=expanded&selectedTaskRun=BG1VjjKeS4CkmGtpceizig.0&searchStr=decision shows 8 tasks that have been waiting 40-50 minutes to run.
A similar decision worker issue was observed in bug 1735411 last week, but it was not that bug's primary focus because the workers eventually started taking tasks without human intervention.
Reporter
Comment 1•4 years ago
New worker(s) started taking tasks again.
Comment 2•4 years ago
There were four pool errors about being unable to allocate capacity:
- 2021-10-18T10:43:29.690Z - No spot capacity that matches your request
- 2021-10-18T09:24:45.686Z - No spot capacity that matches your request
- 2021-10-18T08:59:22.712Z - Insufficient m5d.xlarge capacity in us-east-1c. "You can currently get m5d.xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f"
- 2021-10-18T08:53:47.191Z - No spot capacity that matches your request
This roughly lines up with when this bug was filed, but I'm assuming some spot requests were fulfilled.
I observed the pool requesting 2 additional workers, and these transitioned from "requested" to "running" in less than 5 minutes.
Pool capacity is calculated as the number of workers in the "requested", "running", and "stopping" states, times the capacity per worker. The gecko-3/decision pool is an AWS pool, so it doesn't have the "stopping" state, and each worker provides 2 capacity.
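To make the arithmetic concrete, here's a minimal sketch of that rule (illustrative Python; not worker-manager's actual code):

```python
# Minimal sketch of the capacity rule above; names are illustrative.
CAPACITY_PER_WORKER = 2  # gecko-3/decision runs 2 capacity per worker

def pool_capacity(workers_by_state):
    # Only these states count toward capacity; AWS pools never report
    # "stopping", so that term is simply 0 for gecko-3/decision.
    counted = ("requested", "running", "stopping")
    return sum(workers_by_state.get(s, 0) for s in counted) * CAPACITY_PER_WORKER

# The 5 running us-east-1 workers in the table below contribute 5 * 2 = 10:
print(pool_capacity({"running": 5}))  # -> 10
```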
Current running workers are distributed across AWS regions:
| Pool ID | Group ID | Provider ID | State | Workers | Capacity |
|---|---|---|---|---|---|
| gecko-3/decision | us-east-1 | aws | running | 5 | 10 |
| gecko-3/decision | us-east-1 | aws | stopped | 636 | 1272 |
| gecko-3/decision | us-west-1 | aws | running | 2 | 4 |
| gecko-3/decision | us-west-1 | aws | stopped | 276 | 552 |
| gecko-3/decision | us-west-2 | aws | running | 3 | 6 |
| gecko-3/decision | us-west-2 | aws | stopped | 383 | 766 |
If there was an AWS shortage, it was limited to around when this bug was opened, and is now cleared.
Assignee
Comment 3•4 years ago
Adding some info here: this is occurring again right now.
Looking at the AWS console (assuming I'm on the right account and should be able to see gecko-3/decision workers too), there are no active gecko-3/decision workers, only gecko-1. All gecko-3/decision workers show as terminated from a while back, and no new ones exist.
Assignee
Comment 4•4 years ago
I have been refreshing the AWS console and now workers are back.
The reason behind the theory below is that, a few minutes after the AWS console reported no decision workers at all (not even in a terminated state), Taskcluster's UI reported a capacity of 0, and then everything went back up and running.
Possibly something like this:
- A worker gets terminated somehow.
- worker-manager doesn't know it has been terminated.
- When checking whether the VM exists, worker-manager doesn't realize it is gone: AWS may still report the instance, but in a terminated state, so worker-manager thinks the VM still exists.
- When the terminated VMs no longer exist on the AWS side, worker-manager no longer sees them and goes "hey, the capacity is now 0 (or below 20 at least)".
- We get back up and running.
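As a minimal sketch of that hypothesis (purely illustrative Python; not the actual worker-manager code):

```python
# Illustrative only: a scanner that trusts "instance is still visible in the
# cloud API" keeps counting terminated workers toward capacity until the API
# stops listing them entirely.
def apparent_capacity(instances, capacity_per_worker=2):
    capacity = 0
    for inst in instances:
        # Hypothesis: only existence is checked, not the instance state, and
        # AWS keeps listing terminated instances for a while after shutdown.
        if inst["visible_in_api"]:
            capacity += capacity_per_worker
    return capacity

# Terminated but still listed by AWS -> still counted as 2 capacity:
print(apparent_capacity([{"visible_in_api": True, "state": "terminated"}]))
```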
Comment 5•4 years ago
Thanks! I was able to see the incident while it was going on.
From worker manager's perspective, there were 10 VMs (2 capacity per VM, so 20 capacity). All of them were started around 20:30 to 20:45 UTC, which was about 2 hours before this incident, and hadn't been checked since.
At 22:08Z (a few minutes later, but 100 minutes after the VMs were created), the 10 VMs were moved to "stopping" state, and 7 new VMs were launched and quickly started taking tasks.
The Worker Manager service's Worker Scanner is responsible for detecting self-terminated workers, as well as provisioning and de-provisioning Azure workers (AWS workers are provisioned by the Worker Provisioner service). The Worker Scanner loops through all workers; it has been slower since Azure workers were added, and this may be especially bad when many workers are in the "starting" or "stopping" states.
Based on the timestamps, it really took 100 minutes for a Worker Scanner cycle. Peeking at gecko-t/win10-64-2004, there are many workers starting and stopping:
| State | Workers | Capacity |
|---|---|---|
| requested | 228 | 228 |
| running | 148 | 148 |
| stopping | 357 | 357 |
| stopped | 16047 | 16047 |
If this theory is true, decision workers will take an hour to be replaced after self-termination, until the Azure pools have stabilized.
Comment 6•4 years ago
masterwayz looked at the worker logs, so we can add self-termination to the timeline:
- 20:30 to 20:45 - Workers became active
- 21:03 - Workers self-terminated due to empty queue for 15 minutes
- 22:08 - Worker scanner detected the workers were terminated, started new ones
The worker scanner emits a log message, monitor.periodic, which includes the duration of each loop. Here's the GCP Log Explorer query (it could likely be minimized further):
resource.type="k8s_container"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
jsonPayload.Type="monitor.periodic"
Most are logged at NOTICE. If the iteration hits the maximum of 100 minutes, it is logged at ERROR. There are about 10 ERRORs in the last 7 days. The worker-scanner does not have a mutex like the worker-provisioner does; the provisioner's mutex was associated with an incident on 2021-10-08 (https://bugzilla.mozilla.org/show_bug.cgi?id=1734959, retrospective).
I downloaded 400 of these log entries, which corresponds to about 40 hours of data.
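For reference, here's roughly how such a download can be scripted. This is a sketch assuming the google-cloud-logging Python client (v3); the payload shape (a Fields object with a duration in milliseconds) is an assumption about the log format, so adjust to the real entries:

```python
# Sketch: download recent monitor.periodic entries and summarize loop times.
from google.cloud import logging as gcp_logging

FILTER = """
resource.type="k8s_container"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
jsonPayload.Type="monitor.periodic"
"""

client = gcp_logging.Client()
durations = []  # minutes
for entry in client.list_entries(filter_=FILTER,
                                 order_by=gcp_logging.DESCENDING,
                                 max_results=400):
    # Assumed payload shape: {"Type": ..., "Fields": {"duration": <ms>, ...}}
    duration_ms = entry.payload.get("Fields", {}).get("duration")
    if duration_ms is not None:
        durations.append(duration_ms / 60000.0)

durations.sort()
if durations:
    print(f"{len(durations)} loops, median {durations[len(durations) // 2]:.1f} min, "
          f"max {durations[-1]:.1f} min")
```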
The timeline version shows periods with lots of fast loops, punctuated by sequences of large peaks. It may be useful to expose these in Grafana for monitoring:
The histogram shows that the majority of loops are under 5 minutes, but there are outliers greater than 30 minutes, out to the max of 100 minutes:
I think this supports my theory that long worker-scanner loops correspond to delays in starting new decision task workers, and explains why it looks like we have capacity but don't. My guess is that this corresponds to times when many Azure workers are starting up and/or shutting down, but more data is needed to confirm that.
Comment 7•4 years ago
I looked into the data, and there does not appear to be a correlation between the number of Azure workers and the time for a worker-scanner loop.
The worker-scanner emits a log message "scan-seen", which counts the number of workers for each provider. Here's the query to see them in Log Viewer:
resource.type="k8s_container"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
jsonPayload.Type="scan-seen"
There is a log for each "provider". For this time period, there were 6 providers:
- aws - workers seen in 402/402 loops
- azure - 0/402
- azure2 - 204/402
- fxci-level1-gcp - 4/402
- fxci-level3-gcp - 22/402
- fxci-test-gcp - 0/402
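A sketch of how such a tally can be computed from the downloaded entries; the scan-seen payload field names (Fields.providerId, Fields.seen) are assumptions:

```python
# Sketch: count, per provider, how many scanner loops reported any workers.
from collections import Counter

def loops_with_workers(payloads):
    """Tally providers across scan-seen payloads (dicts as downloaded above)."""
    seen_in = Counter()
    for payload in payloads:
        fields = payload.get("Fields", {})
        if fields.get("seen", 0) > 0:
            seen_in[fields.get("providerId", "unknown")] += 1
    return seen_in

print(loops_with_workers([{"Fields": {"providerId": "aws", "seen": 12}}]))
# -> Counter({'aws': 1}); over the full sample this yields e.g. aws: 402
```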
Here's a chart of number of workers versus the duration of that worker-scanner loop:
It looks like there is a possibly linear region when the worker-scanner takes less than 5 minutes, but when a loop takes longer than that, it looks like a bit of chaos. I tried to fit a linear trend to this, but the error was too great, which implies that count-by-provisioner is not a good predictor of how long a scan will take.
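For reference, the kind of fit I attempted looks like this (sketch with made-up stand-in data; the real inputs pair scan-seen worker counts with the matching monitor.periodic durations):

```python
import numpy as np

# Stand-in data; these numbers are made up for illustration.
workers = np.array([120, 300, 450, 800, 1500, 2200, 3100])
duration_min = np.array([2.1, 3.0, 4.8, 35.0, 6.2, 88.0, 12.0])

slope, intercept = np.polyfit(workers, duration_min, 1)
residual = duration_min - (slope * workers + intercept)
rms = np.sqrt((residual ** 2).mean())
print(f"slope={slope:.4f} min/worker, RMS error={rms:.1f} min")
# A large RMS error relative to typical durations is what "worker count is
# not a good predictor of scan time" looks like numerically.
```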
I no longer think the issue is with Azure alone. There are cases where the number of Azure workers was quite low but the scanner duration was large. Instead, it looks like sometimes (but not always) the APIs are slow, whether they are AWS or Azure. There are not enough GCP workers in the sample to tell, but I'd guess GCP has bad days as well.
A few possible next steps:
- Break up scan-seen by worker state (for Azure, "requested" and "stopping" do significant work)
- Log API calls and their duration, to get a feel for the expensive calls
- Add timeouts to each API call, to avoid a slow provider slowing down the whole system (see the sketch after this list)
- Lower the timeout on worker-scanner from 100 minutes to something much lower; 30 minutes seems safe, 5 may be possible.
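A sketch of the per-call timeout idea (asyncio-flavored Python for illustration; worker-manager itself is Node.js, and the names here are assumptions):

```python
# Bound every provider API call so one slow cloud API can't stall the
# whole scan loop.
import asyncio

API_CALL_TIMEOUT = 30  # seconds; generous for a single instance-status call

async def check_worker(provider_check, worker):
    try:
        return await asyncio.wait_for(provider_check(worker), API_CALL_TIMEOUT)
    except asyncio.TimeoutError:
        # Skip this worker rather than blocking the rest of the loop;
        # it will be rechecked on the next scan.
        return None
```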
Comment 8•4 years ago
Scatter plot of Worker Scanner duration versus the number of workers by type.
Comment 9•4 years ago
Timeline of Worker Scanner loops. Height is the time taken by the loop.
Comment 10•4 years ago
Histogram of worker-scanner durations, showing that the majority of loops are less than 5 minutes, and a few are at the maximum of 100 minutes.
Comment 11•4 years ago
Looking at the Azure provider, there is code to retry API rate limits (429) and server errors (500) with backoffs. It isn't clear to me whether these backoff delays accumulate or reset after a successful request. They also appear to be logged, but I'll need to read more code to determine the mechanism and the log message type. There are some promising message types in the docs, such as monitor.aws for reporting AWS API calls.
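To illustrate the accumulate-versus-reset question (a sketch, not worker-manager's actual code): a backoff whose delay resets on each new request stays cheap, while one that carries delay over between requests can accumulate across a long scan.

```python
import random
import time

class TransientError(Exception):
    """Stands in for an HTTP 429 or 500 from the cloud API."""

def call_with_backoff(call, max_retries=5, base=0.5):
    delay = base  # reset here: each request starts from the base delay again
    for _ in range(max_retries):
        try:
            return call()
        except TransientError:
            time.sleep(delay + random.uniform(0, base))  # jittered backoff
            delay *= 2  # doubles per retry within this one request
    raise TransientError("retries exhausted")
```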
I think the next research step is to look for handling and logging of timeouts or backoff delays in the AWS and Azure code, look for any historical instances in the logs, and see if they correlate with long scanner loops. I hope to look at this late this week or next week, but someone else with GCP log access should feel free to investigate first.
Comment 12•4 years ago
Workers once again became unresponsive at Thu, Oct 21, 11:37:32 GMT. New ones have not been spawned yet.
Later edit: workers started taking jobs at 13:15:36 GMT.
Assignee
Comment 13•4 years ago
Updated•4 years ago
Updated•4 years ago
Comment 14•4 years ago
Assignee
Comment 15•4 years ago
Push went through; worker definitions are updated. Let's hope this "workaround" holds in the short term until we can fix the real issue.
Comment 16•4 years ago
While looking for other data (the number of workers in a pool), I discovered that the worker-scanner loop time can be monitored in Grafana. The two graphs are in the "Background Tasks" section of this dashboard:
https://earthangel-b40313e5.influxcloud.net/d/kqryhOpWk/taskcluster-visuals?orgId=1
The "Successful Background Task Timings" panel shows all task timings, and can be filtered to only show "taskcluster-worker-manager-workerscanner":
The failed tasks can be seen in the "Failed Background Task Timings" panel. All of today's failed tasks are workerscanner, due to hitting the 100 minute limit.
https://earthangel-b40313e5.influxcloud.net/d/kqryhOpWk/taskcluster-visuals?orgId=1&viewPanel=16
I'm not sure why this shows 2.22 hours (134 minutes) instead of 100 minutes. Maybe there is some rounding into histogram buckets.
If the data is available in Grafana, it may be possible to create some alerts. However, I'm not sure what the action should be if the worker scanner is taking a long time to complete a loop.
Assignee
Comment 17•4 years ago
Setting severity to S3, as this hasn't happened for a while. If it occurs and causes a major issue again, then we'll have to consider bumping it up.
Assignee
Comment 18•4 years ago
Do you think we're good to close this, or does it need some extra investigation and/or documentation/patch work?
Assignee
Comment 20•4 years ago
Closing it then!