Open Bug 1799263 Opened 2 years ago Updated 2 years ago

Worker scanner: slow deletion of unattached Azure NICs.

Categories

(Taskcluster :: Operations and Service Requests, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: markco, Assigned: jmoss)

References

Details

(Whiteboard: relops-jmoss)

Attachments

(6 files, 1 obsolete file)

Attached image azure_nics.png

On the afternoon of 2022-11-04 PDT we started receiving VM creation error messages like the following for various Azure worker pools:

Subnet sn-west-europe-gecko-t with address prefix 10.0.0.0/24 does not have enough capacity for 1 IP addresses.

The total number of unattached NICs shown in the console was about 1600. The assumption is that IP addresses were not available because of the unattached NICs still lingering: an Azure /24 subnet provides only 251 usable addresses (Azure reserves 5 per subnet), so a few hundred orphaned NICs holding addresses in a single subnet is enough to exhaust it. Over time the count would drop by 4 or 5, but it would quickly climb back up and go higher.

I am currently attempting to delete a significant number of them using a script. I will attach the output of the script to this bug once it has finished.

Attached file find_nic.ps1
Attached file removed_nics.txt

The first run of the attached script removed 1156 NICs; the second run deleted 373.
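For reference (find_nic.ps1 is attached above rather than reproduced inline), the general approach is simply "list NICs, keep the ones with no attached VM, delete them". Below is a minimal Python sketch of that approach, assuming the azure-identity and azure-mgmt-network SDKs and a placeholder subscription ID; it is an illustration, not the attached script.

# Illustration only, not the attached find_nic.ps1: enumerate NICs in the
# subscription and delete those that are not attached to any VM.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

deleted = 0
for nic in client.network_interfaces.list_all():
    # A NIC with no virtual_machine reference is not attached to a VM.
    # In practice you would also filter on NIC name or tags so that only
    # worker-manager-created NICs are touched.
    if nic.virtual_machine is None:
        # The resource group is the 5th segment of the resource ID:
        # /subscriptions/<sub>/resourceGroups/<rg>/providers/...
        resource_group = nic.id.split("/")[4]
        print(f"deleting {nic.name} in {resource_group}")
        client.network_interfaces.begin_delete(resource_group, nic.name).result()
        deleted += 1

print(f"deleted {deleted} unattached NICs")

Deletions are issued one at a time in this sketch, which is slow at the scale of ~1500 NICs; that is purely for simplicity of illustration.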

I am not sure whether this was the result of a new issue or a cumulative buildup. The other odd part is that, in addition to worker-scanner, there is an Azure runbook in place that runs periodically to remove orphaned resources, so I am not sure why this problem suddenly popped up. As of now it has been over an hour since the last NIC creation error. I will spot-check things tomorrow.

This issue reappeared this morning at about 5 am PDT. I am working on getting an Azure runbook in place to get us through the weekend. However, this is highly concerning because we are hitting these errors during a period of low demand.

That script is now running every hour. That frequency seems to keep the unattached NIC count below 600, and the Taskcluster NIC creation errors seem to have stopped.

Attached image worker-scanner.png

This started on the 4th of November.

It also takes much longer now for the scanner loop to complete, which might indicate that the Azure API is extremely slow. We've seen this in the past, to the extent that some calls would never return a response and would keep the connection open forever; for that we implemented a few connection timeout tweaks.

Another reason could be an increased number of instances: possibly a launch misconfiguration, or workers starting up but not being able to pick up any tasks because of some error.
One of the pools shows a really big number of workers in the "stopping" state: https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Fwin11-64-2009-gpu
This could cause the worker-scanner loop to time out, as it needs to go through all of those stopping instances and de-provision one resource at a time for each of them. A quick remediation would be to clean up the resources manually and update the database to mark all of those instances as "stopped" so worker-manager no longer tries to handle them; a sketch of what the database side could look like is below.
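To be concrete about that remediation, the database side amounts to flipping the state of the affected workers so the scanner skips them. A minimal sketch, assuming a Postgres workers table with worker_pool_id and state columns (these names are assumptions and have not been verified against the actual worker-manager schema):

# Illustration of the remediation described above. The table and column
# names are assumptions about the worker-manager schema, not verified.
import psycopg2

conn = psycopg2.connect("dbname=taskcluster")  # placeholder connection details
with conn, conn.cursor() as cur:
    cur.execute(
        """
        UPDATE workers
           SET state = 'stopped'
         WHERE worker_pool_id = %s
           AND state = 'stopping'
        """,
        ("gecko-t/win11-64-2009-gpu",),
    )
    print(f"marked {cur.rowcount} workers as stopped")

The Azure resources for those workers would still need to be cleaned up out of band, e.g. with the NIC-removal script attached earlier in this bug.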

It looks like there may have been an issue in that pool's config that was causing an error, and the VMs never fully launched. The console never showed more than 1300 (late Friday afternoon), and the cores in use were consistent with that number.

Could a single try push using a worker with a bad config have put us in this state?

Assignee: nobody → jmoss
Status: NEW → ASSIGNED
Attachment #9302176 - Attachment is obsolete: true

In worker-pools.yml, there are two locations to define the vmSize:

pool_id.config.vmsizes.vmsize
pool_id.config.vmsizes.launchConfig.hardwareprofile.vmsize

Four pools had different values in the two locations, which may be part of this issue:

gecko-t/win10-64-2004-perf
gecko-t/win11-64-2009-gpu
gecko-t/win11-64-2009-gpu-alpha
gecko-t/win11-64-2009-perf

I think there is room for further improvement in the scanner. Since Azure requires multiple async calls for both provisioning and de-provisioning, a bigger number of instances wouldn't scale with the current setup.
There are some internal timeout settings in the worker-scanner that can be tweaked. At the moment things seem to be stable.

Pushed by jmoss@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/250f72ba59c9 Correct win11 images in both places. r=releng-reviewers,markco,jcristau

Should we try to remove the duplication somehow, or at least add a check to warn when the two values are inconsistent?

Flags: needinfo?(jmoss)

(In reply to Julien Cristau [:jcristau] from comment #14)

Should we try to remove the duplication somehow, or at least add a check to warn when the two values are inconsistent?

I believe this is something @markco wants to fix. Let's confirm that this resolves the issues we're seeing around the network exhaustion. A check could look something like the sketch below.
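For the record, such a check could be a small lint in ci-configuration. This is a minimal sketch, assuming worker-pools.yml parses into nested mappings with the two key paths named in the earlier comment; the exact key names and casing may differ in the real file.

# Sketch of a consistency check for the two vmSize locations. Key names
# follow the paths quoted earlier in this bug and may not match the real
# worker-pools.yml layout exactly.
import yaml

with open("worker-pools.yml") as f:
    pools = yaml.safe_load(f)

for pool_id, pool in pools.items():
    for entry in pool.get("config", {}).get("vmSizes", []):
        top_level = entry.get("vmSize")
        nested = (
            entry.get("launchConfig", {})
            .get("hardwareProfile", {})
            .get("vmSize")
        )
        if top_level != nested:
            print(
                f"WARNING: {pool_id}: vmSize {top_level!r} != "
                f"launchConfig.hardwareProfile.vmSize {nested!r}"
            )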

Flags: needinfo?(jmoss)
Pushed by jmoss@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/ad0b8e92de89 Fix azure region mapping. r=releng-reviewers,aki
Whiteboard: relops-jmoss
