Worker scanner: slow deletion of unattached Azure NICs.
Categories
(Taskcluster :: Operations and Service Requests, task)
Tracking
(Not tracked)
People
(Reporter: markco, Assigned: jmoss)
References
Details
(Whiteboard: relops-jmoss)
Attachments
(6 files, 1 obsolete file)
We started receiving VM creation error messages like:
Subnet sn-west-europe-gecko-t with address prefix 10.0.0.0/24 does not have enough capacity for 1 IP addresses.
These errors appeared for various Azure worker pools on the afternoon of 2022-11-04 (PDT). The total number of unattached NICs shown in the console was about 1600. The assumption is that IP addresses were not available because of the number of unattached NICs still existing. Over time the count would drop by 4 or 5, but it would quickly come back up and go higher.
I am currently attempting to delete a significant number of them using a script. I will attach the script's output to this bug once it is finished.
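For reference, below is a minimal sketch of that kind of cleanup using the Azure SDK for Python. This is an illustration only, not the script attached to this bug; the subscription ID is a placeholder and error handling is deliberately simple.

```python
# Illustrative sketch (not the attached script): delete NICs with no attached VM.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

subscription_id = "<subscription-id>"  # placeholder
client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

deleted = 0
for nic in client.network_interfaces.list_all():
    if nic.virtual_machine is None:  # unattached NICs have no VM reference
        # The resource group is segment 4 of the resource ID:
        # /subscriptions/<sub>/resourceGroups/<rg>/providers/...
        resource_group = nic.id.split("/")[4]
        try:
            client.network_interfaces.begin_delete(resource_group, nic.name).result()
            deleted += 1
            print(f"deleted {nic.name} ({resource_group})")
        except Exception as exc:  # e.g. NIC still reserved by Azure
            print(f"failed to delete {nic.name}: {exc}")

print(f"deleted {deleted} unattached NICs")
```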
Reporter
Comment 1•2 years ago
Reporter
Comment 2•2 years ago
Reporter
Comment 3•2 years ago
The first run of the attached script removed 1156 NICs; the second run deleted 373.
I am not sure whether this was the result of a new issue or a cumulative one. The other odd part is that, in addition to worker-scanner, there is an Azure runbook in place that runs periodically to remove orphaned resources, so I am not sure why this problem suddenly popped up. As of now it has been over an hour since the last NIC creation error. I will spot check things tomorrow.
Reporter
Comment 4•2 years ago
This issue reappeared this morning at about 5 am PDT. I am working on getting an Azure runbook in place to get us through the weekend. However, this is highly concerning, because we are hitting these errors during a period of low demand.
Reporter
Comment 5•2 years ago
That script is now running every hour. That frequency seems to keep the unattached NIC count below 600, and the Taskcluster NIC creation errors seem to have stopped.
Comment 6•2 years ago
This started on the 4th of November.
Comment 7•2 years ago
It also takes much longer now for the scanner loop to complete, which might indicate that the Azure API is extremely slow. We've seen this in the past, to the extent that some calls would not return any response and would keep the connection open forever. For that we implemented a few connection timeout tweaks.
Another reason could be an increased number of instances, possibly due to a launch misconfiguration, or workers starting up but not being able to pick up any tasks because of some error.
One of the pools shows a really large number of workers in the "stopping" state: https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Fwin11-64-2009-gpu
This could cause the worker scanner loop to time out, as it needs to go through all of those stopping instances and try to de-provision one resource at a time for each of them. A quick remediation would be cleaning up the resources manually and updating the database to mark all of those instances as "stopped" so the worker manager would not try to handle them.
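A rough sketch of the database side of that remediation is below, assuming direct access to the worker-manager Postgres database. The `workers` table and column names are assumptions for illustration, not verified schema, and the Azure resources themselves would still need to be cleaned up separately.

```python
# Sketch only: mark lingering "stopping" workers as "stopped" so the scanner
# stops iterating over them. Table and column names are assumptions.
import psycopg2

POOL = "gecko-t/win11-64-2009-gpu"

conn = psycopg2.connect("postgresql://user:password@host/worker_manager")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        UPDATE workers
           SET state = 'stopped'
         WHERE worker_pool_id = %s
           AND state = 'stopping'
        """,
        (POOL,),
    )
    print(f"marked {cur.rowcount} workers as stopped")
conn.close()
```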
Reporter
Comment 8•2 years ago
It looks like there may have been an issue in that pool's config that was causing an error, so the VMs never fully launched. The console never showed more than 1300 (late Friday afternoon), and the used cores were consistent with that number.
Could it have been a single try push using a worker with a bad config that put us in this state?
Assignee
Comment 9•2 years ago
Updated•2 years ago
Updated•2 years ago
Assignee
Comment 10•2 years ago
Assignee
Comment 11•2 years ago
In worker-pools.yml, there are two locations to define the vmSize:
pool_id.config.vmsizes.vmsize
pool_id.config.vmsizes.launchConfig.hardwareprofile.vmsize
Four pools had different values, which may be part of this issue:
gecko-t/win10-64-2004-perf
gecko-t/win11-64-2009-gpu
gecko-t/win11-64-2009-gpu-alpha
gecko-t/win11-64-2009-perf
Comment 12•2 years ago
I think there is room for further improvement in the scanner. Since Azure requires multiple async calls for both provisioning and de-provisioning, a larger number of instances wouldn't scale with the current setup.
There are some internal timeout settings in the worker scanner that can be tweaked. At the moment it seems to be stable.
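To illustrate the scaling concern, here is a rough sketch, not the actual worker-scanner code, of bounding the number of in-flight de-provisioning calls and giving each one a timeout. `deprovision_one`, the concurrency limit, and the timeout values are all hypothetical.

```python
# Sketch: bound concurrent de-provisioning calls and time out slow ones,
# so one slow Azure response cannot stall the whole loop.
import asyncio

MAX_IN_FLIGHT = 20          # illustrative limit on concurrent Azure calls
PER_WORKER_TIMEOUT = 120    # seconds before giving up on one worker

async def deprovision_one(worker_id: str) -> None:
    # Placeholder for the chain of async Azure calls (VM, NIC, disk, IP deletions).
    await asyncio.sleep(0.1)

async def deprovision_all(worker_ids):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def guarded(worker_id):
        async with sem:
            try:
                await asyncio.wait_for(deprovision_one(worker_id), PER_WORKER_TIMEOUT)
            except asyncio.TimeoutError:
                print(f"timed out de-provisioning {worker_id}")

    await asyncio.gather(*(guarded(w) for w in worker_ids))

if __name__ == "__main__":
    asyncio.run(deprovision_all([f"worker-{i}" for i in range(100)]))
```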
Comment 13•2 years ago
Comment 14•2 years ago
Should we try to remove the duplication somehow, or at least add a check to warn when the two values are inconsistent?
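A minimal sketch of such a warning check is below; the layout of worker-pools.yml and the key names are assumed from the two locations described in comment 11, not taken from the real file.

```python
# Sketch: warn when the two vmSize locations disagree for any pool.
# The YAML layout and key names here are assumptions for illustration.
import yaml

with open("worker-pools.yml") as f:
    pools = yaml.safe_load(f)

for pool_id, pool in pools.items():
    for entry in pool.get("config", {}).get("vmSizes", []):
        top_level = entry.get("vmSize")
        nested = (
            entry.get("launchConfig", {})
            .get("hardwareProfile", {})
            .get("vmSize")
        )
        if top_level != nested:
            print(f"WARNING: {pool_id}: vmSize mismatch ({top_level!r} vs {nested!r})")
```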
Assignee
Comment 15•2 years ago
(In reply to Julien Cristau [:jcristau] from comment #14)
Should we try to remove the duplication somehow, or at least add a check to warn when the two values are inconsistent?
I believe this is something @markco wants to fix. Let's confirm that it resolves the issues we're seeing around the network exhaustion.
Assignee
Comment 16•2 years ago
Comment 17•2 years ago
Assignee
Updated•2 years ago
Updated•2 years ago