b-win2022 azure workers do not spin up in a timely manner
Categories
(Infrastructure & Operations :: Relops: Azure, defect)
Tracking
(Not tracked)
People
(Reporter: bhearsum, Assigned: markco)
References
Details
Attachments
(3 files)
At the time of writing we've been waiting over an hour for 28 gecko-3/b-win2022 workers to spin up. I've inspected the worker manager provisioner logs, and it appears to be requesting them just fine:
EnvVersion: "2.0"
Fields: {
desiredCapacity: 56
existingCapacity: 28
maxCapacity: 500
minCapacity: 0
pendingTasks: 28
requestedCapacity: 28
scalingRatio: 1
stoppingCapacity: 0
v: 4
workerPoolId: "gecko-3/b-win2022"
}
Hostname: "taskcluster-worker-manager-provisioner-5b4f984df7-hvgsv"
Logger: "taskcluster.worker-manager.estimator"
Pid: 1
Severity: 5
Timestamp: 1692634350015000000
Type: "simple-estimate"
serviceContext: {2}
This is also reflected in the UI (see attachment).
The pool was last active a few hours ago.
Interestingly, gecko-1/b-win2022 had workers running in the past hour.
Marking this P1/S1 as it is blocking ongoing release work.
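For reference, the numbers in the simple-estimate log line above fit together roughly like this. This is only a sketch of the arithmetic, not the actual worker-manager estimator code, and the function name is made up:
def requested_capacity(pending_tasks, existing_capacity,
                       min_capacity, max_capacity, scaling_ratio=1.0):
    # Aim for enough capacity to cover the pending tasks (scaled),
    # clamp to the pool's min/max, then request only the difference
    # from the capacity that already exists.
    desired = existing_capacity + int(pending_tasks * scaling_ratio)
    desired = max(min_capacity, min(desired, max_capacity))
    return max(0, desired - existing_capacity)

# Values from the log line: desired 56, existing 28 -> request 28 more.
print(requested_capacity(pending_tasks=28, existing_capacity=28,
                         min_capacity=0, max_capacity=500))  # 28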
Reporter
Comment 1•1 year ago
Things just unwedged themselves; lowering priority. This is the second time this has happened this month though - we should try to find the root cause.
Comment 2•1 year ago
Indeed, it would be good to get to the bottom of this.
The potential causes could be:
- An issue with Worker Manager or the Azure provider code (taskcluster team)
- An issue with Azure itself (Microsoft/Azure)
- An issue with the machine images (relops team)
Yarik, any ideas how we can establish which of these three it was?
Comment 3•1 year ago
So I see a somewhat increased rate of "worker was already running" errors, which means that by the time the worker started and called registerWorker, its state was not REQUESTED.
The reason for that seems to be that those workers had already been stopped after exceeding the timeout ("terminateAfter time exceeded").
This happens when an instance is too slow to be provisioned, either because the worker scanner loops slowed down or because Azure was slow.
Grafana shows that the worker scanner was not timing out, so the only remaining explanation is Azure being slow. Maybe those regions were low on resources, or provisioning was just slow.
terminateAfter is controlled by the registrationTimeout configuration parameter. It seems to be 1800, which is 30 minutes, so a pretty standard value.
Unless there's something else I don't see, I can assume it is Azure being sloooooow.
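To illustrate the mechanism being described, here is a minimal sketch with made-up names (not the actual worker-manager code): registrationTimeout sets a terminateAfter deadline measured from the moment the worker was requested, and a worker still in the REQUESTED state past that deadline gets killed.
import time

REGISTRATION_TIMEOUT = 1800  # seconds (30 min), from the pool configuration

def terminate_after(requested_at):
    # The deadline is counted from when the worker was *requested*,
    # not from when the VM actually finished provisioning.
    return requested_at + REGISTRATION_TIMEOUT

def should_terminate(state, requested_at, now=None):
    now = time.time() if now is None else now
    # A worker that never managed to call registerWorker (still REQUESTED)
    # by the deadline is stopped with "terminateAfter time exceeded".
    return state == "REQUESTED" and now > terminate_after(requested_at)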
Reporter
Comment 4•1 year ago
(In reply to Yarik Kurmyza [:yarik] (he/him) (UTC+1) from comment #3)
So I see a somewhat increased rate of "worker was already running" errors, which means that by the time the worker started and called registerWorker, its state was not REQUESTED.
The reason for that seems to be that those workers had already been stopped after exceeding the timeout ("terminateAfter time exceeded").
This happens when an instance is too slow to be provisioned, either because the worker scanner loops slowed down or because Azure was slow.
Grafana shows that the worker scanner was not timing out, so the only remaining explanation is Azure being slow. Maybe those regions were low on resources, or provisioning was just slow.
terminateAfter is controlled by the registrationTimeout configuration parameter. It seems to be 1800, which is 30 minutes, so a pretty standard value.
Unless there's something else I don't see, I can assume it is Azure being sloooooow.
Thanks for the insight!
While it's not directly related to this issue, is this something that we can propagate to the UI somehow? E.g., similar to how we get errors when we're unable to provision at all. At least then it would be easy to know the state we're in.
Comment 5•1 year ago
(In reply to bhearsum@mozilla.com (:bhearsum) from comment #4)
While it's not directly related to this issue, is this something that we can propagate to the UI somehow? E.g., similar to how we get errors when we're unable to provision at all. At least then it would be easy to know the state we're in.
Yes! We definitely have a few things to improve there! We can show the pools with errors separately, plus show why those requested workers are not being started on time. So when you open a worker pool in the UI you'll have a much better overview.
Assignee
Comment 6•1 year ago
Updated•1 year ago
Assignee
Comment 7•1 year ago
Comment 9•1 year ago
(In reply to Yarik Kurmyza [:yarik] (he/him) (UTC+1) from comment #3)
Unless there's something else I don't see, I can assume it is Azure being sloooooow.
I think this could be the Windows images themselves performing some expensive first-boot initialisation (e.g. if workers are installing toolchains on first boot, rather than having the toolchains burned into the machine image). That would be an issue with the machine images created by Mozilla, rather than Azure hardware being slow per se.
The distinction here is pretty important, since we can do a lot about inefficient machine images, but we can't do much about slow cloud services. I think any terminations that Worker Manager triggers should be based on the time since the worker launched, rather than the time when the worker was requested from the cloud provider. That way, the timeout shouldn't have the negative effect of killing workers that have not had a reasonable time to boot up; it should only terminate workers that had plenty of time to boot up but were not able to (i.e. bad images, not slow Azure).
Yarik, can you confirm whether the 30 min timeout is measured from having a running, operational worker, rather than from the moment we first request a worker instance from Azure? If so, I think we can point directly at bad machine images created by Mozilla, rather than at Azure infrastructure issues.
Comment 10•1 year ago
Yarik, can you confirm whether the 30 min timeout is measured from having a running, operational worker, rather than from the moment we first request a worker instance from Azure? If so, I think we can point directly at bad machine images created by Mozilla, rather than at Azure infrastructure issues.
The latter is actually the case: the 30 minutes is measured from the moment the worker was requested, not from when it fully launched.
Comment 11•1 year ago
OK, I'll create an issue so we can change that. Maybe in the purge loop we can look at the time the instance has been running, and kill ones that have been running over 30 mins but have not yet called queue.claimWork.
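A rough sketch of that proposal, with hypothetical field names (this is not existing Taskcluster code): the purge loop would key off how long the instance has actually been running and whether it has claimed work, instead of how long ago it was requested.
import time

RUNNING_TIMEOUT = 1800  # seconds (30 min)

def purge_candidates(workers, now=None):
    # Yield workers that have been running for longer than the timeout
    # without ever calling queue.claimWork.
    now = time.time() if now is None else now
    for worker in workers:
        running_for = now - worker["launched_at"]  # time since the VM started
        if running_for > RUNNING_TIMEOUT and not worker["has_claimed_work"]:
            yield worker

# Under this rule a worker requested 45 minutes ago that only launched
# 5 minutes ago is left alone; one that has been up for 40 minutes
# without claiming work is killed.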
Comment 12•1 year ago
Comment 13•1 year ago
Updated•1 year ago
Updated•1 year ago
Assignee
Comment 14•1 year ago
(In reply to Pete Moore [:pmoore][:pete] from comment #9)
I think this could be the Windows images themselves performing some expensive first-boot initialisation (e.g. if workers are installing toolchains on first boot, rather than having the toolchains burned into the machine image). That would be an issue with the machine images created by Mozilla, rather than Azure hardware being slow per se.
With the changes that landed this week, there has been a significant reduction in the time between VM start-up and worker-runner starting: it has gone from 30+ minutes to around 2 minutes.
Using gj8qrccqrz60aij / vm-au94c9yrrh2uqrsqasgrdguc1x4oxnq9ejv as an example:
First log:
Sep 01 11:20:32 gj8qrccqrz60aij.reddog.microsoft.com Microsoft-Windows-TerminalServices-LocalSessionManager Local multi-user session manager received system shutdown message
Name change to match the worker name (takes effect on the next start-up):
Sep 01 11:21:10 gj8qrccqrz60aij.reddog.microsoft.com MaintainSystem Check-AzVM-Name :: Name changed to vm-au94c9yrrh2uqrsqasgrdguc1x4oxnq9ejv
First worker-runner log:
Sep 01 11:22:16 gj8qrccqrz60aij.reddog.microsoft.com MaintainSystem StartWorkerRunner :: end - 2023-09-01T18:22:15.8625364Z
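For what it's worth, the gap between the first and last log lines quoted above works out to well under two minutes:
from datetime import datetime

first_boot = datetime.strptime("11:20:32", "%H:%M:%S")
worker_runner = datetime.strptime("11:22:16", "%H:%M:%S")
print(worker_runner - first_boot)  # 0:01:44, i.e. roughly 2 minutes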