Closed Bug 1849625 Opened 1 year ago Closed 1 year ago

b-win2022 azure workers do not spin up in a timely manner

Categories

(Infrastructure & Operations :: Relops: Azure, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: markco)

References

Details

Attachments

(3 files)

Attached image image.png

At the time of writing we've been waiting over an hour for 28 gecko-3/b-win2022 workers to spin up. I've inspected the worker manager provisioner logs, and it appears to be requesting them just fine:

EnvVersion: "2.0"
Fields: {
desiredCapacity: 56
existingCapacity: 28
maxCapacity: 500
minCapacity: 0
pendingTasks: 28
requestedCapacity: 28
scalingRatio: 1
stoppingCapacity: 0
v: 4
workerPoolId: "gecko-3/b-win2022"
}
Hostname: "taskcluster-worker-manager-provisioner-5b4f984df7-hvgsv"
Logger: "taskcluster.worker-manager.estimator"
Pid: 1
Severity: 5
Timestamp: 1692634350015000000
Type: "simple-estimate"
serviceContext: {2}

This is also reflected in the UI (see attachment).
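
For orientation, the way these estimator fields relate can be sketched roughly as follows. This is an illustrative TypeScript snippet consistent with the numbers in the log above, not the actual Taskcluster estimator implementation:

// Values taken from the log entry above.
const minCapacity = 0;
const maxCapacity = 500;
const existingCapacity = 28;
const pendingTasks = 28;
const scalingRatio = 1;

// Aim for enough capacity to cover pending tasks, clamped to the pool limits,
// and request the difference from the cloud provider.
const desiredCapacity = Math.max(
  minCapacity,
  Math.min(maxCapacity, existingCapacity + pendingTasks * scalingRatio),
);
const requestedCapacity = Math.max(0, desiredCapacity - existingCapacity);

console.log({ desiredCapacity, requestedCapacity }); // { desiredCapacity: 56, requestedCapacity: 28 }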

The pool was last active a few hours ago.

Interestingly, gecko-1/b-win2022 had workers running in the past hour.

Marking this P1/S1 as it is blocking ongoing release work.

Things just unwedged themselves; lowering priority. This is the second time this has happened this month though - we should try to find the root cause.

Severity: S1 → S2
Priority: P2 → --

Indeed, it would be good to get to the bottom of this.

The potential causes could be:

  • An issue with Worker Manager or the Azure provider code (taskcluster team)
  • An issue with Azure itself (Microsoft/Azure)
  • An issue with the machine images (relops team)

Yarik, any ideas how we can establish which of these three it was?

Flags: needinfo?(ykurmyza)

So I see a somewhat increased rate of "worker was already running" errors, which means that by the time the worker started and called registerWorker, its state was no longer REQUESTED.
The reason for that seems to be that those workers had already been stopped, after exceeding the timeout, with "terminateAfter time exceeded".

This happens when an instance is too slow to be provisioned, either because the worker scanner loops slowed down or because Azure was slow.
Grafana shows that the worker scanner was not timing out, so the only remaining explanation is Azure being slow. Maybe those regions were low on resources, or provisioning was just slow.

terminateAfter is controlled by the registrationTimeout configuration parameter.
It seems to be 1800, which is 30 min, so a pretty standard value.

Unless there's something else I don't see, I can assume it is Azure being sloooooow.
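
Roughly, the behaviour described here amounts to the check sketched below. The names are illustrative, a hedged sketch rather than the actual worker-manager code:

type WorkerState = 'requested' | 'running' | 'stopping' | 'stopped';

interface TrackedWorker {
  state: WorkerState;
  requestedAt: Date; // when worker-manager asked the cloud provider for the instance
}

const registrationTimeout = 1800; // seconds, from the worker pool config (30 min)

// terminateAfter is derived from the request time plus registrationTimeout.
function terminateAfter(worker: TrackedWorker): number {
  return worker.requestedAt.getTime() + registrationTimeout * 1000;
}

// If the instance is still REQUESTED at the deadline it gets stopped with
// "terminateAfter time exceeded"; a worker that finally boots and calls
// registerWorker afterwards then fails, since its state is no longer REQUESTED
// (hence the "worker was already running" errors).
function exceededRegistrationTimeout(worker: TrackedWorker, now: Date): boolean {
  return worker.state === 'requested' && now.getTime() > terminateAfter(worker);
}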

Flags: needinfo?(ykurmyza)

(In reply to Yarik Kurmyza [:yarik] (he/him) (UTC+1) from comment #3)

So I see a somewhat increased rate of "worker was already running" errors, which means that by the time the worker started and called registerWorker, its state was no longer REQUESTED.
The reason for that seems to be that those workers had already been stopped, after exceeding the timeout, with "terminateAfter time exceeded".

This happens when an instance is too slow to be provisioned, either because the worker scanner loops slowed down or because Azure was slow.
Grafana shows that the worker scanner was not timing out, so the only remaining explanation is Azure being slow. Maybe those regions were low on resources, or provisioning was just slow.

terminateAfter is controlled by the registrationTimeout configuration parameter.
It seems to be 1800, which is 30 min, so a pretty standard value.

Unless there's something else I don't see, I can assume it is Azure being sloooooow.

Thanks for the insight!

While it's not directly related to this issue, is this something that we can propagate to the UI somehow? E.g. similar to how we get errors when we're unable to provision at all. At least then it would be easy to know what state we're in.

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #4)

While it's not directly related to this issue, is this something that we can propagate to the UI somehow? E.g. similar to how we get errors when we're unable to provision at all. At least then it would be easy to know what state we're in.

Yes! We definitely have a few things to improve there! We can show the pools with errors separately, plus show why requested workers are not being started on time. Then, when you open a worker pool in the UI, you'd have a much better overview.

Assignee: nobody → mcornmesser
Status: NEW → ASSIGNED
Pushed by mcornmesser@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/fc1e9e443684 Upgrade untrusted 2022 images with optimized startup. r=relsre-reviewers,MasterWayZ

(In reply to Yarik Kurmyza [:yarik] (he/him) (UTC+1) from comment #3)

Unless there's something else I don't see, I can assume it is Azure being sloooooow.

I think this could be the Windows images themselves performing some expensive first-boot initialisation (e.g. workers installing toolchains on first boot, rather than having the toolchains burned into the machine image). That would be an issue with the machine images created by Mozilla, rather than with Azure hardware being slow per se.

The distinction here is pretty important, since we can do a lot about inefficient machine images, but not much about slow cloud services. I think any terminations that Worker Manager triggers should be based on the time since the worker's launch, rather than the time the worker was requested from the cloud provider. That way, the timeout we set shouldn't have the negative effect of killing workers that have not had a reasonable amount of time to boot. It should only terminate workers that had plenty of time to boot but were not able to (i.e. bad images, not slow Azure).

Yarik, can you confirm that the 30 min timeout is measured from having a running, operational worker, rather than from the moment we first request a worker instance from Azure? If that is the case, I think we can point directly at bad machine images created by Mozilla rather than at Azure infrastructure issues.

Flags: needinfo?(ykurmyza)

Yarik, can you confirm that the 30 min timeout is measured from having a running, operational worker, rather than from the moment we first request a worker instance from Azure? If that is the case, I think we can point directly at bad machine images created by Mozilla rather than at Azure infrastructure issues.

The latter is actually the case: the 30 minutes are measured from the moment the worker was requested, not from when it fully launched.

Flags: needinfo?(ykurmyza) → needinfo?(pmoore)

OK, I'll create an issue so we can change that. Maybe in the purge loop we can look at the time the instance has been running, and kill ones that have been running over 30 mins but have not yet called queue.claimWork.
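
A minimal sketch of that idea, assuming hypothetical fields such as launchTime and firstClaimAt (not the existing worker-manager purge loop):

// Measure from the time the instance actually started running, not from when
// it was requested, and only kill instances that have been up for 30+ minutes
// without ever claiming work.
const MAX_UNCLAIMED_RUNTIME_MS = 30 * 60 * 1000;

interface RunningInstance {
  launchTime: Date;    // when the VM started running in Azure (hypothetical field)
  firstClaimAt?: Date; // set once the worker has called queue.claimWork (hypothetical)
}

function shouldPurge(instance: RunningInstance, now: Date): boolean {
  const runningForMs = now.getTime() - instance.launchTime.getTime();
  return instance.firstClaimAt === undefined && runningForMs > MAX_UNCLAIMED_RUNTIME_MS;
}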

Flags: needinfo?(pmoore)
Pushed by mcornmesser@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/6768b0588020 Upgrade TRUSTED 2022 images with optimized startup. r=relsre-reviewers,MasterWayZ
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Component: Workers → Relops: Azure
Product: Taskcluster → Infrastructure & Operations
Regressions: 1851027

(In reply to Pete Moore [:pmoore][:pete] from comment #9)

I think this could be the Windows images themselves performing some expensive first-boot initialisation (e.g. workers installing toolchains on first boot, rather than having the toolchains burned into the machine image). That would be an issue with the machine images created by Mozilla, rather than with Azure hardware being slow per se.

With the changes that landed this week, there has been a significant reduction in the time between VM start-up and worker-runner starting: it has gone from 30+ minutes to around 2 minutes.

Using gj8qrccqrz60aij/vm-au94c9yrrh2uqrsqasgrdguc1x4oxnq9ejv as an example.

First log:
Sep 01 11:20:32 gj8qrccqrz60aij.reddog.microsoft.com Microsoft-Windows-TerminalServices-LocalSessionManager Local multi-user session manager received system shutdown message

Name change to match the worker name (takes effect on the next start-up):
Sep 01 11:21:10 gj8qrccqrz60aij.reddog.microsoft.com MaintainSystem Check-AzVM-Name :: Name changed to vm-au94c9yrrh2uqrsqasgrdguc1x4oxnq9ejv

First worker-runner log:
Sep 01 11:22:16 gj8qrccqrz60aij.reddog.microsoft.com MaintainSystem StartWorkerRunner :: end - 2023-09-01T18:22:15.8625364Z
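
For reference, that is about 1 minute 44 seconds between the first log (11:20:32) and StartWorkerRunner finishing (11:22:16), in line with the roughly two-minute figure above.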
