Closed Bug 1849625 Opened 1 year ago Closed 1 year ago

b-win2022 azure workers do not spin up in a timely manner

Categories

(Infrastructure & Operations :: Relops: Azure, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: markco)

References

Details

Attachments

(3 files)

Attached image image.png

At the time of writing we've been waiting over an hour for 28 gecko-3/b-win2022 workers to spin up. I've inspected the worker manager provisioner logs, and it appears to be requesting them just fine:

EnvVersion: "2.0"
Fields: {
desiredCapacity: 56
existingCapacity: 28
maxCapacity: 500
minCapacity: 0
pendingTasks: 28
requestedCapacity: 28
scalingRatio: 1
stoppingCapacity: 0
v: 4
workerPoolId: "gecko-3/b-win2022"
}
Hostname: "taskcluster-worker-manager-provisioner-5b4f984df7-hvgsv"
Logger: "taskcluster.worker-manager.estimator"
Pid: 1
Severity: 5
Timestamp: 1692634350015000000
Type: "simple-estimate"
serviceContext: {2}

This is also reflected in the UI (see attachment).
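
For orientation, the way these estimator fields relate can be sketched roughly as follows. This is an illustrative TypeScript snippet consistent with the numbers in the log above, not the actual Taskcluster estimator implementation:

// Values taken from the log entry above.
const minCapacity = 0;
const maxCapacity = 500;
const existingCapacity = 28;
const pendingTasks = 28;
const scalingRatio = 1;

// Aim for enough capacity to cover pending tasks, clamped to the pool limits,
// and request the difference from the cloud provider.
const desiredCapacity = Math.max(
  minCapacity,
  Math.min(maxCapacity, existingCapacity + pendingTasks * scalingRatio),
);
const requestedCapacity = Math.max(0, desiredCapacity - existingCapacity);

console.log({ desiredCapacity, requestedCapacity }); // { desiredCapacity: 56, requestedCapacity: 28 }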

The pool was last active a few hours ago.

Interestingly, gecko-1/b-win2022 had workers running in the past hour.

Marking this P1/S1 as it is blocking ongoing release work.

Things just unwedged themselves; lowering priority. This is the second time this has happened this month though - we should try to find the root cause.

Severity: S1 → S2
Priority: P2 → --

Indeed, it would be good to get to the bottom of this.

The potential causes could be:

  • An issue with Worker Manager or the Azure provider code (taskcluster team)
  • An issue with Azure itself (Microsoft/Azure)
  • An issue with the machine images (relops team)

Yarik, any ideas how we can establish which of these three it was?

Flags: needinfo?(ykurmyza)

So I see a somewhat increased rate of "worker was already running" errors, which means that by the time the worker started and called registerWorker, its state was no longer REQUESTED.
The reason for that seems to be that those workers had already been stopped, after exceeding the timeout, with "terminateAfter time exceeded".

This happens when an instance is too slow to be provisioned, either because the worker scanner loops slowed down or because Azure was slow.
Grafana shows that the worker scanner was not timing out, so the only remaining explanation is Azure being slow. Maybe those regions were low on resources, or provisioning was just slow.

terminateAfter is controlled by the registrationTimeout configuration parameter.
It seems to be 1800, which is 30 min, so a pretty standard value.

Unless there's something else I don't see, I can assume it is Azure being sloooooow.
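
Roughly, the behaviour described here amounts to the check sketched below. The names are illustrative, a hedged sketch rather than the actual worker-manager code:

type WorkerState = 'requested' | 'running' | 'stopping' | 'stopped';

interface TrackedWorker {
  state: WorkerState;
  requestedAt: Date; // when worker-manager asked the cloud provider for the instance
}

const registrationTimeout = 1800; // seconds, from the worker pool config (30 min)

// terminateAfter is derived from the request time plus registrationTimeout.
function terminateAfter(worker: TrackedWorker): number {
  return worker.requestedAt.getTime() + registrationTimeout * 1000;
}

// If the instance is still REQUESTED at the deadline it gets stopped with
// "terminateAfter time exceeded"; a worker that finally boots and calls
// registerWorker afterwards then fails, since its state is no longer REQUESTED
// (hence the "worker was already running" errors).
function exceededRegistrationTimeout(worker: TrackedWorker, now: Date): boolean {
  return worker.state === 'requested' && now.getTime() > terminateAfter(worker);
}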

Flags: needinfo?(ykurmyza)

(In reply to Yarik Kurmyza [:yarik] (he/him) (UTC+1) from comment #3)

So I see a somewhat increased rate of "worker was already running" errors, which means that by the time the worker started and called registerWorker, its state was no longer REQUESTED.
The reason for that seems to be that those workers had already been stopped, after exceeding the timeout, with "terminateAfter time exceeded".

This happens when an instance is too slow to be provisioned, either because the worker scanner loops slowed down or because Azure was slow.
Grafana shows that the worker scanner was not timing out, so the only remaining explanation is Azure being slow. Maybe those regions were low on resources, or provisioning was just slow.

terminateAfter is controlled by the registrationTimeout configuration parameter.
It seems to be 1800, which is 30 min, so a pretty standard value.

Unless there's something else I don't see, I can assume it is Azure being sloooooow.

Thanks for the insight!

While it's not directly related to this issue, is this something that we can propagate to the UI somehow? E.g. similar to how we get errors when we're unable to provision at all. At least then it would be easy to know what state we're in.

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #4)

While it's not directly related to this issue, is this something that we can propagate to the UI somehow? E.g. similar to how we get errors when we're unable to provision at all. At least then it would be easy to know what state we're in.

Yes! We definitely have a few things to improve there! We can show the pools with errors separately, plus show why requested workers are not being started on time. Then, when you open a worker pool in the UI, you'd have a much better overview.

Assignee: nobody → mcornmesser
Status: NEW → ASSIGNED
Pushed by mcornmesser@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/fc1e9e443684 Upgrade untrusted 2022 images with optimized startup. r=relsre-reviewers,MasterWayZ

(In reply to Yarik Kurmyza [:yarik] (he/him) (UTC+1) from comment #3)

Unless there's something else I don't see, I can assume it is Azure being sloooooow.

I think this could be the Windows images themselves performing some expensive first-boot initialisation (e.g. workers installing toolchains on first boot, rather than having the toolchains burned into the machine image). That would be an issue with the machine images created by Mozilla, rather than with Azure hardware being slow per se.

The distinction here is pretty important, since we can do a lot about inefficient machine images, but not much about slow cloud services. I think any terminations that Worker Manager triggers should be based on the time since the worker's launch, rather than the time the worker was requested from the cloud provider. That way, the timeout we set shouldn't have the negative effect of killing workers that have not had a reasonable amount of time to boot. It should only terminate workers that had plenty of time to boot but were not able to (i.e. bad images, not slow Azure).

Yarik, can you confirm that the 30 min timeout is measured from having a running, operational worker, rather than from the moment we first request a worker instance from Azure? If that is the case, I think we can point directly at bad machine images created by Mozilla rather than at Azure infrastructure issues.

Flags: needinfo?(ykurmyza)

Yarik, can you confirm that the 30 min timeout is measured from having a running, operational worker, rather than from the moment we first request a worker instance from Azure? If that is the case, I think we can point directly at bad machine images created by Mozilla rather than at Azure infrastructure issues.

The latter is actually the case: the 30 minutes are measured from the moment the worker was requested, not from when it fully launched.

Flags: needinfo?(ykurmyza) → needinfo?(pmoore)

OK, I'll create an issue so we can change that. Maybe in the purge loop we can look at the time the instance has been running, and kill ones that have been running over 30 mins but have not yet called queue.claimWork.
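
A minimal sketch of that idea, assuming hypothetical fields such as launchTime and firstClaimAt (not the existing worker-manager purge loop):

// Measure from the time the instance actually started running, not from when
// it was requested, and only kill instances that have been up for 30+ minutes
// without ever claiming work.
const MAX_UNCLAIMED_RUNTIME_MS = 30 * 60 * 1000;

interface RunningInstance {
  launchTime: Date;    // when the VM started running in Azure (hypothetical field)
  firstClaimAt?: Date; // set once the worker has called queue.claimWork (hypothetical)
}

function shouldPurge(instance: RunningInstance, now: Date): boolean {
  const runningForMs = now.getTime() - instance.launchTime.getTime();
  return instance.firstClaimAt === undefined && runningForMs > MAX_UNCLAIMED_RUNTIME_MS;
}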

Flags: needinfo?(pmoore)
Pushed by mcornmesser@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/6768b0588020 Upgrade TRUSTED 2022 images with optimized startup. r=relsre-reviewers,MasterWayZ
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Component: Workers → Relops: Azure
Product: Taskcluster → Infrastructure & Operations
Regressions: 1851027

(In reply to Pete Moore [:pmoore][:pete] from comment #9)

I think this could be the Windows images themselves performing some expensive first-boot initialisation (e.g. workers installing toolchains on first boot, rather than having the toolchains burned into the machine image). That would be an issue with the machine images created by Mozilla, rather than with Azure hardware being slow per se.

With the changes that landed this week, there has been a significant reduction in the time between VM start-up and worker-runner starting: it has gone from 30+ minutes to around 2 minutes.

Using gj8qrccqrz60aij/vm-au94c9yrrh2uqrsqasgrdguc1x4oxnq9ejv as an example.

First log:
Sep 01 11:20:32 gj8qrccqrz60aij.reddog.microsoft.com Microsoft-Windows-TerminalServices-LocalSessionManager Local multi-user session manager received system shutdown message

Name change to match the worker name (takes effect on the next start-up):
Sep 01 11:21:10 gj8qrccqrz60aij.reddog.microsoft.com MaintainSystem Check-AzVM-Name :: Name changed to vm-au94c9yrrh2uqrsqasgrdguc1x4oxnq9ejv

First worker-runner log:
Sep 01 11:22:16 gj8qrccqrz60aij.reddog.microsoft.com MaintainSystem StartWorkerRunner :: end - 2023-09-01T18:22:15.8625364Z
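
For reference, that is about 1 minute 44 seconds between the first log (11:20:32) and StartWorkerRunner finishing (11:22:16), in line with the roughly two-minute figure above.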
