Closed Bug 1739191 Opened 4 years ago Closed 3 years ago

gecko-t/t-win7-32 workers not getting provisioned, high count of provisioned workers not taking tasks, no Windows 7 test coverage

Categories

(Release Engineering :: Firefox-CI Administration, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: grenade)

Attachments

(2 files)

https://firefox-ci-tc.services.mozilla.com/worker-manager says we have 775 gecko-t/t-win7-32 instances running, but https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/t-win7-32 lists only 12. These tasks are also backlogged on central: https://treeherder.mozilla.org/jobs?repo=mozilla-central&resultStatus=usercancel%2Crunning%2Cpending%2Crunnable

AWS reports an error:
"Error calling AWS API: We currently do not have sufficient c4.2xlarge capacity in the Availability Zone you requested (eu-central-1a). Our system will be working on provisioning additional capacity. You can currently get c4.2xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1b, eu-central-1c."
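For triage, the instance type, the exhausted zone, and the suggested alternatives can be pulled out of that message mechanically. The following is only a sketch: the function name `parse_capacity_error` is hypothetical, and the regular expressions assume the message keeps the exact wording quoted above, which AWS does not guarantee.

```python
import re

# Error text copied verbatim from the AWS response quoted above.
ERROR = (
    "Error calling AWS API: We currently do not have sufficient c4.2xlarge "
    "capacity in the Availability Zone you requested (eu-central-1a). Our "
    "system will be working on provisioning additional capacity. You can "
    "currently get c4.2xlarge capacity by not specifying an Availability Zone "
    "in your request or choosing eu-central-1b, eu-central-1c."
)


def parse_capacity_error(message):
    """Hypothetical helper: return (instance_type, exhausted_zone, alternatives).

    Returns None when the message does not match the assumed wording.
    """
    requested = re.search(
        r"sufficient (\S+) capacity in the Availability Zone you requested "
        r"\(([^)]+)\)",
        message,
    )
    if requested is None:
        return None
    instance_type, exhausted_zone = requested.group(1), requested.group(2)

    # The alternatives are the comma-separated list after "or choosing".
    suggested = re.search(r"or choosing (.+?)\.?$", message)
    alternatives = (
        [zone.strip() for zone in suggested.group(1).split(",")]
        if suggested
        else []
    )
    return instance_type, exhausted_zone, alternatives


result = parse_capacity_error(ERROR)
print(result)
```

Returning `None` on a mismatch, rather than raising, keeps the helper safe to run over unrelated provisioning errors.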

This might be a shutdown issue similar to bug 1736329.

jmaher reduced the limits of some worker pools with few users in https://phabricator.services.mozilla.com/D130290

:jwhitlock - does this look like bug 1736329 in the logs?

Flags: needinfo?(jwhitlock)

Yes. Looking at monitor.periodic, the worker scanner has hit the 100-minute timeout a few times this week, including 5 times in the last 24 hours. When a worker self-terminates, it has to be scanned again by the worker scanner before a replacement will be provisioned.

Here are the start times and durations of the worker scanner runs from the last 24 hours:

Start Time                      Duration (minutes)
2021-11-03T21:20:21.356711420Z   56
2021-11-03T20:23:34.360067035Z  100
2021-11-03T18:43:14.357124360Z   55
2021-11-03T17:48:13.923863522Z  100
2021-11-03T16:07:53.891333617Z  100
2021-11-03T14:27:33.881938788Z   31
2021-11-03T13:56:20.183195949Z   96
2021-11-03T12:19:49.243386805Z   85
2021-11-03T10:54:03.856020774Z   33
2021-11-03T10:21:11.805882046Z  100
2021-11-03T08:40:51.797861436Z   68
2021-11-03T07:32:36.951586529Z   97
2021-11-03T05:54:47.181296143Z  100
2021-11-03T04:14:27.173447369Z   47
2021-11-03T03:26:52.467974893Z   39
2021-11-03T02:47:42.783643674Z   80
2021-11-03T01:27:17.954403810Z   82
2021-11-03T00:04:29.053830418Z   49
2021-11-02T23:15:08.568646433Z   65

None of these are close to the "safe" value of 5 minutes or less.
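The counts can be checked directly from the table above. This is a minimal sketch: the constant names are made up for illustration, and the 100-minute timeout and 5-minute "safe" threshold are taken from the comments in this bug, not read from any worker-manager configuration.

```python
# Rows copied from the table above: start time, duration in minutes.
RUNS = """\
2021-11-03T21:20:21.356711420Z 56
2021-11-03T20:23:34.360067035Z 100
2021-11-03T18:43:14.357124360Z 55
2021-11-03T17:48:13.923863522Z 100
2021-11-03T16:07:53.891333617Z 100
2021-11-03T14:27:33.881938788Z 31
2021-11-03T13:56:20.183195949Z 96
2021-11-03T12:19:49.243386805Z 85
2021-11-03T10:54:03.856020774Z 33
2021-11-03T10:21:11.805882046Z 100
2021-11-03T08:40:51.797861436Z 68
2021-11-03T07:32:36.951586529Z 97
2021-11-03T05:54:47.181296143Z 100
2021-11-03T04:14:27.173447369Z 47
2021-11-03T03:26:52.467974893Z 39
2021-11-03T02:47:42.783643674Z 80
2021-11-03T01:27:17.954403810Z 82
2021-11-03T00:04:29.053830418Z 49
2021-11-02T23:15:08.568646433Z 65
"""

TIMEOUT_MINUTES = 100  # scanner timeout mentioned in this bug
SAFE_MINUTES = 5       # "safe" loop duration mentioned in this bug

# The second whitespace-separated field on each row is the duration.
durations = [int(line.split()[1]) for line in RUNS.splitlines() if line.strip()]

timed_out = sum(d >= TIMEOUT_MINUTES for d in durations)
safe = sum(d <= SAFE_MINUTES for d in durations)
shortest = min(durations)

print(f"runs: {len(durations)}")
print(f"hit the timeout: {timed_out}")
print(f"within the safe value: {safe}")
print(f"shortest run (minutes): {shortest}")
```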

Flags: needinfo?(jwhitlock)

We have 2 workers running, both claimed 2 days ago. https://firefox-ci-tc.services.mozilla.com/worker-manager still lists 256 machines. Is this a new issue because the machines don't get removed during worker scanner iterations?

Attached image 7-32-fail.png

It looks like instances are failing to fully initialize. I am digging into this to see if I can find a reason why.

It also looks like the majority of instances are spinning up and shutting down within a couple of minutes.

Priority: -- → P1
Summary: gecko-t/t-win7-32 workers not getting provisioned, high count of provisioned workers not taking tasks → gecko-t/t-win7-32 workers not getting provisioned, high count of provisioned workers not taking tasks, no Windows 7 test coverage

Update: I have not made any progress on this. It does seem like the latest AMIs may be the issue. The Win 7 instances are continuously spinning up and shutting down, and no logs are generated in the process. I tried to manually launch an instance using ami-09db78425fce79b72. It seems like it never fully initializes.

There is one instance up and running and passing tests, i-099df46cc32d0053e. This instance is based on ami-04a4865797a61b24d, which was removed from ci-config in https://phabricator.services.mozilla.com/D130164. We may want to revert the win7-32 changes that were made in that patch. The downside is that it would have the old HG fingerprint, but we might be able to get some tests up and running and buy ourselves some time to figure out what happened with the latest AMI creation.

Assignee: nobody → mcornmesser
Status: NEW → ASSIGNED
Pushed by mcornmesser@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/c7d4389d13fe revert to older win 7-32 AMI. r=releng-reviewers,jmaher

rob: Just FYI, reverting to the older AMIs seems to have done the trick. I wasn't able to find a reason why instances from the newer AMIs were failing to start and pick up tasks.

Flags: needinfo?(rthijssen)
Assignee: mcornmesser → rthijssen

Closing, as this was resolved by reverting the Win 7 AMIs.

Flags: needinfo?(rthijssen)

Marking this as fixed as the AMIs now work.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED