gecko-t/t-win7-32 workers not getting provisioned, high count of provisioned workers not taking tasks, no Windows 7 test coverage
Categories
(Release Engineering :: Firefox-CI Administration, defect, P1)
Tracking
(Not tracked)
People
(Reporter: aryx, Assigned: grenade)
Attachments
(2 files)
https://firefox-ci-tc.services.mozilla.com/worker-manager says we have 775 gecko-t/t-win7-32 instances running, but https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/t-win7-32 only lists 12. These tasks are also backlogged on central: https://treeherder.mozilla.org/jobs?repo=mozilla-central&resultStatus=usercancel%2Crunning%2Cpending%2Crunnable
AWS reports an error:
"Error calling AWS API: We currently do not have sufficient c4.2xlarge capacity in the Availability Zone you requested (eu-central-1a). Our system will be working on provisioning additional capacity. You can currently get c4.2xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1b, eu-central-1c."
This might be a shutdown issue similar to bug 1736329.
jmaher reduced the limits of some worker pools with few users in https://phabricator.services.mozilla.com/D130290
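For triage, the AWS capacity error quoted above can be picked apart to see which instance type and zone were refused and which zones AWS suggests instead. The following is only a minimal sketch: the function name `parse_capacity_error` and the simplified zone pattern are illustrative assumptions, not part of worker-manager or any AWS SDK, and the message wording is taken from this one report.

```python
import re

# Simplified availability-zone pattern (assumption): matches names such as
# "eu-central-1b"; it does not cover every AWS partition naming scheme.
AZ_PATTERN = re.compile(r"\b[a-z]{2}-[a-z]+-\d[a-z]\b")

# Error text as quoted in this report.
SAMPLE = (
    "Error calling AWS API: We currently do not have sufficient c4.2xlarge "
    "capacity in the Availability Zone you requested (eu-central-1a). Our "
    "system will be working on provisioning additional capacity. You can "
    "currently get c4.2xlarge capacity by not specifying an Availability "
    "Zone in your request or choosing eu-central-1b, eu-central-1c."
)


def parse_capacity_error(message):
    """Hypothetical helper: extract the instance type, the refused zone,
    and the suggested alternative zones from a capacity error message."""
    instance_type = re.search(r"sufficient (\S+) capacity", message)
    requested = re.search(r"requested \(([^)]+)\)", message)
    # Only look for alternatives in the text that follows "choosing".
    _, _, tail = message.partition("choosing")
    return {
        "instance_type": instance_type.group(1) if instance_type else None,
        "requested_zone": requested.group(1) if requested else None,
        "alternative_zones": AZ_PATTERN.findall(tail),
    }


result = parse_capacity_error(SAMPLE)
print(result)
```

Applied to the message in this report, the sketch reports c4.2xlarge as the instance type, eu-central-1a as the refused zone, and eu-central-1b and eu-central-1c as the alternatives.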
Comment 1•4 years ago
:jwhitlock - does this look like bug 1736329 in the logs?
Comment 2•4 years ago
Yes. Looking at monitor.periodic, it looks like worker scanner has hit the 100-minute timeout a few times this week, 5 times in the last 24 hours. When a worker self-terminates, it has to be scanned again by worker scanner before a replacement will be provisioned.
Here are the start times and durations of the worker scanner runs over the last 24 hours:
Start Time | Duration (minutes) |
---|---|
2021-11-03T21:20:21.356711420Z | 56 |
2021-11-03T20:23:34.360067035Z | 100 |
2021-11-03T18:43:14.357124360Z | 55 |
2021-11-03T17:48:13.923863522Z | 100 |
2021-11-03T16:07:53.891333617Z | 100 |
2021-11-03T14:27:33.881938788Z | 31 |
2021-11-03T13:56:20.183195949Z | 96 |
2021-11-03T12:19:49.243386805Z | 85 |
2021-11-03T10:54:03.856020774Z | 33 |
2021-11-03T10:21:11.805882046Z | 100 |
2021-11-03T08:40:51.797861436Z | 68 |
2021-11-03T07:32:36.951586529Z | 97 |
2021-11-03T05:54:47.181296143Z | 100 |
2021-11-03T04:14:27.173447369Z | 47 |
2021-11-03T03:26:52.467974893Z | 39 |
2021-11-03T02:47:42.783643674Z | 80 |
2021-11-03T01:27:17.954403810Z | 82 |
2021-11-03T00:04:29.053830418Z | 49 |
2021-11-02T23:15:08.568646433Z | 65 |
None of these are close to the "safe" value of 5 minutes or less.
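The figures in the table can be checked with a few lines of Python. This is a minimal sketch: the durations are copied from the table above, the 100-minute timeout and 5-minute "safe" value come from this comment, and the variable names are illustrative only.

```python
# Worker scanner run durations in minutes, copied from the table above
# (newest first).
durations = [56, 100, 55, 100, 100, 31, 96, 85, 33, 100,
             68, 97, 100, 47, 39, 80, 82, 49, 65]

TIMEOUT_MINUTES = 100  # worker scanner timeout mentioned in the comment
SAFE_MINUTES = 5       # "safe" iteration length mentioned in the comment

# Runs that ran into the timeout.
timed_out = sum(1 for d in durations if d >= TIMEOUT_MINUTES)
# Runs that finished within the safe window.
within_safe = sum(1 for d in durations if d <= SAFE_MINUTES)
shortest = min(durations)

print(f"runs: {len(durations)}")
print(f"hit the timeout: {timed_out}")
print(f"within safe window: {within_safe}")
print(f"shortest run: {shortest} minutes")
```

On this data, 5 of the 19 runs hit the timeout, none finished within 5 minutes, and the shortest run took 31 minutes.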
Reporter
Comment 3•4 years ago
We have 2 workers running, both claimed 2 days ago. https://firefox-ci-tc.services.mozilla.com/worker-manager still lists 256 machines. Is this a new issue because the machines don't get removed during worker scanner iterations?
Comment 4•4 years ago
It looks like instances are failing to fully initialize. I am diving into this to see if I can find a reason why.
Comment 5•4 years ago
Also, it looks like the majority of instances are spinning up and shutting down within a couple of minutes.
Reporter
Updated•4 years ago
Comment 6•4 years ago
Update: I have not made any progress on this. It does seem like the latest AMIs may be the issue. The win 7 instances are continuously spinning up and shutting down, and no logs are generated in the process. I tried to manually launch an instance using ami-09db78425fce79b72, and it seems to never fully initialize.
There is one instance up and running and passing tests, i-099df46cc32d0053e. This instance is based on ami-04a4865797a61b24d, which was removed from ci-config in https://phabricator.services.mozilla.com/D130164. We may want to revert the win7-32 changes made in that patch. The downside is that the instances would have the old HG fingerprint, but we might be able to get some tests running and buy ourselves some time to figure out what went wrong with the latest AMI builds.
Comment 7•4 years ago
Updated•4 years ago
Comment 9•4 years ago
rob: Just FYI, reverting to the older AMIs seems to have done the trick. I wasn't able to find a reason why instances from the newer AMIs were failing to start and pick up tasks.
Comment hidden (Intermittent Failures Robot)
Comment 11•4 years ago
Another way to view this data is in Grafana:
https://earthangel-b40313e5.influxcloud.net/d/kqryhOpWk/taskcluster-visuals
Updated•3 years ago
Assignee
Comment 12•3 years ago
Closing, as this was resolved by reverting the win7 AMIs.
Comment 13•3 years ago
Marking this as fixed as the AMIs now work.