Per bug 1422184, as of about 12 hours ago we have seen gecko-t-win10-64-gpu with 500 instances provisioned (the maximum) but a rapidly growing pending count. Looking at some of the workers, they've all died while running their last job, which was resolved as claim-expired. So something is crashing or hanging these hosts, but they are not terminating and are thus sitting idle. We have determined that it is possible to limp along by occasionally terminating all of the workers of a given workerType and letting the provisioner re-provision them. This is manual intervention and not very efficient, though. I'm happy to give terminate-all permission to anyone who needs it to continue this pattern.
Related to https://bugzilla.mozilla.org/show_bug.cgi?id=1372172? Rob posted a script in comment 12 (https://bugzilla.mozilla.org/show_bug.cgi?id=1372172#c12) of that bug that might be useful if the cause is the same (impaired instances).
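For anyone unfamiliar with what "impaired instances" means here, the following is a minimal sketch of the selection step only, not Rob's script from comment 12. It assumes the input is shaped like an EC2 DescribeInstanceStatus response; the function name and the sample instance IDs are hypothetical.

```python
def impaired_instance_ids(status_response):
    """Return IDs of running instances with an 'impaired' status check.

    Assumes `status_response` is shaped like an EC2 DescribeInstanceStatus
    response (a dict with an "InstanceStatuses" list).
    """
    ids = []
    for entry in status_response.get("InstanceStatuses", []):
        running = entry.get("InstanceState", {}).get("Name") == "running"
        # Each instance reports an instance-level and a system-level check.
        checks = (
            entry.get("InstanceStatus", {}).get("Status"),
            entry.get("SystemStatus", {}).get("Status"),
        )
        if running and "impaired" in checks:
            ids.append(entry["InstanceId"])
    return ids


# Hypothetical sample data; the instance IDs are made up for illustration.
sample = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0aaa",
            "InstanceState": {"Name": "running"},
            "InstanceStatus": {"Status": "impaired"},
            "SystemStatus": {"Status": "ok"},
        },
        {
            "InstanceId": "i-0bbb",
            "InstanceState": {"Name": "running"},
            "InstanceStatus": {"Status": "ok"},
            "SystemStatus": {"Status": "ok"},
        },
        {
            "InstanceId": "i-0ccc",
            "InstanceState": {"Name": "running"},
            "InstanceStatus": {"Status": "ok"},
            "SystemStatus": {"Status": "impaired"},
        },
    ]
}

impaired = impaired_instance_ids(sample)
print(impaired)
```

With boto3, the response would presumably come from `ec2.describe_instance_status()` and the resulting IDs could be passed to `ec2.terminate_instances(InstanceIds=...)`; pagination, region handling, and filtering by workerType are left out of this sketch.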
Hypothesis - the script in https://bugzilla.mozilla.org/show_bug.cgi?id=1372172#c12 is running in a cron somewhere under a superuser aws account. Since we disabled superuser accounts yesterday, that probably broke. Jonas, can we reenable the superuser accounts until grenade is back from PTO?
Both grenade and markco use the script, running it from their laptops. Rob thought he had potentially found a fix while working on the OS theme issue (1343049?), but I'm not sure if that actually landed or if it just didn't quite work out.
I found and fixed the permissions issue and am running the cron script successfully again.