Closed Bug 1645994 Opened 5 years ago Closed 5 years ago

worker pools t-win10-64 and t-win10-64-gpu-s often backlogged, tasks pending

Categories

(Infrastructure & Operations :: RelOps: Windows OS, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: grenade)

References

Details

Attachments

(2 files, 2 obsolete files)

The worker pools t-win10-64 and t-win10-64-gpu-s are often backlogged and tasks are pending (checked Try yesterday, which had tasks pending for 2+ hours):
https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&from=1592172000000&to=1592344799000

This hits again today, both mentioned pools with 600+ tasks pending. Kendall, can you get this investigated and tracked?

Flags: needinfo?(klibby)
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED

i believe this was caused by today's github dns outage, which caused our windows cloud image configuration to fail.
the simplest fix is to terminate idle workers, freeing up capacity for new workers to spawn (this only works if github is reachable from new ec2 instances).
i am running an idle worker termination script now and will rerun it again in the morning (utc+3).

Flags: needinfo?(klibby)

We have seen a backlog several times today (including right now).

Are the sizes of the pending Windows queues available, and can they be added to the Grafana dashboard? At the moment, it is unknown whether the load is too high or the throughput has degraded.

Flags: needinfo?(dhouse)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #4)

> Are the sizes of the pending Windows queues available, and can they be added to the Grafana dashboard? At the moment, it is unknown whether the load is too high or the throughput has degraded.

:aryx, I don't understand what is missing. Is this for Azure workers? I'll look to see where that is in the provisioners, and how to add it to the monitoring.

https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?var-provisioner=gecko-t&var-workerType=t-win10-64 shows the third graph ('Workers vs Active') as empty ('No data'). With data, it would allow checking whether the number of running machines is at the expected level. Thank you in advance.
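For context, the Taskcluster queue service exposes pending counts per worker type via its pendingTasks endpoint, which is one way such a Grafana series could be fed. A minimal sketch follows; the endpoint is part of the public queue API, the root URL matches the firefox-ci deployment referenced in this bug, and the polling wrapper itself is an assumption, not how the dashboard was actually wired up:

```python
# Sketch: read the pending-task count for a worker pool from the
# Taskcluster queue's pendingTasks endpoint. pending_url() is a pure
# helper; pending_count() performs the actual HTTP request.
import json
from urllib.request import urlopen

ROOT = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"

def pending_url(provisioner, worker_type, root=ROOT):
    """Build the pendingTasks endpoint URL for a provisioner/workerType pair."""
    return f"{root}/pending/{provisioner}/{worker_type}"

def pending_count(provisioner, worker_type):
    """Fetch the current pending-task count (requires network access)."""
    with urlopen(pending_url(provisioner, worker_type)) as resp:
        return json.load(resp)["pendingTasks"]

print(pending_url("gecko-t", "t-win10-64"))
# → https://firefox-ci-tc.services.mozilla.com/api/queue/v1/pending/gecko-t/t-win10-64
```

A collector would call `pending_count` on an interval and push the value into the metrics backend behind Grafana.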

We are again seeing a backlog for these Windows workers, now for more than 4 hours, and it has delayed the merge of several changesets from autoland to central.

Please have another look at this bug.

Flags: needinfo?(klibby)
Attached image image.png (obsolete) —

I've fixed the collection to now have t-win10-64 active/running counts

Flags: needinfo?(dhouse)

Thank you. Closing the bug, will reopen if this hits again.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(klibby)
Resolution: --- → WORKSFORME

The fix on my side was to separate out the gecko-t/ collection. The combined collection across "gecko-b gecko-t gecko-1 gecko-2 gecko-3 mobile-1 mobile-3 mobile-t" was not completing. Those pools have grown, and the taskcluster api calls are not as fast as they were before the db migration.

The issue is active again, and the active worker count is increasing relatively slowly (e.g. at 03:40 UTC it was growing at more than 2x the current rate).

Dave, it will likely go away when the worker count eventually catches up. Can you take a look at this next week?

Status: RESOLVED → REOPENED
Flags: needinfo?(dhouse)
Resolution: WORKSFORME → ---

Edit: or maybe Rob can take a look, as it's Windows.

Flags: needinfo?(rthijssen)

whenever we have idle windows workers taking up capacity, it affects our ability to spawn new windows workers. the simplest fix is to terminate all idle windows workers, which i achieve by running the script at: https://gist.github.com/grenade/63bf380b79b995065cb6530df34725c8#file-terminate-idle-instances-sh as a taskcluster task.

since i'm running this on sunday, while the backlog was identified on friday evening, the number of idle instances has already decreased naturally due to normal worker management terminations. the script did find some idle workers to terminate, though, and its output log is visible at: https://firefox-ci-tc.services.mozilla.com/tasks/IUklQcE1QKSuF2T6nRNPdQ/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FIUklQcE1QKSuF2T6nRNPdQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log

i only noticed this ni today, but anyone with appropriate scopes can trigger the same idle worker cleanup script by pasting the task definition into the task creator, updating the created and deadline properties, and running the task. alternatively, retrigger the task at: https://firefox-ci-tc.services.mozilla.com/tasks/IUklQcE1QKSuF2T6nRNPdQ
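The linked gist is a shell script that terminates idle EC2 instances; its core selection step can be sketched as follows. This is a minimal illustration with a hypothetical record shape and idle threshold, not the gist's actual implementation (which inspects real EC2 and worker state):

```python
# Sketch: pick workers that have been idle longer than a cutoff.
# The field names and the 1-hour threshold are assumptions for
# illustration; the real script derives idleness from EC2/worker state.
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(hours=1)  # assumed cutoff, not from the gist

def select_idle_workers(workers, now):
    """Return instance ids of workers idle longer than IDLE_THRESHOLD."""
    return [
        w["instance_id"]
        for w in workers
        if now - w["last_task_at"] > IDLE_THRESHOLD
    ]

now = datetime(2020, 6, 21, 12, 0)
workers = [
    {"instance_id": "i-aaa", "last_task_at": now - timedelta(hours=3)},
    {"instance_id": "i-bbb", "last_task_at": now - timedelta(minutes=10)},
]
print(select_idle_workers(workers, now))  # → ['i-aaa']
```

The selected instance ids would then be passed to a terminate call (e.g. the EC2 terminate-instances API), freeing capacity for the provisioner to spawn fresh workers.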

Flags: needinfo?(rthijssen)

Dustin, is there logging that provides insight into why the pool size for these workers remains so low? E.g. there are currently 78 active t-win10-64-gpu-s workers according to Grafana, while the max capacity is set to 512. Both queues have had a backlog for 5+ hours.

Flags: needinfo?(dustin)

I'd check the errors displayed in the worker-manager view. We've been getting a bunch of errors about no spot capacity being available for our own workers, so perhaps this instance type is affected as well.

Flags: needinfo?(dustin)

Every 1-3 hours, "Error calling AWS API: The image id '[ami-0c7fbac545e94f4c4]' does not exist" gets logged for win10-64.

For win10-64-gpu-s, there is "Error calling AWS API: We currently do not have sufficient g3s.xlarge capacity in the Availability Zone you requested (us-west-2a).", but that error is 10 hours old.

It looks like the worker pool errors pagination is wrong: while each page is sorted by date, the errors themselves are not sorted across pages, so if you click through to later pages you see this error actually happening every few minutes rather than every few hours.

I have filed https://github.com/taskcluster/taskcluster/issues/3238 for the taskcluster pagination issue. But I think the errors in comment 17 are the root cause of the slow provisioning; I see them being thrown on almost every provisioning loop in the logs.
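Until the upstream pagination fix lands, the ordering problem can be worked around client-side by fetching every page and sorting the combined list. A minimal sketch, assuming a hypothetical `fetch_page` callable and the error/continuation-token response shape commonly used by Taskcluster list endpoints:

```python
# Workaround sketch for the cross-page ordering issue described above:
# each page is internally sorted, but errors are not globally ordered,
# so gather all pages first and sort once at the end.
def collect_sorted_errors(fetch_page):
    """Fetch every page of worker-pool errors, then sort newest-first."""
    errors, token = [], None
    while True:
        page = fetch_page(token)  # assumed: {'errors': [...], 'continuationToken': ...}
        errors.extend(page["errors"])
        token = page.get("continuationToken")
        if not token:
            break
    return sorted(errors, key=lambda e: e["reported"], reverse=True)

# usage with stubbed pages standing in for the API
pages = {
    None: {"errors": [{"reported": "2020-06-22T10:00:00Z"}], "continuationToken": "p2"},
    "p2": {"errors": [{"reported": "2020-06-22T11:00:00Z"}]},
}
out = collect_sorted_errors(lambda t: pages[t])
print([e["reported"] for e in out])
# → ['2020-06-22T11:00:00Z', '2020-06-22T10:00:00Z']
```

Sorting on the ISO-8601 `reported` string works because that format sorts lexicographically in chronological order.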

Mark, are there alternatives to the current set of Windows cloud machine types being requested? At the moment, both queues have 1,000+ backlogged tasks accumulated over the last 3+ hours.

Flags: needinfo?(mcornmesser)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #19)

> Mark, are there alternatives to the current set of Windows cloud machine types being requested? At the moment, both queues have 1,000+ backlogged tasks accumulated over the last 3+ hours.

I don't believe there are alternatives in this case.

Flags: needinfo?(mcornmesser)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #17)

> Every 1-3 hours, "Error calling AWS API: The image id '[ami-0c7fbac545e94f4c4]' does not exist" gets logged for win10-64.

Nobody has addressed this part yet, so I had a look.

The ami is referenced here: https://hg.mozilla.org/ci/ci-configuration/annotate/145be2a0e9b585083943c37764f28d26c80b061f/worker-images.yml#l231

The reference hasn't changed since it was added in November.

Of the 4 AMIs referenced in that block, 2 are missing (eu-central-1 and us-east-1) and 2 are present (us-west-1 and us-west-2). The AMI name and description are identical between the AMIs in us-west-1 and us-west-2, so we may be able to get away with copying one of those AMIs to eu-central-1 and us-east-1 and updating the AMI ids.
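The repair described above amounts to: for each region missing the AMI, copy it from a region that still has it, then record the new ids in ci-config. A sketch of computing that copy plan, with illustrative region/AMI values (not taken from ci-configuration); the actual copy would be done with `aws ec2 copy-image` or an equivalent SDK call:

```python
# Sketch: given a per-region AMI map (None = missing), plan which
# regions need a copy and from which source AMI. Values here are
# illustrative placeholders, not the real AMI ids from this bug.
def plan_ami_copies(amis_by_region, source_region):
    """Return (dest_region, source_ami) pairs for regions lacking an AMI."""
    source_ami = amis_by_region[source_region]
    return [
        (region, source_ami)
        for region, ami in sorted(amis_by_region.items())
        if ami is None
    ]

amis = {
    "us-west-1": "ami-0aaa-example",   # present
    "us-west-2": "ami-0bbb-example",   # present
    "eu-central-1": None,              # missing
    "us-east-1": None,                 # missing
}
print(plan_ami_copies(amis, "us-west-2"))
# → [('eu-central-1', 'ami-0bbb-example'), ('us-east-1', 'ami-0bbb-example')]
```

Each planned pair maps to one cross-region copy; the copy returns a new region-local AMI id, which is what worker-images.yml would then reference.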

Why are they missing? Well, assuming the creation dates were the same for all regions, I blame me: https://bugzilla.mozilla.org/show_bug.cgi?id=1634789#c10

:grenade - can we copy the existing AMIs from a region where they exist and update ci-config with the new IDs, or is there region-specific info baked into the AMIs?

Flags: needinfo?(rthijssen)
Attachment #9166921 - Attachment is obsolete: true

(In reply to Chris Cooper [:coop] pronoun: he from comment #21)

> :grenade - can we copy the existing AMIs from a region where they exist and update ci-config with the new IDs, or is there region-specific info baked into the AMIs?

yes, copying will work fine. there is no region-specific information baked into the existing amis.

Flags: needinfo?(rthijssen)
Pushed by jwood@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/6bcc32611fa2
replace all missing AMIs in eu-central-1 and us-east-1 r=Callek

Missing out on us-east-1 for Windows test capacity was probably the biggest factor in the pending counts here. Now that the fix is deployed, let's give it a few days and see if it gets better.

Flags: needinfo?(dhouse)
Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED