Closed Bug 1422184 Opened 7 years ago Closed 6 years ago

Long pending queues for gpu instances November 30, 2017

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bstack, Assigned: bstack)

Details

16:10:01 <Aryx> hi, can anybody check what's up with the gpu backlog?
16:10:35 <Aryx> 4300 pending vs 230 running for win7, 657 vs 500 for win10
16:24:29 <Aryx> !t-rex: can anybody take a look at this, please?
...
16:26:54 <@dustin> I wonder if we're losing another region
16:27:01 <Aryx> win10 is at maximum capacity, not sure if two different issues (e.g. machine gets into a bad state)
Scrolling through:
https://papertrailapp.com/systems/ec2-manager/events?q=%22finished%20inserting%20spot%20request%20into%20database%20(workerType%3Dgecko-t-win7-32-gpu%2C%22&focus=873040182772383815

I see pages of:
   workerType=gecko-t-win7-32-gpu, region=eu-central-1, az=eu-central-1a, instanceType=g2.2xlarge, imageId=ami-8ef54fe1, id=sir-4svij58p, state=open, status=pending-evaluation

@jhford, is there zero round-robin?

Even if we want to optimize for spot price, we could still balance the requests, so that the cheapest region gets 5 requests for every 1 request the second cheapest gets.
Or some scheme like this: use an exponentially decreasing series to spread machines across all regions, with the largest weight on the cheapest region.
Flags: needinfo?(jhford)
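A minimal sketch of the weighting scheme proposed above, in Python. This is an illustration only, not how ec2-manager or the provisioner actually works: the function name `allocate_requests`, the `ratio` parameter, and the example prices are all hypothetical. It ranks regions by spot price and gives each region `ratio` times as many requests as the next more expensive one (5:1 by default).

```python
def allocate_requests(prices, total, ratio=5):
    """Split `total` spot requests across regions, favouring cheap ones.

    `prices` maps region name -> current spot price (hypothetical input).
    Each region gets `ratio` times the share of the next more expensive
    region, so the weights form a geometric series.
    """
    if not prices or total <= 0:
        return {region: 0 for region in prices}

    # Cheapest region first.
    ranked = sorted(prices, key=prices.get)
    n = len(ranked)

    # Integer weights, e.g. 25, 5, 1 for three regions and ratio=5.
    weights = [ratio ** (n - 1 - i) for i in range(n)]
    weight_sum = sum(weights)

    # Integer division keeps the arithmetic exact; it may leave a remainder.
    counts = [total * w // weight_sum for w in weights]

    # Hand any leftover requests out one at a time, cheapest region first.
    leftover = total - sum(counts)
    for i in range(leftover):
        counts[i % n] += 1

    return dict(zip(ranked, counts))


if __name__ == "__main__":
    # Made-up prices, for illustration only.
    example = {"us-east-1": 0.30, "us-west-2": 0.25, "eu-central-1": 0.40}
    print(allocate_requests(example, 31))
```

Even the most expensive region still receives a small share, so a capacity problem in the cheapest region would not starve the worker type completely.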
Closed trees for this: Windows gl + gpu instances + Windows 7 pgo reftest jobs don't run

@Taskcluster people: If you think that the trees can be reopened, please let the people in #sheriffs know. There are Softvision sheriffs on duty who haven't had much contact with tc infra issues yet, so a helping hand is welcome. Thank you in advance.
16:43:06 <@dustin> ok
16:44:56 <@dustin> region removed, only for gecko-t-win7-32-gpu
16:45:24 <@dustin> and all gecko-t-win10-64-gpu
16:45:31 <@dustin> terminated
17:00:14 <@dustin> pending #'s seem to be trending downish
As we expected last night, the pending for gecko-t-win10-64-gpu was back up just now, with 500 instances running.  I killed all the running instances, so they should start to re-provision.  I filed bug 1422295 to hand that particular issue off to releng.

The broader issue here seems to have something to do with AWS's changes this week -- bug 1422301 for that.

I'll leave this open for the needinfo.
@jhford,
from: https://aws.amazon.com/blogs/aws/amazon-ec2-update-streamlined-access-to-spot-capacity-smooth-price-changes-instance-hibernation/
> As part of today’s launch we are also changing the way that Spot prices change, moving to a model where prices adjust more gradually, based on longer-term trends in supply and demand.

hence, the price won't change when there is no more capacity in a region.
I suspect that's why we're getting hit so hard.

dustin suggests that we need to get the "capacity-oversubscribed" signal back into the provisioning logic,
so that it stops provisioning in oversubscribed regions.
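
A rough sketch of what feeding that signal back could look like, in Python. Everything here is an assumption for illustration: the function name `usable_regions`, the `threshold` parameter, and the record shape (dicts with `region` and `status` keys, mirroring the fields in the papertrail lines above) are hypothetical and not taken from ec2-manager. The status codes "capacity-oversubscribed" and "capacity-not-available" are the ones mentioned in this bug.

```python
from collections import Counter

# Spot request status codes that indicate a region has no room for us.
CAPACITY_STATUSES = {"capacity-oversubscribed", "capacity-not-available"}


def usable_regions(regions, recent_requests, threshold=3):
    """Return the regions that are still worth provisioning in.

    `recent_requests` is a list of dicts with "region" and "status" keys
    (hypothetical shape). A region is skipped once it has accumulated
    `threshold` or more capacity-related rejections.
    """
    rejections = Counter(
        req["region"]
        for req in recent_requests
        if req["status"] in CAPACITY_STATUSES
    )
    # Counter returns 0 for regions with no rejections at all.
    return [region for region in regions if rejections[region] < threshold]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    recent = [
        {"region": "eu-central-1", "status": "capacity-oversubscribed"},
        {"region": "eu-central-1", "status": "capacity-oversubscribed"},
        {"region": "eu-central-1", "status": "capacity-not-available"},
        {"region": "us-east-1", "status": "fulfilled"},
    ]
    print(usable_regions(["eu-central-1", "us-east-1"], recent))
```

A real version would presumably only count rejections from a recent time window, so that a region is retried once capacity comes back.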


In this light, I think weighted preference of regions based on price is less important.
It could still be a decent tool to ensure that regions with almost the same price (relative to max price) are
allocated approximately the same number of machines, making us more resilient. But it's probably most important
to handle the "capacity-oversubscribed" error.
Oh, and something fun EC2 says: capacity-not-available
Yes, I'm happy to work on that, but the work was de-prioritized over the last quarters because we weren't sure when the changes to the EC2 API were going to land, or what exactly the changes were going to be.  I think this is a good thing to work on for the rest of the year and in Q1.

Specifically, that means dealing with the new pricing model, getting feedback from EC2, using the new InstanceMarket option (or whatever they're calling it), and supporting runInstances.
Flags: needinfo?(jhford)
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Operations → Operations and Service Requests