1422184 - Long pending queues for gpu instances November 30, 2017

Assignee

Description

7 years ago

16:10:01 <Aryx> hi, can anybody check what's up with the gpu backlog?
16:10:35 <Aryx> 4300 pending vs 230 running for win7, 657 vs 500 for win10
16:24:29 <Aryx> !t-rex: can anybody take a look at this, please?
...
16:26:54 <@dustin> I wonder if we're losing another region
16:27:01 <Aryx> win10 is at maximum capacity, not sure if two different issues (e.g. machine gets into a bad state)

Brian Stack [:bstack]

Assignee

Comment 1

7 years ago

Interesting logs found by jonas:

https://papertrailapp.com/systems/ec2-manager/events?q=%22finished%20inserting%20spot%20request%20into%20database%20(workerType%3Dgecko-t-win7-32-gpu%2C%22&focus=873040289345454131

Jonas Finnemann Jensen (:jonasfj)

Comment 2

7 years ago

Scrolling through:
https://papertrailapp.com/systems/ec2-manager/events?q=%22finished%20inserting%20spot%20request%20into%20database%20(workerType%3Dgecko-t-win7-32-gpu%2C%22&focus=873040182772383815

I see pages of:
   workerType=gecko-t-win7-32-gpu, region=eu-central-1, az=eu-central-1a, instanceType=g2.2xlarge, imageId=ami-8ef54fe1, id=sir-4svij58p, state=open, status=pending-evaluation

@jhford, is there no round-robin across regions at all?

Even if we want to optimize for spot price, we could still balance the requests, so that the cheapest region gets 5 requests for every 1 request the second cheapest gets.
Or some scheme like this: order the regions by price and weight them with an exponentially decreasing series, so the cheapest region gets the largest weight.
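
A minimal sketch of the weighting idea above, not ec2-manager's actual logic. Regions are ranked by spot price and each rank gets an integer weight that shrinks by a fixed ratio (5 here, matching the 5-to-1 example). The function name `allocate_requests`, the `ratio` parameter, the region names, and the prices are all hypothetical.

```python
from typing import Dict


def allocate_requests(prices: Dict[str, float], total: int, ratio: int = 5) -> Dict[str, int]:
    """Split `total` spot requests across regions, favoring cheaper ones.

    The cheapest region gets `ratio` requests for every one request the
    next cheapest region gets, and so on down the ranking.
    """
    if total < 0 or ratio < 1:
        raise ValueError("total must be >= 0 and ratio must be >= 1")
    if not prices:
        return {}

    # Rank regions from cheapest to most expensive (name breaks price ties).
    ranked = sorted(prices, key=lambda region: (prices[region], region))
    n = len(ranked)

    # Integer weights: ratio^(n-1) for the cheapest region down to 1.
    weights = [ratio ** (n - 1 - rank) for rank in range(n)]
    weight_sum = sum(weights)

    # Proportional shares, rounded down, using exact integer arithmetic.
    counts = [total * w // weight_sum for w in weights]
    remainders = [total * w % weight_sum for w in weights]

    # Hand out the requests lost to rounding, largest remainder first.
    leftover = total - sum(counts)
    order = sorted(range(n), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        counts[i] += 1

    return dict(zip(ranked, counts))


if __name__ == "__main__":
    # Hypothetical prices, for illustration only.
    example_prices = {"region-a": 0.20, "region-b": 0.25, "region-c": 0.30}
    print(allocate_requests(example_prices, 31))
```

With three regions and 31 requests, the split is 25, 5 and 1. Even the most expensive region keeps receiving some requests, so a stuck region cannot absorb the whole batch.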

Jonas Finnemann Jensen (:jonasfj)

Updated

7 years ago

Flags: needinfo?(jhford)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 3

7 years ago

Closed trees for this: Windows gl + gpu instances + Windows 7 pgo reftest jobs don't run

@Taskcluster people: If you think that the trees can be reopened, please let the people in #sheriffs know. The Softvision sheriffs on duty haven't had much contact with tc infra issues yet, so a helping hand is welcome. Thank you in advance.

Brian Stack [:bstack]

Assignee

Comment 4

7 years ago

16:43:06 <@dustin> ok
16:44:56 <@dustin> region removed, only for gecko-t-win7-32-gpu
16:45:24 <@dustin> and all gecko-t-win10-64-gpu
16:45:31 <@dustin> terminated

Brian Stack [:bstack]

Assignee

Comment 5

7 years ago

17:00:14 <@dustin> pending #'s seem to be trending downish

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

7 years ago

As we expected last night, the pending count for gecko-t-win10-64-gpu was back up just now, with 500 instances running. I killed all the running instances, so they should start to re-provision. I filed bug 1422295 to hand that particular issue off to releng.

The broader issue here seems to have something to do with AWS's changes this week -- bug 1422301 for that.

I'll leave this open for the needinfo.

Jonas Finnemann Jensen (:jonasfj)

Comment 7

7 years ago

@jhford,
from: https://aws.amazon.com/blogs/aws/amazon-ec2-update-streamlined-access-to-spot-capacity-smooth-price-changes-instance-hibernation/
> As part of today’s launch we are also changing the way that Spot prices change, moving to a model where prices adjust more gradually, based on longer-term trends in supply and demand.

Hence, the price won't change when a region runs out of capacity.
I suspect that's why we're getting hit so hard.

dustin suggests that we need to get the "capacity-oversubscribed" signal back into the provisioning logic,
so that it stops provisioning in oversubscribed regions.


In this light, I think a weighted preference of regions based on price is less important.
It could still be a decent tool to ensure that regions with almost the same price (relative to max price) are
allocated approximately the same number of machines, making us more resilient. But it's probably most important
to handle the "capacity-oversubscribed" error.
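
A rough sketch of what feeding that signal back could look like, under stated assumptions: the provisioner sees the status code of each spot request and stops picking a pool (availability zone plus instance type) for a cooldown period after a capacity status. The class `CapacityBackoff`, its methods, and the 15-minute cooldown are hypothetical, not ec2-manager code. The two status strings are the ones quoted in this bug.

```python
from typing import Dict, Iterable, List, Tuple

# Spot request statuses treated as "no capacity here right now".
CAPACITY_STATUSES = {"capacity-oversubscribed", "capacity-not-available"}

# A pool is identified by (availability zone, instance type).
Pool = Tuple[str, str]


class CapacityBackoff:
    """Remember pools that reported a capacity problem and skip them for a while."""

    def __init__(self, cooldown_seconds: float = 900.0) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._blocked_until: Dict[Pool, float] = {}

    def record_status(self, pool: Pool, status: str, now: float) -> None:
        """Record the status of a spot request observed at time `now` (seconds)."""
        if status in CAPACITY_STATUSES:
            self._blocked_until[pool] = now + self.cooldown_seconds

    def usable(self, pools: Iterable[Pool], now: float) -> List[Pool]:
        """Return the pools that are not in their cooldown window at time `now`."""
        return [p for p in pools if self._blocked_until.get(p, 0.0) <= now]


if __name__ == "__main__":
    backoff = CapacityBackoff()
    stuck = ("eu-central-1a", "g2.2xlarge")
    other = ("us-east-1a", "g2.2xlarge")  # hypothetical second pool
    backoff.record_status(stuck, "capacity-oversubscribed", now=0.0)
    print(backoff.usable([stuck, other], now=60.0))
```

Passing `now` explicitly keeps the sketch deterministic. A real provisioner would more likely read a clock and persist the blocked pools alongside its spot request records.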

Jonas Finnemann Jensen (:jonasfj)

Comment 8

7 years ago

Oh, and something fun EC2 says: capacity-not-available

John Ford [:jhford] CET/CEST Berlin Time

Comment 9

7 years ago

Yes, I'm happy to work on that. The work was de-prioritized over the last few quarters because we weren't sure when the changes to the EC2 API were going to land, or exactly what the changes were going to be. I think this is a good thing to work on for the rest of the year and in Q1.

Specifically: dealing with the new pricing model, getting feedback from EC2, using the new InstanceMarket (or whatever they're calling that option), and adding support for runInstances.
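
For illustration only, and assuming the option meant here is the `InstanceMarketOptions` parameter of the EC2 RunInstances API: a sketch of the request parameters for launching a spot instance that way, shaped as the keyword arguments boto3's `run_instances` accepts. The helper name `build_spot_run_instances_params` and the max price are hypothetical. The image, instance type and zone are taken from the log line quoted earlier in this bug.

```python
from typing import Any, Dict


def build_spot_run_instances_params(
    image_id: str,
    instance_type: str,
    availability_zone: str,
    max_price: str,
    count: int = 1,
) -> Dict[str, Any]:
    """Build RunInstances parameters that request spot capacity directly."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"AvailabilityZone": availability_zone},
        # Ask for spot capacity instead of on-demand.
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {
                "MaxPrice": max_price,  # the API takes the price as a string
                "SpotInstanceType": "one-time",
            },
        },
    }


if __name__ == "__main__":
    params = build_spot_run_instances_params(
        image_id="ami-8ef54fe1",
        instance_type="g2.2xlarge",
        availability_zone="eu-central-1a",
        max_price="0.50",  # hypothetical max price
    )
    print(params)
```

The dict would be passed as `boto3.client("ec2", region_name="eu-central-1").run_instances(**params)`. That call is not made here because it needs AWS credentials and launches real instances.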

Flags: needinfo?(jhford)

Brian Stack [:bstack]

Assignee

Updated

6 years ago

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

5 years ago

Component: Operations → Operations and Service Requests

Categories

(Taskcluster :: Operations and Service Requests, task)

Tracking

(Not tracked)

People

(Reporter: bstack, Assigned: bstack)

Security

(public)
