Closed
Bug 1422184
Opened 7 years ago
Closed 6 years ago
Long pending queues for gpu instances (November 30, 2017)
Categories
(Taskcluster :: Operations and Service Requests, task)
RESOLVED
FIXED
People
(Reporter: bstack, Assigned: bstack)
16:10:01 <Aryx> hi, can anybody check what's up with the gpu backlog?
16:10:35 <Aryx> 4300 pending vs 230 running for win7, 657 vs 500 for win10
16:24:29 <Aryx> !t-rex: can anybody take a look at this, please?
...
16:26:54 <@dustin> I wonder if we're losing another region
16:27:01 <Aryx> win10 is at maximum capacity, not sure if two different issues (e.g. machine gets into a bad state)
Comment 1•7 years ago
Interesting logs found by jonas: https://papertrailapp.com/systems/ec2-manager/events?q=%22finished%20inserting%20spot%20request%20into%20database%20(workerType%3Dgecko-t-win7-32-gpu%2C%22&focus=873040289345454131
Comment 2•7 years ago
Scrolling through https://papertrailapp.com/systems/ec2-manager/events?q=%22finished%20inserting%20spot%20request%20into%20database%20(workerType%3Dgecko-t-win7-32-gpu%2C%22&focus=873040182772383815 I see pages of:

workerType=gecko-t-win7-32-gpu, region=eu-central-1, az=eu-central-1a, instanceType=g2.2xlarge, imageId=ami-8ef54fe1, id=sir-4svij58p, state=open, status=pending-evaluation

@jhford, is there no round-robin at all? Even if we want to optimize for spot price, we could still balance the requests, so that the cheapest region gets 5 requests for every 1 request the second cheapest gets, or some scheme like this: use an exponentially decreasing series to distribute machines across all regions, with the heaviest weight on the cheapest region.
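The exponentially weighted scheme described above could be sketched roughly like this. This is a hypothetical illustration, not ec2-manager code; the region names, prices, and the `decay` parameter are all made up for the example.

```python
import random

def weighted_region_choice(region_prices, decay=0.2):
    """Pick a region, preferring cheaper spot prices.

    Weights fall off exponentially with price rank, so the cheapest
    region receives most requests but pricier regions still get a
    share instead of being starved entirely. Hypothetical sketch;
    not the actual ec2-manager selection logic.
    """
    # Rank regions cheapest-first, then weight rank r by decay**r.
    ranked = sorted(region_prices.items(), key=lambda kv: kv[1])
    regions = [name for name, _ in ranked]
    weights = [decay ** rank for rank in range(len(ranked))]
    return random.choices(regions, weights=weights, k=1)[0]
```

With `decay=0.2` the cheapest region gets roughly 80% of requests and the second-cheapest roughly 16%, which is close to the "5 requests for every 1" balance suggested above while still spreading load.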
Updated•7 years ago
Flags: needinfo?(jhford)
Comment 3•7 years ago
Closed trees for this: Windows gl + gpu instances + Windows 7 pgo reftest jobs don't run.

@Taskcluster people: if you think the trees can be reopened, please let the people in #sheriffs know. There are Softvision sheriffs on duty who haven't had much contact with tc infra issues yet, so a helping hand is welcome. Thank you in advance.
Comment 4•7 years ago
16:43:06 <@dustin> ok
16:44:56 <@dustin> region removed, only for gecko-t-win7-32-gpu
16:45:24 <@dustin> and all gecko-t-win10-64-gpu
16:45:31 <@dustin> terminated
Comment 5•7 years ago
17:00:14 <@dustin> pending #'s seem to be trending downish
Comment 6•7 years ago
As we expected last night, the pending count for gecko-t-win10-64-gpu was back up just now, with 500 instances running. I killed all the running instances, so they should start to re-provision. I filed bug 1422295 to hand that particular issue off to releng.

The broader issue here seems to have something to do with AWS's changes this week -- bug 1422301 for that. I'll leave this open for the needinfo.
Comment 7•7 years ago
@jhford, from https://aws.amazon.com/blogs/aws/amazon-ec2-update-streamlined-access-to-spot-capacity-smooth-price-changes-instance-hibernation/:

> As part of today's launch we are also changing the way that Spot prices change, moving to a model where prices adjust more gradually, based on longer-term trends in supply and demand.

Hence, the price won't change when there is no more capacity in a region. I suspect that's why we're getting hit so hard. dustin suggests that we need to get the "capacity-oversubscribed" signal back into the provisioning logic, so that it stops provisioning in oversubscribed regions.

In this light, I think weighted preference of regions based on price is less important. It could still be a decent tool to ensure that regions with almost the same price (relative to max price) are used to allocate approximately the same number of machines, making us more resilient. But handling the "capacity-oversubscribed" error is probably the most important part.
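The "stop provisioning in oversubscribed regions" idea amounts to a per-region cooldown keyed on EC2's capacity error codes. A minimal sketch, assuming a 15-minute cooldown and the two error strings mentioned in this bug; none of this is the actual ec2-manager implementation.

```python
import time

class RegionBackoff:
    """Skip regions that recently reported EC2 capacity errors.

    Hypothetical sketch: when a spot request fails with a capacity
    error, the region is blocked for `cooldown_seconds` before the
    provisioner will try it again. The cooldown length and error
    codes are illustrative assumptions.
    """

    CAPACITY_ERRORS = {"capacity-oversubscribed", "capacity-not-available"}

    def __init__(self, cooldown_seconds=900, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock  # injectable for testing
        self._blocked_until = {}  # region -> timestamp when usable again

    def record_error(self, region, status_code):
        """Block a region if the spot request failed for capacity reasons."""
        if status_code in self.CAPACITY_ERRORS:
            self._blocked_until[region] = self.clock() + self.cooldown

    def usable_regions(self, regions):
        """Return only the regions not currently in cooldown."""
        now = self.clock()
        return [r for r in regions if self._blocked_until.get(r, 0) <= now]
```

The cooldown keeps the provisioner from hammering a region whose price no longer rises to signal exhaustion, while still retrying it automatically once the window expires.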
Comment 8•7 years ago
Oh, and a fun thing EC2 says: capacity-not-available
Comment 9•7 years ago
Yes, I'm happy to work on that, but the work was de-prioritized over the last few quarters because we weren't sure when the changes to the EC2 API were going to land, or exactly what those changes would be. I think this is a good thing to work on for the rest of the year and in Q1: specifically, dealing with the new pricing model, handling the feedback from EC2, using the new InstanceMarket option (or whatever they're calling it), and supporting runInstances.
Flags: needinfo?(jhford)
Updated•6 years ago
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated•5 years ago
Component: Operations → Operations and Service Requests