Pending wait times started climbing around 11:00 PST, and many thousands of tasks were observed pending in the queues for multiple worker types; gecko-t-linux* workers were the most impacted. Relevant SignalFx graph: https://app.signalfx.com/#/dashboard/Cp7oeIXAYDI?density=4&startTimeUTC=1498586347000&endTimeUTC=1498596281000

Looking into this, it appears that a provisioning iteration started but had still not completed after more than an hour, and there were very few (fewer than a dozen at any time) spot requests in the ec2-manager spot requests table. Dead Man's Snitch also reported that it had not heard from the provisioner. After the provisioner was restarted at 13:10 PT, over 700 machines were spawned within a 30-minute period, which greatly reduced the pending wait times.

So, I'm not sure why:
1. provisioning iterations are taking so long (or not completing at all)
2. spot requests were delayed to the point where worker types were starved
Summary: Provisioning iteration failed - instances not spawning → Provisioning iteration hung - instances not spawning
The spot requests table likely won't ever have much in it. That table only tracks spot requests for which we haven't yet received a CloudWatch event. Now that spot requests are being fulfilled exceptionally quickly, I've noticed that they're removed from that table before the first polling iteration even happens.

I'm not sure why the provisioner seemed to freeze, but I did land a patch (https://github.com/taskcluster/aws-provisioner/commit/9607f1399e5e1ef097e77215568d7fe56000be59) which gives the provisioner much shorter iterations: it will request a maximum of 200 instances per region per iteration. Since we now have a very short iteration interval, this shouldn't have any negative impact on provisioning.

The advantage of this patch is twofold: we get faster state updates in the provisioner UI, and we get more frequent calculations of need. Until the /state/:workerType endpoint is changed to use the ec2-manager directly, we only get status updates at the end of an iteration; now we know that will happen approximately every ~240s at most. When we're submitting a bunch of spot requests all at once, we now get a recalculation of demand every ~240s at most, which means instances requested earlier in the iteration have a chance to work off some of the demand.
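To illustrate the idea behind the patch, here is a minimal sketch of capping spot-request submissions per region per iteration. The names (`planIteration`, `MAX_INSTANCES_PER_REGION`, `demandByRegion`) are illustrative assumptions, not the actual aws-provisioner code:

```javascript
// Hypothetical sketch: cap how many instances we request per region in a
// single provisioning iteration. Any remaining demand is reconsidered on
// the next (short, ~240s) iteration, after demand has been recalculated.
const MAX_INSTANCES_PER_REGION = 200;

function planIteration(demandByRegion) {
  // demandByRegion maps region name -> number of instances needed,
  // e.g. { 'us-east-1': 950, 'us-west-2': 120 }
  const plan = {};
  for (const [region, demand] of Object.entries(demandByRegion)) {
    // Request at most 200 instances now; instances spawned by these
    // requests can work off demand before the next recalculation.
    plan[region] = Math.min(demand, MAX_INSTANCES_PER_REGION);
  }
  return plan;
}

// Example: a large backlog in one region is worked off across iterations.
console.log(planIteration({ 'us-east-1': 950, 'us-west-2': 120 }));
// → { 'us-east-1': 200, 'us-west-2': 120 }
```

The design trade-off is that a huge backlog takes several iterations to submit, but because iterations are now short, each batch of requests is informed by a fresh demand calculation rather than a stale one from the start of a long iteration.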
Assignee: nobody → jhford
Status: NEW → RESOLVED
Last Resolved: 2 months ago
Resolution: --- → FIXED