slow provisioning time for backfilled jobs using gecko-t-linux-large
Categories
(Infrastructure & Operations :: RelOps: General, defect)
Tracking
(Not tracked)
People
(Reporter: aryx, Unassigned)
References
Details
Attachments
(2 files, 1 obsolete file)
It completed shortly after the request, at ~10:03 UTC: https://tools.taskcluster.net/groups/EE_cP2LdRqO17HOc96qYrw/tasks/QEGWgT3LQaShJEGOOW4ESw/runs/0
The backfilled jobs got their workers with delay; the longest-waiting ones at ~10:40 UTC: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception%2Cusercancel%2Cretry%2Csuperseded&tochange=c670f40348ee7794f031ef7f84019592d56154cd&fromchange=0a7463d82e5f5f3cbca589baab9abc07881f5142&searchStr=linux%2Cx64%2Cdebug%2Cm%287%29&selectedJob=261970424
Grafana shows the backlog as growing from 10:00 UTC https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&refresh=1m&var-provisioner=aws-provisioner-v1&var-workerType=All&from=now-3h&to=now
Still, that a production tree doesn't get machines is (a) worrisome and (b) confusing, because the merges will need those workers later.
Any idea why provisioning took so long, and, if workers were provided to other jobs at that time, what those jobs were?
:tomprince and :dustin, cc'ing you both as a continuation of the gecko-t-linux-large machines getting provisioned slowly or not at all (also with the new instance type).
In the aws-provisioner logs (https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?focus=1099026134466719746 in Mozilla Treeherder papertrail), there are repeated failed requests like:
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: 14:24:36.464Z ERROR aws-provisioner-production: failed to submit spot request (price=8, workerType=gecko-t-linux-large, region=us-west-1, zone=us-west-1c, instanceType=m5.large)
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: err: {
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "timeout": 2000,
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "code": "ECONNABORTED",
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "errno": "ETIME"
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: }
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: 14:24:36.464Z ERROR aws-provisioner-production: error submitting spot request (region=us-west-1, workerType=gecko-t-linux-large, instanceType=m5.large, zone=us-west-1c)
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: err: {
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "timeout": 2000,
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "code": "ECONNABORTED",
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "errno": "ETIME"
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: }
I attached a screenshot of the graph of workers vs. queue for gecko-t-linux-large and -xlarge.
-large is set to scalingRatio 0.1, and -xlarge uses the default, but I think the connection timeouts/errors in the logs show the scalingRatio is not the cause of the slow scaling-up.
(In reply to Dave House [:dhouse] from comment #2)
In the aws-provisioner logs (https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?focus=1099026134466719746 in Mozilla Treeherder papertrail), there are repeated failed requests like:
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: 14:24:36.464Z ERROR aws-provisioner-production: failed to submit spot request (price=8, workerType=gecko-t-linux-large, region=us-west-1, zone=us-west-1c, instanceType=m5.large)
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: err: {
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "timeout": 2000,
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "code": "ECONNABORTED",
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "errno": "ETIME"
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: }
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: 14:24:36.464Z ERROR aws-provisioner-production: error submitting spot request (region=us-west-1, workerType=gecko-t-linux-large, instanceType=m5.large, zone=us-west-1c)
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: err: {
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "timeout": 2000,
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "code": "ECONNABORTED",
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: "errno": "ETIME"
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: }
Dustin, have you seen these errors before for aws-provisioner?
There are also repeated log entries at ERROR level that show what looks like pricing for different regions (but without a message explaining whether it is an error):
14:22:37.631Z ERROR aws-provisioner-production: (workerType=gecko-t-linux-large, region=us-west-1, type=m5.large)
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: pricingData: {
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: "us-west-1c": 0.0391,
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: "us-west-1b": 0.0433
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: }
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: --
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: zones: [
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: "us-west-1c",
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: "us-west-1b"
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: ]
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: 14:22:37.631Z ERROR aws-provisioner-production: (workerType=gecko-t-linux-large, region=us-west-2, type=m5.large)
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: pricingData: {
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: "us-west-2d": 0.0338
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: }
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: --
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: zones: [
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: "us-west-2d"
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: ]
Comment 8•6 years ago
I have, but I don't know what they mean. They do happen often.
It's a call to the ec2-manager API. I'm not sure why it would have a 2-second timeout; that seems low! Heroku metrics say the 50th-percentile response time for that service is about 2.5s, and that seems to have started at about 14:05 UTC.
One thing to try is restarting the ec2-manager process. I'll do that now.
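As a side note, the failure mode is easy to reproduce in isolation: a hard client-side timeout racing a call whose typical latency exceeds it will abort most requests. This is only an illustrative sketch; the `callWithTimeout` helper is hypothetical and not the actual aws-provisioner code:

```javascript
// Hypothetical illustration (not actual aws-provisioner/ec2-manager code):
// race a simulated API call against a hard client-side timeout, the way a
// 2000 ms timeout would abort calls to a service whose p50 latency is ~2.5 s.
function callWithTimeout(latencyMs, timeoutMs) {
  const call = new Promise((resolve) =>
    setTimeout(() => resolve('ok'), latencyMs));
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('ECONNABORTED')), timeoutMs));
  return Promise.race([call, timeout]);
}

// Latency above the timeout (25 ms vs. 20 ms here, standing in for
// ~2.5 s vs. 2 s): the call is aborted before it can complete.
callWithTimeout(25, 20)
  .then(() => console.log('succeeded'))
  .catch((e) => console.log('aborted:', e.message)); // prints "aborted: ECONNABORTED"
```

With the observed p50 of ~2.5s against a 2s timeout, more than half of all requests would abort this way, which matches the repeated ECONNABORTED/ETIME errors in the logs above.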
Comment 9•6 years ago
And, yes, sorry, welcome to AWS provisioner and its sidekick ec2-manager. Their logging is super-confusing and ultimately not very helpful.
In that particular case, I think it was some debugging code added years ago for this particular workerType to display pricing data.
https://github.com/taskcluster/aws-provisioner/blob/af65d0cf1194c29d6d6eb98165570392108c35eb/lib/worker-type.js#L898-L904
Comment 10•6 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #8)
I have, but I don't know what they mean. They do happen often.
It's a call to the ec2-manager API. I'm not sure why it would have a 2-second timeout; that seems low! Heroku metrics say the 50th-percentile response time for that service is about 2.5s, and that seems to have started at about 14:05 UTC.
One thing to try is restarting the ec2-manager process. I'll do that now.
Thank you! I see some more descriptive errors (showing an internal server error from ec2-manager) starting at 6:54 Pacific: https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?focus=1099033683333120004
Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1: 14:54:31.399Z ERROR aws-provisioner-production: failed to submit spot request (price=8, workerType=gecko-t-linux-large, region=us-west-1, zone=us-west-1b, instanceType=m5.large)
Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1: err: {
Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1: "body": {
Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1: "code": "InternalServerError",
Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1: "message": "Internal Server Error, incidentId: e19376f7-3800-4b8f-97ea-4f2075b8e703\n\n---\n\n* method: runInstance\n* errorCode: InternalServerError\n* statusCode: 500\n* time: 2019-08-16T14:54:31.397Z",
Comment 11•6 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #9)
And, yes, sorry, welcome to AWS provisioner and its sidekick ec2-manager. Their logging is super-confusing and ultimately not very helpful.
In that particular case, I think it was some debugging code added years ago for this particular workerType to display pricing data.
https://github.com/taskcluster/aws-provisioner/blob/af65d0cf1194c29d6d6eb98165570392108c35eb/lib/worker-type.js#L898-L904
That is understandable now. Thanks!
Comment 12•6 years ago
I reopened the support case with AWS, but then closed it again after Dustin explained that the failing requests are from aws-provisioner to ec2-manager. So the errors are 500s or timeouts from ec2-manager, not an AWS request failing.
Comment 13•6 years ago
Would resetting the scalingRatio to the default speed up the scaling (or possibly help)?
Comment 14•6 years ago
Ah, yes, that's a good idea. I think scalingRatio=0.1 means to start only 10% of the number of instances it would start by default, meaning the startup would be a lot slower (a good configuration for spiky loads of quick tasks, where a pending count of 100 might be worked through in a few minutes by just a few workers).
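To make the arithmetic concrete, here is a hedged sketch of that effect. The `instancesToRequest` helper is hypothetical, not the provisioner's real algorithm; it just applies the "start only ratio × demand" interpretation described above:

```javascript
// Hypothetical sketch of the scalingRatio effect described above; this is
// NOT aws-provisioner's actual algorithm, just the "start only 10% of what
// it would start by default" interpretation applied to a pending backlog.
function instancesToRequest(pendingTasks, runningCapacity, ratio) {
  // Capacity the provisioner would try to add for the current backlog.
  const demand = Math.max(0, pendingTasks - runningCapacity);
  // The ratio scales that demand down before spot requests are submitted.
  return Math.ceil(demand * ratio);
}

console.log(instancesToRequest(100, 0, 1.0)); // full demand: prints 100
console.log(instancesToRequest(100, 0, 0.1)); // ratio 0.1: prints 10
```

Under that reading, a backlog of 100 pending tasks produces only ~10 spot requests per provisioning iteration at ratio 0.1, which would explain the slow catch-up on long-running test backfills.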
Comment 15•6 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #14)
Ah, yes, that's a good idea. I think scalingRatio=0.1 means to start only 10% of the number of instances it would start by default, meaning the startup would be a lot slower (a good configuration for spiky loads of quick tasks, where a pending count of 100 might be worked through in a few minutes by just a few workers).
Thanks! On Monday, I'll commit and push this change to ci-configuration and apply it to the workerType.
Comment 16•6 years ago
:bstack (low priority), could you include this change in the next ci-admin apply and commit/push it to ci-configuration? (I don't have access.)
Comment 17•6 years ago
Comment 18•6 years ago
Just applied this now, thanks.
Comment 19•6 years ago
(In reply to Justin Wood (:Callek) from comment #17)
Comment on attachment 9086082 [details] [diff] [review]
reset scalingRatio to default for gecko-t-linux-large
This repo also supports phabricator!
But landed:
https://treeherder.mozilla.org/#/jobs?repo=ci-configuration&revision=16c96d9d502a9e5d57157c324564dc6741d3d456
Thanks!
I verified the scalingRatio is now 0: https://tools.taskcluster.net/aws-provisioner/gecko-t-linux-large/view shows scalingRatio: 0.