Bug 1574453 (Closed) — Opened 6 years ago, Closed 6 years ago

slow provisioning time for backfilled jobs using gecko-t-linux-large

Categories

(Infrastructure & Operations :: RelOps: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: aryx, Unassigned)

References

Details

Attachments

(2 files, 1 obsolete file)

Assignee: nobody → relops
Component: CIDuty → RelOps: General
QA Contact: dlabici → klibby

:tomprince and :dustin, cc'ing you both as a continuation of the gecko-t-linux-large machines getting provisioned slowly or not at all (also with the new instance type).

In the aws-provisioner logs (https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?focus=1099026134466719746 in Mozilla Treeherder papertrail), there are repeated failed requests like:

Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: 14:24:36.464Z ERROR aws-provisioner-production: failed to submit spot request (price=8, workerType=gecko-t-linux-large, region=us-west-1, zone=us-west-1c, instanceType=m5.large) 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:     err: { 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "timeout": 2000, 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "code": "ECONNABORTED", 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "errno": "ETIME" 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:     } 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: 14:24:36.464Z ERROR aws-provisioner-production: error submitting spot request (region=us-west-1, workerType=gecko-t-linux-large, instanceType=m5.large, zone=us-west-1c) 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:     err: { 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "timeout": 2000, 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "code": "ECONNABORTED", 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "errno": "ETIME" 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:     } 

I attached a screenshot of the graph for workers vs. queue for gecko-t-linux-large and -xlarge.

Attachment #9086065 - Attachment is obsolete: true

-large is set to scalingRatio 0.1, and -xlarge is default, but I think the connection timeout/errors in the logs show the scalingRatio is not the cause of the slow scaling-up.

(In reply to Dave House [:dhouse] from comment #2)

In the aws-provisioner logs (https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?focus=1099026134466719746 in Mozilla Treeherder papertrail), there are repeated failed requests like:

Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: 14:24:36.464Z ERROR aws-provisioner-production: failed to submit spot request (price=8, workerType=gecko-t-linux-large, region=us-west-1, zone=us-west-1c, instanceType=m5.large) 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:     err: { 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "timeout": 2000, 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "code": "ECONNABORTED", 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "errno": "ETIME" 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:     } 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1: 14:24:36.464Z ERROR aws-provisioner-production: error submitting spot request (region=us-west-1, workerType=gecko-t-linux-large, instanceType=m5.large, zone=us-west-1c) 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:     err: { 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "timeout": 2000, 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "code": "ECONNABORTED", 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:       "errno": "ETIME" 
Aug 16 07:24:36 taskcluster-aws-provisioner2 app/provisioner.1:     } 

Dustin, have you seen these errors before for aws-provisioner?

Flags: needinfo?(dustin)

There are also repeated log entries at ERROR level that show what looks like the pricing for different regions (but without a message explaining whether it is an error):

14:22:37.631Z ERROR aws-provisioner-production:  (workerType=gecko-t-linux-large, region=us-west-1, type=m5.large) 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     pricingData: { 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:       "us-west-1c": 0.0391, 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:       "us-west-1b": 0.0433 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     } 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     -- 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     zones: [ 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:       "us-west-1c", 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:       "us-west-1b" 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     ] 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1: 14:22:37.631Z ERROR aws-provisioner-production:  (workerType=gecko-t-linux-large, region=us-west-2, type=m5.large) 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     pricingData: { 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:       "us-west-2d": 0.0338 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     } 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     -- 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     zones: [ 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:       "us-west-2d" 
Aug 16 07:22:37 taskcluster-aws-provisioner2 app/provisioner.1:     ] 

I have, but I don't know what they mean. They do happen often.

It's a call to the ec2-manager API. I'm not sure why it would have a 2-second timeout. That seems low! Heroku metrics say the 50th percentile response time for that service is about 2.5s, and that seems to have started about 14:05 UTC.

One thing to try is restarting the ec2-manager process. I'll do that now.
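For context on how a client-side timeout shorter than the backend's median response time produces the errors seen above, here is an illustrative sketch (not the actual aws-provisioner code; the helper and its names are hypothetical) of a 2000 ms timeout aborting a call to a service that typically takes ~2.5s:

```javascript
// Illustrative sketch only: how a 2000 ms client-side timeout can abort
// a call to a slow backend, surfacing as code ECONNABORTED like the logs.
const TIMEOUT_MS = 2000;

// Race a request promise against a timer; reject with an error shaped
// like the one in the provisioner logs if the timer fires first.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`request aborted after ${ms} ms`);
      err.code = 'ECONNABORTED'; // matches the code seen in the logs
      err.timeout = ms;
      reject(err);
    }, ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Simulate a backend whose response time (~2500 ms, like the 50th
// percentile reported by Heroku) exceeds the 2000 ms timeout.
const slowCall = new Promise((resolve) => setTimeout(resolve, 2500, 'ok'));

withTimeout(slowCall, TIMEOUT_MS).catch((err) => {
  console.log(err.code); // ECONNABORTED
});
```

With the median response time above the timeout, roughly half of all calls would abort this way, which matches the repeated ECONNABORTED entries in the provisioner log.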

Flags: needinfo?(dustin)

And, yes, sorry, welcome to AWS provisioner and its sidekick ec2-manager. Their logging is super-confusing and ultimately not very helpful.

In that particular case, I think it was some debugging code added years ago for this particular workerType to display pricing data.
https://github.com/taskcluster/aws-provisioner/blob/af65d0cf1194c29d6d6eb98165570392108c35eb/lib/worker-type.js#L898-L904

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #8)

I have, but I don't know what they mean. They do happen often.

It's a call to the ec2-manager API. I'm not sure why it would have a 2-second timeout. That seems low! Heroku metrics say the 50th percentile response time for that service is about 2.5s, and that seems to have started about 14:05 UTC.

One thing to try is restarting the ec2-manager process. I'll do that now.

Thank you! I see some more descriptive errors (showing an internal server error from ec2) starting at 6:54 Pacific: https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?focus=1099033683333120004


Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1: 14:54:31.399Z ERROR aws-provisioner-production: failed to submit spot request (price=8, workerType=gecko-t-linux-large, region=us-west-1, zone=us-west-1b, instanceType=m5.large) 
Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1:     err: { 
Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1:       "body": { 
Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1:         "code": "InternalServerError", 
Aug 16 07:54:31 taskcluster-aws-provisioner2 app/provisioner.1:         "message": "Internal Server Error, incidentId: e19376f7-3800-4b8f-97ea-4f2075b8e703\n\n---\n\n* method:     runInstance\n* errorCode:  InternalServerError\n* statusCode: 500\n* time:       2019-08-16T14:54:31.397Z", 

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #9)

And, yes, sorry, welcome to AWS provisioner and its sidekick ec2-manager. Their logging is super-confusing and ultimately not very helpful.

In that particular case, I think it was some debugging code added years ago for this particular workerType to display pricing data.
https://github.com/taskcluster/aws-provisioner/blob/af65d0cf1194c29d6d6eb98165570392108c35eb/lib/worker-type.js#L898-L904

That is understandable now. Thanks!

I reopened the support case with AWS, but then closed it again after Dustin explained that the service errors/requests are from aws-provisioner to ec2-manager. So the errors are 500s or timeouts from ec2-manager, not an AWS request failing.

Would resetting the scalingRatio to the default speed up scaling (or at least help)?

Attachment #9086082 - Flags: review?(wcosta)
Attachment #9086082 - Flags: review?(dustin)

Ah, yes, that's a good idea. I think scalingRatio=0.1 means it starts only 10% of the number of instances it would start by default, so the scale-up would be a lot slower. (That's a good configuration for spiky loads of quick tasks, where a pending count of 100 might be executed in a few minutes by just a few workers.)
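As a rough model of the behavior described above (the provisioner's real formula may differ, and the function name here is hypothetical), scalingRatio scales down the capacity requested for a given pending backlog:

```javascript
// Rough model of scalingRatio's effect: it scales down how much new
// capacity is requested for a given pending-task backlog. The real
// aws-provisioner formula may differ.
function capacityToRequest(pendingTasks, runningCapacity, scalingRatio = 1.0) {
  const demand = Math.ceil(pendingTasks * scalingRatio) - runningCapacity;
  return Math.max(0, demand);
}

// Default ratio: 100 pending tasks and no workers -> request 100 instances.
console.log(capacityToRequest(100, 0, 1.0)); // 100
// scalingRatio=0.1: the same backlog requests only 10 instances, so a
// backfill spike (lots of long-running tasks) drains much more slowly.
console.log(capacityToRequest(100, 0, 0.1)); // 10
```

Under this model, a low ratio is fine when tasks are short (a few workers churn through the queue quickly) but starves long-running backfill jobs of capacity, which matches the slow provisioning reported here.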

Attachment #9086082 - Flags: review?(dustin) → review+

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #14)

Ah, yes, that's a good idea. I think scalingRatio=0.1 means it starts only 10% of the number of instances it would start by default, so the scale-up would be a lot slower. (That's a good configuration for spiky loads of quick tasks, where a pending count of 100 might be executed in a few minutes by just a few workers.)

Thanks! On Monday, I'll commit and push this change to ci-configuration and apply it to the workerType.

Flags: needinfo?(dhouse)

:bstack (low-priority) could you include this change in the next ci-admin apply and commit/push this change to ci-configuration? (I don't have access)

Flags: needinfo?(dhouse) → needinfo?(bstack)
Flags: needinfo?(bstack) → needinfo?(dustin)
Flags: needinfo?(dustin) → needinfo?(bugspam.Callek)
Attachment #9086082 - Flags: review?(wcosta) → review+
Comment on attachment 9086082 [details] [diff] [review]
reset scalingRatio to default for gecko-t-linux-large

This repo also supports phabricator!

But landed: https://treeherder.mozilla.org/#/jobs?repo=ci-configuration&revision=16c96d9d502a9e5d57157c324564dc6741d3d456

Just applied this now, thanks.

Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(bugspam.Callek)
Resolution: --- → FIXED

(In reply to Justin Wood (:Callek) from comment #17)

Comment on attachment 9086082 [details] [diff] [review]
reset scalingRatio to default for gecko-t-linux-large

This repo also supports phabricator!

But landed:
https://treeherder.mozilla.org/#/jobs?repo=ci-configuration&revision=16c96d9d502a9e5d57157c324564dc6741d3d456

Thanks!
I verified the scalingRatio is now 0: https://tools.taskcluster.net/aws-provisioner/gecko-t-linux-large/view (scalingRatio: 0)

Status: RESOLVED → VERIFIED
