Closed Bug 1577837 Opened 5 years ago Closed 5 years ago

[aws provider] load-test the aws provider

Categories

(Taskcluster :: Services, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: owlish)

References

Details

Verify that the aws provider can handle thousands of instances simultaneously.

Status: NEW → ASSIGNED
Summary: load-test the aws provider → [aws provider] load-test the aws provider

So it turns out we can only spin up 20 instances on that account in that region. I filed a limit increase request. So this ticket is on hold, waiting for support to reply...

lol they already approved it! that was fast

Results for 100 instances: they spun up OK, registered OK, and the status check also worked (I stopped 1 instance to verify it). This mainly exercised the provisioning loop and registration.
Provisioning: :44:32 to :44:32 (less than 1 s). Registering: :45:47 to :46:13 (~30 s).

Results for 1000 instances: "Error calling AWS API: RequestLimitExceeded: Request limit exceeded." in the provisioning loop.

Also, for the worker pool with EBS volumes: "Error calling AWS API: InsufficientInstanceCapacity: We currently do not have sufficient m5d.xlarge capacity in zones with support for 'gp2' volumes. Our system will be working on provisioning additional capacity."

When I tried default m1.small instances, the results for 1000 were more satisfactory, so apparently it depends on the instance type.
Provisioning: :17:41 to :17:42 (1 s). Registering: :20:42 to :28:53 (~8 min).
One instance was immediately terminated (Server.InternalError: Internal error on launch); no replacement was spun up ("Error calling AWS API: RequestLimitExceeded: Request limit exceeded." in the logs).

Wow, that's good news so far! Is the RequestLimitExceeded recorded as a WorkerPoolError, too? Can you tell which request limit that is?

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #5)
> Wow, that's good news so far! Is the RequestLimitExceeded recorded as a WorkerPoolError, too? Can you tell which request limit that is?

It wasn't (unless I cleaned up that table together with the workerpool and worker tables on Friday) - I'll make sure it saves these types of errors there.

As for your second question, I'm currently investigating that. I'm also investigating a couple of other things that might help us with these errors.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #5)
> Can you tell which request limit that is?

If you mean "which request", it's the runInstances endpoint (describeInstances has a higher limit, and I don't see errors from it). If you mean "which limit", as in how many calls per unit of time, I don't know exactly what it is, but I can bet a bottle of beer that the problem here is not the request limit per se; the real problem is that there aren't enough instances in the region. The first error is "Error calling AWS API: InsufficientInstanceCapacity: We currently do not have sufficient m5d.xlarge capacity in zones with support for 'gp2' volumes. Our system will be working on provisioning additional capacity.", and it is followed by "RequestLimitExceeded: Request limit exceeded" on the subsequent provisioning loop iterations. Apparently, retries after that particular error (InsufficientInstanceCapacity) have to be exponentially spaced.

I think there are several ways of dealing with this:

  • To actually implement exponential retries for that particular error (which seems to complicate the code considerably); see the sketch after this list
  • To experiment with idempotent requests (also touched on in that sketch)
  • To experiment with requestSpotFleet/requestSpotInstances endpoints
  • To experiment with spot options of runInstances
  • To see if this problem can be solved by talking to aws support

I'm looking into them at the moment
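For reference, here's a rough sketch of what exponential backoff plus an idempotency token around runInstances could look like. This is not the actual worker-manager code; the helper and the constants are made up, and only the aws-sdk calls (runInstances, the ClientToken parameter, AWSError.code) are real:

    import { EC2, AWSError } from 'aws-sdk';
    import { randomBytes } from 'crypto';

    const RETRYABLE = new Set(['RequestLimitExceeded', 'InsufficientInstanceCapacity']);

    // Hypothetical helper: retry runInstances with exponential backoff and a ClientToken.
    async function runInstancesWithBackoff(
      ec2: EC2,
      params: EC2.RunInstancesRequest,
      maxAttempts = 5,
    ): Promise<EC2.Reservation> {
      // ClientToken makes the call idempotent: a retried identical request returns
      // the same reservation instead of launching a second batch of instances.
      const request = { ...params, ClientToken: params.ClientToken ?? randomBytes(16).toString('hex') };

      for (let attempt = 0; ; attempt++) {
        try {
          return await ec2.runInstances(request).promise();
        } catch (err) {
          const e = err as AWSError;
          if (!RETRYABLE.has(e.code) || attempt + 1 >= maxAttempts) {
            throw e; // not retryable, or out of attempts: surface it (e.g. as a WorkerPoolError)
          }
          // Exponential backoff with jitter: ~1 s, 2 s, 4 s, ... plus up to 1 s of noise.
          const delayMs = 1000 * 2 ** attempt + Math.floor(Math.random() * 1000);
          await new Promise(resolve => setTimeout(resolve, delayMs));
        }
      }
    }

Note that the aws-sdk client can also retry throttling errors on its own (maxRetries / retryDelayOptions on the client config), which might be enough without hand-rolling a loop like this.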

...and the errors don't end up in the workerPoolError table for some reason; I'll debug that.

I repeated the 1000-instance test with t2.nano instances. The AWS provider spun up around 5000 instances, mainly because registering takes longer than a provisioning loop iteration, so the loop keeps requesting more capacity before the earlier instances have registered.
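To illustrate that over-provisioning (with hypothetical shapes, not the actual worker-manager data model): if the loop only counts registered workers, it keeps asking for more; counting requested-but-not-yet-registered instances as existing capacity closes the gap.

    // Hypothetical shapes, not the worker-manager data model.
    interface PoolState {
      desiredCapacity: number;    // how many workers the pool wants
      registeredWorkers: number;  // instances that have already registered
      pendingInstances: number;   // instances requested but not yet registered
    }

    function instancesToRequest(state: PoolState): number {
      // Count pending instances as existing capacity so the loop doesn't
      // re-request them while they are still booting and registering.
      const existing = state.registeredWorkers + state.pendingInstances;
      return Math.max(0, state.desiredCapacity - existing);
    }

    // 1000 desired, 0 registered yet, 1000 already requested -> request 0 more,
    // instead of piling up ~5000 instances over several loop iterations.
    console.log(instancesToRequest({ desiredCapacity: 1000, registeredWorkers: 0, pendingInstances: 1000 }));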

Another error encountered was "InsufficientFreeAddressesInSubnet: insufficient free addresses to allocate 1000 addresses in subnet". We need to make sure the load is spread across subnets/regions. Along with this, we need to make sure the requests are idempotent (or something along those lines); otherwise it would be even worse than it is now.
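As a rough idea of what "spread over subnets" could look like (the planning helper is hypothetical; describeSubnets and AvailableIpAddressCount are the real aws-sdk/EC2 pieces):

    import { EC2 } from 'aws-sdk';

    // Hypothetical planner: decide how many instances to request in each subnet,
    // skipping subnets that have run out of free addresses.
    async function planLaunchesAcrossSubnets(
      ec2: EC2,
      subnetIds: string[],
      toLaunch: number,
    ): Promise<Map<string, number>> {
      const { Subnets = [] } = await ec2.describeSubnets({ SubnetIds: subnetIds }).promise();
      const free = new Map<string, number>(
        Subnets.map(s => [s.SubnetId!, s.AvailableIpAddressCount ?? 0] as [string, number]),
      );

      const plan = new Map<string, number>();
      let remaining = toLaunch;
      while (remaining > 0) {
        const usable = subnetIds.filter(id => (free.get(id) ?? 0) > 0);
        if (usable.length === 0) break; // nowhere left to put instances; report the shortfall
        // Round-robin one instance at a time across the usable subnets.
        for (const id of usable) {
          if (remaining === 0) break;
          plan.set(id, (plan.get(id) ?? 0) + 1);
          free.set(id, (free.get(id) ?? 0) - 1);
          remaining--;
        }
      }
      return plan; // subnetId -> instance count, one runInstances call per subnet
    }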

I am suspending further testing to eliminate the bugs found. I'll link the tickets to this one.

Depends on: 1578900
Depends on: 1578902
Depends on: 1578904
Depends on: 1579554
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED