[aws provider] load-test the aws provider
Categories
(Taskcluster :: Services, task)
Tracking
(Not tracked)
People
(Reporter: dustin, Assigned: owlish)
References
Details
Verify that the aws provider can handle thousands of instances simultaneously.
Assignee
Comment 1•5 years ago
So it turns out we can only spin up 20 instances on that account in that region. I submitted a limit increase request, so this ticket is on hold until support replies...
Assignee
Comment 2•5 years ago
lol they already approved it! that was fast
Results for 100 instances: they spun up OK, registered OK, and check-status also worked (I stopped one instance). This mainly exercised the provisioning loop and registration.
Provisioning: :44:32 to :44:32 (less than 1 s). Registering: :45:47 to :46:13 (~30 s).
Results for 1000 instances: Error calling AWS API: RequestLimitExceeded: Request limit exceeded, in the provisioning loop.
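For context, a single runInstances call can request a whole batch of instances, which is presumably why provisioning finished in under a second while the throttling only appears once the loop issues many calls. A minimal sketch of such a batched call, assuming the AWS SDK v3 EC2 client; the region, AMI, and instance type are placeholders, not the values used in this test:

```ts
import { EC2Client, RunInstancesCommand } from "@aws-sdk/client-ec2";

// Placeholder region and AMI; not the staging account's actual values.
const ec2 = new EC2Client({ region: "us-west-2" });

async function provisionBatch(count: number) {
  // One RunInstances call can request a batch via MinCount/MaxCount, so the
  // provisioning loop itself is cheap; the per-call request limit is what
  // eventually produces RequestLimitExceeded.
  const res = await ec2.send(
    new RunInstancesCommand({
      ImageId: "ami-00000000000000000", // placeholder
      InstanceType: "m5d.xlarge",
      MinCount: 1, // accept a partial batch if full capacity is unavailable
      MaxCount: count,
    })
  );
  return res.Instances ?? [];
}
```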
Assignee
Comment 3•5 years ago
Also, for a worker pool with EBS volumes: Error calling AWS API: InsufficientInstanceCapacity: We currently do not have sufficient m5d.xlarge capacity in zones with support for 'gp2' volumes. Our system will be working on provisioning additional capacity.
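For reference, this is roughly the shape of a launch request that is pinned to gp2-capable zones. A sketch only, assuming a root EBS volume; the device name, volume size, and AMI are assumptions, not the worker pool's actual configuration:

```ts
import { RunInstancesCommand } from "@aws-sdk/client-ec2";

// Sketch of a launch request with an explicit gp2 EBS volume.
const launchWithEbs = new RunInstancesCommand({
  ImageId: "ami-00000000000000000", // placeholder
  InstanceType: "m5d.xlarge",
  MinCount: 1,
  MaxCount: 1000,
  BlockDeviceMappings: [
    {
      DeviceName: "/dev/xvda", // assumed root device name
      Ebs: { VolumeSize: 50, VolumeType: "gp2", DeleteOnTermination: true },
    },
  ],
});
// A request like this can only be satisfied in availability zones that have
// both m5d.xlarge capacity and gp2 support, which is what the error above
// is complaining about.
```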
Assignee
Comment 4•5 years ago
When I tried the default m1.small instances, the results for 1000 were more satisfactory, so apparently it depends on the instance type.
Provisioning: :17:41 to :17:42 (1 s). Registering: :20:42 to :28:53 (~8 min).
One instance was immediately terminated (Server.InternalError: Internal error on launch), and no replacement was spun up (Error calling AWS API: RequestLimitExceeded: Request limit exceeded in the logs).
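For illustration, a check-status pass over tracked instances could look roughly like the sketch below. This is not the provider's actual implementation; the region is a placeholder and findDeadInstances is a made-up helper name:

```ts
import { EC2Client, DescribeInstancesCommand } from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: "us-west-2" }); // placeholder region

// Look up the instances we believe we own and return any that are no longer
// pending/running (e.g. the one killed by Server.InternalError above), so a
// later provisioning iteration can replace them.
async function findDeadInstances(instanceIds: string[]): Promise<string[]> {
  const res = await ec2.send(
    new DescribeInstancesCommand({ InstanceIds: instanceIds })
  );
  const dead: string[] = [];
  for (const reservation of res.Reservations ?? []) {
    for (const instance of reservation.Instances ?? []) {
      const state = instance.State?.Name;
      if (instance.InstanceId && state !== "pending" && state !== "running") {
        dead.push(instance.InstanceId);
      }
    }
  }
  return dead;
}
```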
Reporter
Comment 5•5 years ago
Wow, that's good news so far! Is the RequestLimitExceeded recorded as a WorkerPoolError, too? Can you tell which request limit that is?
Assignee
Comment 6•5 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #5)
> Wow, that's good news so far! Is the RequestLimitExceeded recorded as a WorkerPoolError, too? Can you tell which request limit that is?

It wasn't (unless I cleaned up that table together with the workerpool and worker tables on Friday). I'll make sure it saves these types of errors there.
As for your second question, I'm currently investigating that. I'm also investigating a couple of other things that might help us with these errors.
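For what I mean by saving these errors, here is a minimal sketch. The row shape and the save callback are hypothetical stand-ins for whatever actually writes to the workerPoolError table; none of these names are the worker-manager's real API:

```ts
// Hypothetical shapes; the real worker-manager code may differ.
type WorkerPoolErrorRow = {
  workerPoolId: string;
  kind: string;
  title: string;
  description: string;
  reported: Date;
};

async function recordApiError(
  workerPoolId: string,
  err: unknown,
  save: (row: WorkerPoolErrorRow) => Promise<void>
): Promise<void> {
  const e = err as { name?: string; message?: string };
  // Persist RequestLimitExceeded / InsufficientInstanceCapacity and similar
  // AWS API failures instead of only logging them, so they show up as
  // worker pool errors.
  await save({
    workerPoolId,
    kind: e.name ?? "unknown-aws-error",
    title: "Error calling AWS API",
    description: e.message ?? String(err),
    reported: new Date(),
  });
}
```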
Assignee
Comment 7•5 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #5)
> Can you tell which request limit that is?

If you mean "which request", it's the runInstances endpoint (describeInstances has a higher limit, and I don't see errors from it). If you mean "which limit" as in how many calls per unit of time, I don't know exactly what it is, but I can bet a bottle of beer that the problem here is not the request limit per se. The real problem is that there aren't enough instances in the region: the first error is "Error calling AWS API: InsufficientInstanceCapacity: We currently do not have sufficient m5d.xlarge capacity in zones with support for 'gp2' volumes. Our system will be working on provisioning additional capacity.", and it is followed by "RequestLimitExceeded: Request limit exceeded" on the subsequent provisioning loop iterations. Apparently, retries after that particular error (InsufficientInstanceCapacity) have to be exponentially spaced.
I think there are several ways of dealing with this:
- Actually implement exponential retries for that particular error (which seems to complicate the code terribly) - see the sketch after this list
- Experiment with idempotent requests
- Experiment with the requestSpotFleet / requestSpotInstances endpoints
- Experiment with the spot options of runInstances
- See if this problem can be solved by talking to AWS support
I'm looking into them at the moment.
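To make the first two options concrete, here is a rough sketch of exponentially spaced retries combined with an idempotent ClientToken (AWS will not launch duplicate instances for a repeated RunInstances call that reuses the same token). This is a sketch under those assumptions, not the provider's actual code; the AMI, region, and retry parameters are placeholders:

```ts
import { EC2Client, RunInstancesCommand } from "@aws-sdk/client-ec2";
import { randomUUID } from "crypto";

const ec2 = new EC2Client({ region: "us-west-2" }); // placeholder region

const RETRYABLE = new Set(["RequestLimitExceeded", "InsufficientInstanceCapacity"]);

// Retry the throttled/out-of-capacity call with exponential spacing, reusing
// one ClientToken so retries are idempotent.
async function runInstancesWithRetry(count: number, maxAttempts = 5) {
  const clientToken = randomUUID();
  let delayMs = 2_000;
  for (let attempt = 1; ; attempt++) {
    try {
      return await ec2.send(
        new RunInstancesCommand({
          ImageId: "ami-00000000000000000", // placeholder
          InstanceType: "t2.nano",
          MinCount: 1,
          MaxCount: count,
          ClientToken: clientToken,
        })
      );
    } catch (err) {
      const name = (err as { name?: string }).name ?? "";
      if (attempt >= maxAttempts || !RETRYABLE.has(name)) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2; // exponential spacing between retries
    }
  }
}
```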
Assignee
Comment 8•5 years ago
...and the errors don't end up in the workerPoolError table for some reason; I'll debug that.
Assignee
Comment 9•5 years ago
I repeated the 1000-instance test with t2.nano instances. The AWS provider spun up around 5000 instances, mainly because registering takes longer than a provisioning loop iteration, so the loop keeps requesting more capacity before the earlier instances have registered.
Another error encountered was InsufficientFreeAddressesInSubnet: insufficient free addresses to allocate 1000 addresses in subnet. We need to make sure the load is spread over subnets/regions (see the sketch below). Along with this, we need to make sure the requests are idempotent, or something along those lines; otherwise it will be even worse than it is now.
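A sketch of what spreading a large request over several subnets could look like, so no single subnet has to supply all the addresses. The subnet IDs, AMI, and region are placeholders, and this is not the provider's current behavior:

```ts
import { EC2Client, RunInstancesCommand } from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: "us-west-2" }); // placeholder region

// Split the requested capacity across subnets; skip subnets that run out of
// free addresses and let a later provisioning iteration fill the gap.
async function provisionAcrossSubnets(total: number, subnetIds: string[]) {
  const perSubnet = Math.ceil(total / subnetIds.length);
  for (const subnetId of subnetIds) {
    try {
      await ec2.send(
        new RunInstancesCommand({
          ImageId: "ami-00000000000000000", // placeholder
          InstanceType: "t2.nano",
          MinCount: 1, // take whatever the subnet can give us
          MaxCount: perSubnet,
          SubnetId: subnetId,
        })
      );
    } catch (err) {
      const name = (err as { name?: string }).name ?? "";
      // Error-code check is approximate; anything else is re-thrown.
      if (name !== "InsufficientFreeAddressesInSubnet") throw err;
    }
  }
}
```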
I am suspending further testing to eliminate the bugs found. I'll link the tickets to this one.