Closed Bug 1577837 Opened 5 years ago Closed 5 years ago

[aws provider] load-test the aws provider

Categories

(Taskcluster :: Services, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: owlish)

References

Details

Verify that the aws provider can handle thousands of instances simultaneously.

Status: NEW → ASSIGNED
Summary: load-test the aws provider → [aws provider] load-test the aws provider

So it turns out we can only spin up 20 instances on that account in that region. I filed a limit increase request. So this ticket is on hold, waiting for support to reply...

lol they already approved it! that was fast

Results for 100 instances: they spun up OK, registered OK, and the status check also worked (I stopped 1 instance to verify it). This mainly exercised the provisioning loop and registration.
Provisioning: :44:32 to :44:32 (less than 1 s). Registering: :45:47 to :46:13 (~30 s).

Results for 1000 instances: "Error calling AWS API: RequestLimitExceeded: Request limit exceeded." in the provisioning loop.

Also, for the worker pool with EBS volumes: "Error calling AWS API: InsufficientInstanceCapacity: We currently do not have sufficient m5d.xlarge capacity in zones with support for 'gp2' volumes. Our system will be working on provisioning additional capacity."

When I tried default m1.small instances, the results for 1000 were more satisfactory, so apparently it depends on the instance type.
Provisioning: :17:41 to :17:42 (1 s). Registering: :20:42 to :28:53 (~8 min).
One instance was immediately terminated (Server.InternalError: Internal error on launch); no replacement was spun up ("Error calling AWS API: RequestLimitExceeded: Request limit exceeded." in the logs).

Wow, that's good news so far! Is the RequestLimitExceeded recorded as a WorkerPoolError, too? Can you tell which request limit that is?

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #5)
> Wow, that's good news so far! Is the RequestLimitExceeded recorded as a WorkerPoolError, too? Can you tell which request limit that is?

It wasn't (unless I cleaned up that table together with the workerpool and worker tables on Friday) - I'll make sure it saves these types of errors there.

As for your second question, I'm currently investigating that. I'm also investigating a couple of other things that might help us with these errors.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #5)
> Can you tell which request limit that is?

If you mean "which request", it's the runInstances endpoint (describeInstances has a higher limit, and I don't see errors from it). If you mean "which limit", as in how many calls per unit of time, I don't know exactly what it is, but I can bet a bottle of beer that the problem here is not the request limit per se; the real problem is that there aren't enough instances in the region. The first error is "Error calling AWS API: InsufficientInstanceCapacity: We currently do not have sufficient m5d.xlarge capacity in zones with support for 'gp2' volumes. Our system will be working on provisioning additional capacity.", and it is followed by "RequestLimitExceeded: Request limit exceeded" on the subsequent provisioning loop iterations. Apparently, retries after that particular error (InsufficientInstanceCapacity) have to be exponentially spaced.

I think there are several ways of dealing with this:

  • To actually implement exponential retries for that particular error (which seems to complicate the code considerably); see the sketch after this list
  • To experiment with idempotent requests (also touched on in that sketch)
  • To experiment with requestSpotFleet/requestSpotInstances endpoints
  • To experiment with spot options of runInstances
  • To see if this problem can be solved by talking to aws support

I'm looking into them at the moment
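For reference, here's a rough sketch of what exponential backoff plus an idempotency token around runInstances could look like. This is not the actual worker-manager code; the helper and the constants are made up, and only the aws-sdk calls (runInstances, the ClientToken parameter, AWSError.code) are real:

    import { EC2, AWSError } from 'aws-sdk';
    import { randomBytes } from 'crypto';

    const RETRYABLE = new Set(['RequestLimitExceeded', 'InsufficientInstanceCapacity']);

    // Hypothetical helper: retry runInstances with exponential backoff and a ClientToken.
    async function runInstancesWithBackoff(
      ec2: EC2,
      params: EC2.RunInstancesRequest,
      maxAttempts = 5,
    ): Promise<EC2.Reservation> {
      // ClientToken makes the call idempotent: a retried identical request returns
      // the same reservation instead of launching a second batch of instances.
      const request = { ...params, ClientToken: params.ClientToken ?? randomBytes(16).toString('hex') };

      for (let attempt = 0; ; attempt++) {
        try {
          return await ec2.runInstances(request).promise();
        } catch (err) {
          const e = err as AWSError;
          if (!RETRYABLE.has(e.code) || attempt + 1 >= maxAttempts) {
            throw e; // not retryable, or out of attempts: surface it (e.g. as a WorkerPoolError)
          }
          // Exponential backoff with jitter: ~1 s, 2 s, 4 s, ... plus up to 1 s of noise.
          const delayMs = 1000 * 2 ** attempt + Math.floor(Math.random() * 1000);
          await new Promise(resolve => setTimeout(resolve, delayMs));
        }
      }
    }

Note that the aws-sdk client can also retry throttling errors on its own (maxRetries / retryDelayOptions on the client config), which might be enough without hand-rolling a loop like this.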

...and the errors don't end up in the workerPoolError table for some reason; I'll debug that.

I repeated the 1000-instance test with t2.nano instances. The AWS provider spun up around 5000 instances, mainly because registering takes longer than a provisioning loop iteration, so the loop keeps requesting more capacity before the earlier instances have registered.
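To illustrate that over-provisioning (with hypothetical shapes, not the actual worker-manager data model): if the loop only counts registered workers, it keeps asking for more; counting requested-but-not-yet-registered instances as existing capacity closes the gap.

    // Hypothetical shapes, not the worker-manager data model.
    interface PoolState {
      desiredCapacity: number;    // how many workers the pool wants
      registeredWorkers: number;  // instances that have already registered
      pendingInstances: number;   // instances requested but not yet registered
    }

    function instancesToRequest(state: PoolState): number {
      // Count pending instances as existing capacity so the loop doesn't
      // re-request them while they are still booting and registering.
      const existing = state.registeredWorkers + state.pendingInstances;
      return Math.max(0, state.desiredCapacity - existing);
    }

    // 1000 desired, 0 registered yet, 1000 already requested -> request 0 more,
    // instead of piling up ~5000 instances over several loop iterations.
    console.log(instancesToRequest({ desiredCapacity: 1000, registeredWorkers: 0, pendingInstances: 1000 }));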

Another error encountered was "InsufficientFreeAddressesInSubnet: insufficient free addresses to allocate 1000 addresses in subnet". We need to make sure the load is spread across subnets/regions. Along with this, we need to make sure the requests are idempotent (or something along those lines); otherwise it would be even worse than it is now.
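As a rough idea of what "spread over subnets" could look like (the planning helper is hypothetical; describeSubnets and AvailableIpAddressCount are the real aws-sdk/EC2 pieces):

    import { EC2 } from 'aws-sdk';

    // Hypothetical planner: decide how many instances to request in each subnet,
    // skipping subnets that have run out of free addresses.
    async function planLaunchesAcrossSubnets(
      ec2: EC2,
      subnetIds: string[],
      toLaunch: number,
    ): Promise<Map<string, number>> {
      const { Subnets = [] } = await ec2.describeSubnets({ SubnetIds: subnetIds }).promise();
      const free = new Map<string, number>(
        Subnets.map(s => [s.SubnetId!, s.AvailableIpAddressCount ?? 0] as [string, number]),
      );

      const plan = new Map<string, number>();
      let remaining = toLaunch;
      while (remaining > 0) {
        const usable = subnetIds.filter(id => (free.get(id) ?? 0) > 0);
        if (usable.length === 0) break; // nowhere left to put instances; report the shortfall
        // Round-robin one instance at a time across the usable subnets.
        for (const id of usable) {
          if (remaining === 0) break;
          plan.set(id, (plan.get(id) ?? 0) + 1);
          free.set(id, (free.get(id) ?? 0) - 1);
          remaining--;
        }
      }
      return plan; // subnetId -> instance count, one runInstances call per subnet
    }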

I am suspending further testing to eliminate the bugs found. I'll link the tickets to this one.

Depends on: 1578900
Depends on: 1578902
Depends on: 1578904
Depends on: 1579554
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED