Closed Bug 1336536 Opened 7 years ago Closed 7 years ago

Autoscaling on CPU usage for Antenna -stage env

Categories

(Cloud Services :: Operations: Antenna, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: miles, Assigned: miles)

Details

As we approach a production-ready environment for Antenna, we need autoscaling in place to handle outsized load. Our preliminary analysis concluded that Antenna is predominantly CPU-bound, so our initial autoscaling configuration will be based on CPU load for Antenna instances.

Some points and observations:

1. AWS autoscaling should roughly evenly distribute incoming requests, so it should be reasonable to do this autoscaling based on average CPU load across the Antenna autoscaling group.

2. In our first-pass single-instance load test, we observed that high CPU usage (i.e. 100%) causes Antenna to fail to respond to requests to the ELB healthcheck endpoint (/__lbheartbeat__). Our ELB healthcheck determines which instances in the ASG are retained, so this is a problem: high load on an Antenna node can make it appear unhealthy, and an unhealthy node will be terminated. As such, we need a cooldown period for the autoscaling group before unhealthy instances are terminated. This is an available feature of AWS autoscaling.

3. If possible, an extension of #2 would result in the following behavior in case of high load:
  - Antenna is experiencing high load, e.g. average ASG CPU load of >60-70%.
  - Additional nodes are added one by one at an interval until average ASG CPU load decreases to an acceptable level (e.g. 50%)

  - In the case of a single Antenna node being overloaded to the point where it fails the healthcheck, the node will not receive connections until it is healthy again. The ASG should wait to kill the instance per #2, and the instance should recover to a healthy state after exhausting its queue. Exhausting the queue should not take long; however, it is important that the number of _healthy_ instances be the metric we scale on, not the total number of instances in the ASG, which could include unhealthy instances that are pending termination.
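The scale-out/scale-in behavior described in points 1-3 can be sketched as a small decision function. This is a hypothetical illustration, not Antenna's deployed configuration: the 70%/50% thresholds mirror the example numbers above, and the key point is that only _healthy_ instances feed the average.

```python
# Hypothetical sketch of the scaling decision discussed above.
# Only healthy instances contribute to the average CPU metric, so
# unhealthy instances pending termination do not dilute the signal.

def scaling_action(instances, scale_out_at=70.0, scale_in_at=50.0):
    """instances: list of (cpu_percent, is_healthy) tuples.
    Returns "scale_out", "scale_in", or "hold"."""
    healthy = [cpu for cpu, ok in instances if ok]
    if not healthy:
        # No healthy capacity left at all: add nodes immediately.
        return "scale_out"
    avg = sum(healthy) / len(healthy)
    if avg > scale_out_at:
        return "scale_out"
    if avg < scale_in_at:
        return "scale_in"
    return "hold"
```

Note that an unhealthy-but-not-yet-terminated instance at 100% CPU is excluded, so the remaining healthy nodes' load drives the decision.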

Immediate work:

  - Initial implementation of average-ASG-CPU-based autoscaling
  - Implement a cooldown period before instance termination for the ASG
  - Ensure that healthy instances in the ELB is the scaling factor for the ASG
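The ASG settings involved in the immediate work above can be sketched as the kwargs one might pass to boto3's `update_auto_scaling_group` call. The numbers here are placeholders, not the values deployed for Antenna; the parameter names are real boto3/AWS ones.

```python
# Sketch (with placeholder values) of the ASG knobs relevant to this bug,
# shaped as kwargs for boto3's autoscaling update_auto_scaling_group.

def asg_update_kwargs(asg_name, grace_seconds=300, cooldown_seconds=300):
    return {
        "AutoScalingGroupName": asg_name,
        # Time a freshly launched (or recovering) instance is exempt
        # from health-check evaluation before it can be marked unhealthy.
        "HealthCheckGracePeriod": grace_seconds,
        # Minimum time between successive scaling activities.
        "DefaultCooldown": cooldown_seconds,
        # Use the ELB health check, so "healthy" means "passing
        # /__lbheartbeat__", not merely "EC2 instance running".
        "HealthCheckType": "ELB",
    }
```

Note that AWS's `DefaultCooldown` throttles scaling activities, while `HealthCheckGracePeriod` is the closer match for "wait before treating an instance as terminable"; the two together approximate the cooldown behavior described in #2.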
(In reply to Miles Crabill [:miles] from comment #0)
> 1. AWS autoscaling should roughly evenly distribute incoming requests
Should be: ELB
Upon reading some AWS literature, I have reached the following conclusions:

  - With connection draining enabled, autoscaling events will wait for the _first_ of the following to happen before terminating an instance marked unhealthy:
    - completion of all in-flight requests (aka all open connections to Antenna being closed, which should only happen after all uploads to S3 have succeeded and Antenna has returned crash-ids)
    - ELB connection draining timeout (set to 30s going forward, 15s before)
      - this could be increased further

  - This means that if we have a case where an Antenna instance is unhealthy, i.e. not responding to the ELB healthcheck in a timely fashion (the unhealthy threshold is 2 failed checks), it has at most 30s to finish all of the work it is doing
    - In the event of S3 uploads failing and an un-uploaded crash queue building on an Antenna instance to the point where it is unhealthy, those crashes _could_ be lost if the connection draining timeout is hit

  - AWS autoscaling will terminate an unhealthy instance, and _then_ create a replacement instance

  - It is difficult / problematic to have an instance in an unhealthy state return to a healthy state before it is terminated (this goes against the model of AWS autoscaling, apparently)

  - Properly configured autoscaling on CPU usage is essential to the uptime of the service, because maxed-out CPU causes instances to fail to respond to ELB healthcheck requests, which marks them unhealthy and gets them terminated. If this were to happen to all instances at the same time, they would all be terminated and replaced, and the service would be down until the replacements came up properly.

  - From conversation, we are assuming that S3 uploads failing often enough to back up Antenna is unlikely enough that we are not going to plan for it (!). As such, the above is reasonable.
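The connection-draining behavior described above can be sketched as follows: on termination, the ELB waits for whichever comes first, all in-flight requests finishing or the draining timeout. This is an illustrative model, not ELB code; the 30s default matches the timeout mentioned in this bug.

```python
# Model of ELB connection draining: the wait ends at the *first* of
# "all in-flight requests done" or the draining timeout. Times are in
# seconds.

def drain_wait(inflight_finish_times, timeout=30.0):
    """Return (seconds actually waited, whether work was cut off)."""
    if not inflight_finish_times:
        return 0.0, False
    slowest = max(inflight_finish_times)
    if slowest <= timeout:
        return slowest, False   # everything drained in time
    return timeout, True        # timeout hit: remaining work may be lost
```

The second return value is the case called out above: if the timeout is hit while crashes are still un-uploaded, those crashes could be lost.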
Just to clarify, this statement isn't quite right:

>     - completion of all in-flight requests (aka all open connections to Antenna being closed, which should only happen after all uploads to S3 have succeeded and Antenna has returned crash-ids)

Antenna has an on_post method which accepts incoming crashes, puts them in a queue to be saved to S3, and then returns a crash id and ends the HTTP conversation. Later, a separate coroutine will try to save the crash to S3. Thus Antenna could have 0 incoming connections but still be working on saving crashes to S3.

I don't think that affects the plan, though.
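A minimal asyncio sketch of the pattern described in this comment, assuming invented names (on_post, s3_saver) rather than Antenna's real API: the handler enqueues the crash and returns a crash id immediately, while a separate coroutine drains the queue and performs the (simulated) S3 save.

```python
# Illustrative sketch, not Antenna's actual code: accept-and-return is
# decoupled from the S3 upload via a queue and a consumer coroutine.
import asyncio
import uuid

saved = []  # crash ids whose (simulated) S3 upload completed

async def on_post(queue, raw_crash):
    crash_id = uuid.uuid4().hex
    await queue.put((crash_id, raw_crash))  # HTTP response can go out now
    return crash_id

async def s3_saver(queue):
    while True:
        crash_id, raw_crash = await queue.get()
        await asyncio.sleep(0)              # stand-in for the real S3 upload
        saved.append(crash_id)
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(s3_saver(queue))
    ids = [await on_post(queue, b"fake-crash") for _ in range(3)]
    # At this point there are zero open HTTP conversations, yet uploads
    # may still be pending -- exactly the situation the comment describes.
    await queue.join()                      # wait for the saver to drain
    worker.cancel()
    return ids

crash_ids = asyncio.run(main())
```

This is why "all open connections closed" does not imply "all crashes uploaded": the queue can still hold work after every response has been sent.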
Ah, I remember having that discussion. In this particular case it makes it more likely for un-uploaded crashes to be on an unhealthy => terminating Antenna instance. That means it is even more imperative that Antenna instances with un-uploaded crashes do not fail health checks...
Autoscaling on CPU is live in stage and will be live in production. Marking this fixed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED