Closed Bug 1314740 Opened 8 years ago Closed 7 years ago

load test antenna

Categories

(Socorro :: Antenna, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: mbrandt)

Details

(Whiteboard: [e2e-tests])

We need to build a load test for Antenna that we can run periodically to understand Antenna's performance under load and also extrapolate what values we should use to coerce the system into autoscaling to handle load.

This issue covers figuring out what tools to use for load testing, deciding how we should set it up, and implementing it.
I had done some thinking on load testing already.

From https://github.com/mozilla/antenna/issues/85#issue-183429588:

> We can send the same crash over and over again. If we don't specify a uuid
> in the crash, it'll get saved with different crash ids. We should probably
> either send 50% uncompressed and 50% compressed (flip-flop between the
> two--not random) OR send them all compressed (that's the harder of the two).
> 
> We need Antenna to perform as well as the current Socorro collector. We can
> use the crashmover.save_raw_crash metrics from Datadog for Socorro to
> determine a baseline.
> 
> We can use the health check endpoints to determine how well Antenna is doing.
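
The flip-flop idea in the quote above could be sketched roughly like this. This is only an illustration: `alternating_payloads` is a hypothetical helper, the body is a placeholder rather than a real crash report, and it assumes the collector accepts gzip-compressed bodies flagged with a `Content-Encoding: gzip` header.

```python
import gzip
import itertools


def alternating_payloads(raw_body):
    """Yield (body, headers) pairs, strictly alternating between the
    uncompressed body and a gzip-compressed copy (not random)."""
    compressed = gzip.compress(raw_body)
    for use_gzip in itertools.cycle([False, True]):
        if use_gzip:
            # Assumption: compressed posts are marked with this header.
            yield compressed, {"Content-Encoding": "gzip"}
        else:
            yield raw_body, {}


# Usage: take as many payloads as the test needs.
payloads = alternating_payloads(b"placeholder crash report body" * 100)
body, headers = next(payloads)  # first one is uncompressed
```

Because the order is deterministic, any two runs of the same length send the same number of compressed and uncompressed posts, which makes runs easier to compare.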


From https://github.com/mozilla/antenna/issues/85#issuecomment-256405893:

> I talked with @rhelmer yesterday about this a bit. He mentioned this:
> 
> https://github.com/rhelmer/socorro-benchmark
> 
> He also mentioned that he'd use wrk (https://github.com/wg/wrk) this time
> around.
> 
> The goals for load testing should include:
> 
> 1. Can Antenna handle as many as the current collector?
> 2. What's the requests/min Antenna can handle? Try to saturate an Antenna
>    instance so that it's getting more incoming crashes than outgoing crashes
>    such that it can fill up and then OOM. We can use the requests/min to base
>    appropriate levels for production and what kicks off autoscaling.
> 3. Can Antenna handle "heavy load" for 24 hours? 48 hours? How does memory
>    usage look over a long period of time?
Assigning this to Matt to take the lead on.
Assignee: nobody → mbrandt
Miles mentioned goad might be interesting: https://goad.io/#demo
cc:ing Miles so he can follow along.
Whiteboard: [e2e-tests]
I was able to successfully set up goad.io to work with AWS Lambda. I'll work through setting up the scenarios noted in comment 1.
Where's this bug at? Do we have a functioning load test system, yet?

If so, I'd like to test it out against -dev this week to see how everything works.

If not, what can I do to move this along faster?
I'm just now coming back online. Let's schedule some time this week and run an initial complement of the tests.
Met with willkg on Tuesday and we reviewed the current layout of the load test suite. The tests currently use ailoads and the helper methods exposed in mini_poster.py.

The test suite is written and in a usable state. I will continue to clean up the code a bit next week and then land it in a repo, likely QA's. In January we can schedule a drop of the latest antenna code to stage and begin experimenting with gathering sizing data.

The current test scenarios that are ready include:

Test uploading compressed dumps in this range:
* 100k
* 150k

Test uploading uncompressed dumps in this range:
* 400k
* 4mb
* 20mb

Future nice-to-have work that we've identified, and that doesn't interfere with our current testing, includes:
* https://github.com/mozilla/antenna/issues/139
* https://github.com/mozilla/antenna/issues/140
Awesome! Thank you, Matt!
I did a light round of load testing last week to sanity check Antenna's architecture and find out how it behaves in regards to requests/min, memory usage and s3 queue size. I did a write-up:

https://docs.google.com/a/mozilla.com/document/d/1ugPq6CQjKrxYeH1sfDd8YhZm85r6-HHRvYD7zQaRMFQ/edit?usp=sharing

Based on that, I determined that Antenna's architecture is likely good enough to go forward.

While Matt was testing the load test system, we noticed Antenna would fall over pretty fast after nodes would get marked unhealthy. That led to some infrastructure changes. Plus we added tracking of how long it takes a crash to get handled by Antenna and the parts saved to s3.

We're at a point now where we're good for a load test.

The load test we want to run next should answer the following questions:

1. Is Antenna as configured equivalent to Socorro collector normal in regards to requests/min?

2. Is Antenna as configured equivalent to Socorro collector peak (3x) in regards to requests/min?

3. Is Antenna as configured equivalent to Socorro collector sla (10x) in regards to requests/min?

4. Can Antenna handle an hour of load at these various load points?

5. Is the current configuration of Antenna appropriate for production? If Antenna can handle 10x, then we should be fine.

6. Does Antenna scale appropriately? Does it add nodes before nodes become unhealthy and have to be replaced?


Given that, the load test we want to run next has the following properties:

1. Use a representative load similar to what Socorro collector is getting now.

That's probably something like:

   1. 400k uncompressed 85%
   2. 100k compressed 10%
   3. 1.5mb uncompressed 5%

This is roughly in line with what we're seeing in the crash report sizes dashboard in Datadog.

https://app.datadoghq.com/dash/244846/crashreport-sizes

2. Be able to send 1x load, 3x load and 10x load where x is 1500 requests/minute. Each request is an HTTP POST with a valid crash payload.

   * 1x load: 1500 requests/minute/node
   * 3x load: 4500 requests/minute/node
   * 10x load: 15000 requests/minute

Socorro normal day peak is 1500 requests/minute. Socorro currently has 6 collector nodes.

Example normal day:

https://app.datadoghq.com/dash/65215/socorro-prod?live=false&page=0&is_auto=false&from_ts=1486930838000&to_ts=1487031151320&tile_size=m

3. Be able to track the following:

   1. number of nodes in service
   2. number of healthy/unhealthy nodes
   3. average and max s3 queue size
   4. median, 95% and max "crash report time" (the total time it takes
      between antenna getting the crash report and saving the parts to s3)
   5. CPU per node
   6. memory usage per node
   7. number of requests being received by Antenna

Further, the load test system should be able to tell us:

   1. all the crash ids it sent out
   2. the total number of requests it sent over what period of time
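
As a rough sketch of the representative load in item 1, the 85/10/5 mix could be drawn like this. The labels and the `pick_payloads` helper are made up for illustration; a real test would map each label to an actual crash payload of that size.

```python
import random

# (label, weight in percent), taken from the mix listed in item 1.
PAYLOAD_MIX = [
    ("400k uncompressed", 85),
    ("100k compressed", 10),
    ("1.5mb uncompressed", 5),
]


def pick_payloads(count, seed=0):
    """Return `count` payload labels drawn according to PAYLOAD_MIX.

    A fixed seed keeps the sequence reproducible between runs.
    """
    rng = random.Random(seed)
    labels = [label for label, _ in PAYLOAD_MIX]
    weights = [weight for _, weight in PAYLOAD_MIX]
    return rng.choices(labels, weights=weights, k=count)


# Usage: one minute of 1x load is 1500 requests.
one_minute = pick_payloads(1500)
```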


I'm pretty sure that covers what we need at this point.

Matt: Does that make sense? Is this missing anything?
Flags: needinfo?(mbrandt)
Thanks willkg: This looks spot on in terms of what we've been discussing.

There's one area I'm still working out: I'm not sure how to capture the returned crashids. I've reached out to Tarek and he's helping me solve this.
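
For reference, one possible way to pull the crash id out of a response is sketched below. It assumes the collector answers a successful post with a text body of the form `CrashID=bp-<crash id>`; that format and the `parse_crash_id` helper are assumptions here, not something confirmed in this bug.

```python
def parse_crash_id(response_text):
    """Return the crash id from a 'CrashID=bp-...' response body,
    or None if the body doesn't look like that."""
    prefix = "CrashID="
    text = response_text.strip()
    if not text.startswith(prefix):
        return None
    crash_id = text[len(prefix):]
    # Assumption: ids come back with a leading "bp-" marker.
    if crash_id.startswith("bp-"):
        crash_id = crash_id[len("bp-"):]
    return crash_id or None


# Usage: collect ids as responses come back.
sent_crash_ids = []
crash_id = parse_crash_id("CrashID=bp-de1bb258-cbbf-4589-a673-34f800160918\n")
if crash_id is not None:
    sent_crash_ids.append(crash_id)
```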

Total number of requests is built into the reporting that ailoads provides. Once we switch to Molotov, this reporting won't exist, but I'm told we can use statsd to get at this data.

I've captured the testplan in this Google doc https://docs.google.com/document/d/15RVC-DcawSuWJP2eqI7wH_hau86c_TUx6AdgEEr_69s/edit?usp=sharing

The basic path that the testing is following:
1) Understand Antenna's behavior if deployed to a cluster size of one (with no autoscaling).
2) Once we understand what 1 node can handle, scale the cluster to a size that can handle 1x and then 10x the average load of prod's current collector.
Flags: needinfo?(mbrandt)
In comment #10, item 2 is messed up. A clearer item 2 is more like this:


2. Be able to send 1x load, 3x load and 10x load. Each request is an HTTP POST with a valid breakpad crash report payload.

   * 1x load:  1500 requests/minute
   * 3x load:  4500 requests/minute
   * 10x load: 15000 requests/minute

Socorro normal day peak is 1500 requests/minute across 6 collector nodes, so x is 1500 requests/minute and if you're testing against just one node, then divide by 6.

Sorry about that.
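
To spell out the divide-by-6 arithmetic, here is a small sketch. The constant and function names are just for illustration; the numbers are the ones given above.

```python
BASELINE_RPM = 1500    # Socorro normal day peak, requests/minute
COLLECTOR_NODES = 6    # current number of Socorro collector nodes


def per_node_rpm(multiplier, nodes=COLLECTOR_NODES):
    """Requests/minute a single node sees at the given load multiplier."""
    return BASELINE_RPM * multiplier / nodes


for multiplier in (1, 3, 10):
    total = BASELINE_RPM * multiplier
    print(f"{multiplier}x load: {total} requests/minute total, "
          f"{per_node_rpm(multiplier):.0f} requests/minute per node")
```

So a single-node test would target 250, 750 and 2500 requests/minute for 1x, 3x and 10x respectively.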
Moving this bug to fixed. We captured each test run and stack information in this document. For now, load testing is done. Well done, team, and thank you for helping pinpoint the testing needs and making a testing stack readily available.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Switching Antenna bugs to Antenna component.
Component: General → Antenna