Closed Bug 1093204 Opened 10 years ago Closed 9 years ago

[Tiles][Back-end] Observed load ceiling between 10K and 16.5K req/sec

Categories

(Content Services Graveyard :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: kthiessen, Assigned: mostlygeek)

Details

Links to graphs to be put in a comment further down.

I've been able to get 10K RPS steady-state, but with 30 c3.large load testers running siege we reach a 16.5K RPS peak, which does not sustain and then falls back to a <10K RPS steady state.

I'd like other members of the team to help me understand why.
Blocks: 1093202
Looking at the EC2 data: 

- we did not scale to the max instances of our autoscaling group (so plenty there) 
- ELBs started exhibiting backend issues / errors
- ELBs are just EC2 boxes; I believe we get 3 ELB nodes (1 per Availability Zone), so ~3.3K SSL req/sec per node (10K across 3) is pretty fast for a single EC2 server (even a big one) 

So some things we can try: 

- create multiple ELBs and use R53 to round robin between them (rough aws-cli sketch below)
  - suspect each load balancer will max out at 10K SSL req/sec 
  - we can make our EC2 servers part of multiple ELBs 
- not sure how this will affect the ASG health checks which depend on ELB data, but we'll see :)
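
Not our actual deployment code, just a minimal aws-cli sketch of those three bullets; every name, zone ID, and certificate ARN below is a placeholder:

# create an additional classic ELB (HTTPS in, HTTP to the backends)
aws elb create-load-balancer \
    --load-balancer-name tiles-stage-elb-1 \
    --listeners "Protocol=HTTPS,LoadBalancerPort=443,InstanceProtocol=HTTP,InstancePort=80,SSLCertificateId=arn:aws:iam::123456789012:server-certificate/tiles-stage" \
    --availability-zones us-east-1a us-east-1b us-east-1c

# "round robin" via Route 53: one weighted CNAME per ELB, same name, equal weights
aws route53 change-resource-record-sets --hosted-zone-id ZPLACEHOLDER --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "tiles.stage.mozaws.net.",
      "Type": "CNAME",
      "SetIdentifier": "elb-1",
      "Weight": 1,
      "TTL": 60,
      "ResourceRecords": [{"Value": "tiles-stage-elb-1-1234567890.us-east-1.elb.amazonaws.com"}]
    }
  }]
}'
# (repeat the UPSERT with a distinct SetIdentifier for each ELB)

# make the autoscaling group register its instances with every ELB
aws autoscaling attach-load-balancers \
    --auto-scaling-group-name tiles-stage-asg \
    --load-balancer-names tiles-stage-elb-0 tiles-stage-elb-1 tiles-stage-elb-2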
OK new stack deployed. 

- tiles.stage.mozaws.net now points at 3 ELBS (round robin DNS). 
- there are 3x c3.large app servers running 
- autoscaling group adds servers into all 3 ELBs

Load test away! :)
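
To sanity-check the DNS rotation from any of the load testers (plain dig, nothing stack-specific):

# run each a few times; more than one ELB hostname should show up across queries
dig +short CNAME tiles.stage.mozaws.net
dig +short tiles.stage.mozaws.net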
Beginning load test with 30 c3.larges at concurrency 10 (low), targeting 18K RPS for 60 minutes.
Command line:

./bees attack --use-siege -w 90m -c 360 -u 'https://tiles.stage.mozaws.net/v2/links/click POST {"locale":"en-US"}'

[Note that this is concurrency 12 (360/30) rather than concurrency 10 above.]
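For reference, each bee presumably ends up running something like the stock siege command below (assuming --use-siege maps -c 360 to 12 workers per bee and -w 90m to siege's -t; the flags here are plain siege, not bees, and the content-type is a guess for the JSON POST body):

siege -c 12 -t 90M \
      --content-type "application/json" \
      'https://tiles.stage.mozaws.net/v2/links/click POST {"locale":"en-US"}'
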

Test output:

INFO:root:27 of 30 clients succeeded.
Concurrency Level:      322
Complete requests:      79767134
Failed requests:        1193
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    14772.54 [#/sec] (mean)
Time per request:       20.000 [ms] (mean)
50% response time:      20 [ms] (mean)
75% response time:      20 [ms] (mean)
90% response time:      38 [ms] (mean)
95% response time:      53 [ms] (mean)
99% response time:      110 [ms] (mean)
INFO:root:The swarm is awaiting new orders.

That looks pretty close to the target.


Next up: 30 wasps, concurrency 10, 4 hours.  If that succeeds, I'll start a 6-hour test overnight.
After multiple attempts with 30, 40, and even 50 c3.large wasps, I am unable to sustain more than about 14K req/sec for longer than about 10 minutes.  I'm going to need someone on the server end to look into this and figure out where the bottleneck is.
Flags: needinfo?(oyiptong)
Flags: needinfo?(bwong)
30 bees, 4 hours:

INFO:root:29 of 30 clients succeeded.
Concurrency Level:      228
Complete requests:      207624303
Failed requests:        767
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    14418.54 [#/sec] (mean)
Time per request:       20.000 [ms] (mean)
50% response time:      10 [ms] (mean)
75% response time:      20 [ms] (mean)
90% response time:      20 [ms] (mean)
95% response time:      30 [ms] (mean)
99% response time:      57 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
40 bees, 4 hours:

INFO:root:39 of 40 clients succeeded.
Concurrency Level:      307
Complete requests:      243103162
Failed requests:        1922
Non-2xx responses:      0
Total Transferred:      167772 bytes
Requests per second:    16882.78 [#/sec] (mean)
Time per request:       20.000 [ms] (mean)
50% response time:      20 [ms] (mean)
75% response time:      20 [ms] (mean)
90% response time:      30 [ms] (mean)
95% response time:      30 [ms] (mean)
99% response time:      70 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
50 bees, 4 hours:

INFO:root:50 of 50 clients succeeded.
Concurrency Level:      397
Complete requests:      177875301
Failed requests:        372
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    12352.83 [#/sec] (mean)
Time per request:       30.000 [ms] (mean)
50% response time:      24 [ms] (mean)
75% response time:      40 [ms] (mean)
90% response time:      59 [ms] (mean)
95% response time:      74 [ms] (mean)
99% response time:      107 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
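Rough sanity check on those runs, treating throughput as roughly concurrency / mean response time: 307 / 0.020 s ≈ 15K req/sec and 397 / 0.030 s ≈ 13K req/sec, which is about what siege reports. Going from 40 to 50 bees raised concurrency by ~30% but mean latency by ~50%, so throughput actually fell; that pattern points at a server/ELB-side ceiling rather than a shortage of load generators.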
Is this with the multi-AZ load balancing, or the multi-ELB-to-one-application-server setup?
Flags: needinfo?(oyiptong)
Both multi-AZ and multi-ELB, as far as I know.  Ben?
Hmm... going with my gut here that: 

- we need MOAR ELBs, let's try with 5; suspect each ELB is good/safe for 6K to 8K SSL req/sec
- we need to set the max ASG size to 32 servers (one-liner below)
- since we're more CPU bound than network bound, leaving network concurrency on the EC2 onyx boxes as is

If we can't break through the 20K req/sec ceiling with 5 ELBs... then it's gotta be something else. 
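
For the ASG part of that list, the max-size bump is a one-liner (group name is a placeholder):

aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name tiles-stage-asg \
    --max-size 32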
Looking at the stackdriver graphs on stage (https://app.stackdriver.com/groups/11795/stage-tiles/webhead), it seems about 19K req/sec is a clear ceiling. 

So next step: MOAR ELBS!
Flags: needinfo?(bwong)
OK.  Go ahead and make that happen, ping me on IRC when it's done and I'll try again.
OK, we're at 6 ELBs now. Spent a bit of time updating the deployment code so it's easy for us to scale up to as many ELBs as we'd like.
This is not particularly encouraging:

20 bees, 4 hours:

INFO:root:20 of 20 clients succeeded.
Concurrency Level:      158
Complete requests:      97337335
Failed requests:        2572
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    6888.86 [#/sec] (mean)
Time per request:       20.051 [ms] (mean)
50% response time:      20 [ms] (mean)
75% response time:      25 [ms] (mean)
90% response time:      40 [ms] (mean)
95% response time:      49 [ms] (mean)
99% response time:      88 [ms] (mean)
INFO:root:The swarm is awaiting new orders.

Next up: 30 bees/4 hours, to see if we have linear scale.  I'm suspecting not.
FWIW: the ELB graphs show a lot of difference between the traffic levels of each ELB. This is likely due to the bees all getting similar results from DNS. Let's try with 100+ bees, which should give us more balanced traffic across all ELBs.
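
If we want numbers instead of eyeballing graphs, per-ELB request counts are in CloudWatch; something like this (timestamps are placeholders):

for lb in $(aws elb describe-load-balancers \
              --query 'LoadBalancerDescriptions[].LoadBalancerName' --output text); do
  echo "== $lb"
  aws cloudwatch get-metric-statistics \
      --namespace AWS/ELB --metric-name RequestCount \
      --dimensions Name=LoadBalancerName,Value="$lb" \
      --statistics Sum --period 300 \
      --start-time 2014-11-12T22:00:00Z --end-time 2014-11-12T23:00:00Z
done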
Launched 100 t2.medium bees at 22:52 UTC (14:52 Pacific).  The graphs look fundamentally similar to me.
Yes, it looks about the same to me as well. The load is still very spiky across all the ELBs. 
I attached a siege URL file. By using a URL file we should be able to avoid the DNS-based round robin, and hopefully the traffic will be less spiky. 

Just copy and paste this into a text file and run it like: 

$ bees attack --use-siege --url-file=urllist.txt

(copy/paste the text below into urllist.txt)


https://dualstack.tiles-stage-4-ELB0-1SBN8MHTMEL7R-1286036649.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB1-1CLM4R3TFYG9A-2014429211.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB2-1JHYFZ5EQUIUF-661441503.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB3-34KHIC5PJG3Q-268717506.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB4-1X5G3N2YSOCZG-965305875.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB5-1K0FEUB1PS2O8-242515712.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}

# GET / fetching calls
https://dualstack.tiles-stage-4-ELB0-1SBN8MHTMEL7R-1286036649.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB1-1CLM4R3TFYG9A-2014429211.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB2-1JHYFZ5EQUIUF-661441503.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB3-34KHIC5PJG3Q-268717506.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB4-1X5G3N2YSOCZG-965305875.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB5-1K0FEUB1PS2O8-242515712.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
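
One siege detail worth noting with a URL file: by default siege walks the file in order, while -i/--internet makes each simulated user pick a random URL, which should spread load across the six ELBs more evenly. Whether that flag can be passed through the bees --use-siege wrapper is an assumption on my part; with plain siege it would look like:

siege -c 12 -t 4H -i -f urllist.txt --content-type "application/json"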
Product: Mozilla Services → Content Services
No longer blocks: 1093202
How are we doing with actual load vs. tested load?  Do we need to do more testing to get more headroom?
Flags: needinfo?(oyiptong)
Flags: needinfo?(bwong)
We haven't gone above 9K req/sec. I'm not worried about it.
Flags: needinfo?(bwong)
We will increase fetch frequency at some point and it would be good to estimate by how much our load will increase and test if necessary.

Can we find out what percentage of the requests are fetches vs. pings?
If we don't have that information off-hand, we can compute it using disco.
Flags: needinfo?(oyiptong)
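Not the disco job, but a crude first cut straight off a webhead access log would also answer it (the log path is a guess; the two endpoints are the ones exercised above):

total=$(grep -c '/v2/links/' /var/log/nginx/access.log)
fetches=$(grep -c '/v2/links/fetch/' /var/log/nginx/access.log)
awk "BEGIN {printf \"fetch share: %.1f%%\n\", 100 * $fetches / $total}"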
oyiptong: do we have statsd metrics for fetches and pings? If not, we should. :)
Flags: needinfo?(oyiptong)
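If they don't exist yet: the statsd wire format is trivial, so the app (or even a quick test from a webhead) can emit counters like the lines below; the host, port, and metric names here are made up:

echo "tiles.links.fetch:1|c" | nc -u -w1 statsd.example.internal 8125
echo "tiles.links.click:1|c" | nc -u -w1 statsd.example.internal 8125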
Never mind. Looks like fetches are about 22% of all requests.
Flags: needinfo?(oyiptong)
Resolving this WONTFIX (and taking myself off of QA contact) as this work is unlikely to be needed.
Status: NEW → RESOLVED
Closed: 9 years ago
QA Contact: kthiessen → nobody
Resolution: --- → WONTFIX