[Tiles][Back-end] Observed load ceiling between 10K and 16.5K req/sec

RESOLVED WONTFIX

Status

Product: Content Services Graveyard
Component: General
Status: RESOLVED WONTFIX
Opened: 4 years ago
Closed: 3 years ago
People

(Reporter: kthiessen, Assigned: mostlygeek)

Tracking

Firefox Tracking Flags: (Not tracked)

(Reporter)

Description

4 years ago
Links to graphs to be put in a comment further down.

I've been able to get 10K RPS steady-state, but with 30 c3.large load testers running siege we hit a 16.5K RPS peak, which does not sustain, and then fall back down to a <10K RPS steady state.

I'd like other members of the team to help me understand why.
(Reporter)

Updated

4 years ago
Blocks: 1093202
(Assignee)

Comment 1

4 years ago
Looking at the EC2 data: 

- we did not scale to the max instances of our autoscaling group (so plenty there) 
- ELBs started exhibiting backend issues / errors
- ELBs are just EC2 boxes; I believe we get 3 ELB servers (1 per Availability Zone), so ~3.3K SSL req/sec is pretty fast for a single EC2 server (even a big one) 

So some things we can try: 

- create multiple ELBs and use R53 to round robin between them
  - suspect each load balancer will max out at 10K SSL req/sec 
  - we can make our EC2 servers part of multiple ELBs 
- not sure how this will affect the ASG health checks which depend on ELB data, but we'll see :)
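The per-ELB capacity guesses above lend themselves to a quick back-of-envelope check. A minimal sketch, using the rough throughput estimates from this comment (these are observed plateaus, not documented AWS limits):

```python
import math

def elbs_needed(target_rps, per_elb_rps):
    """How many load balancers a target request rate needs, assuming
    each ELB saturates at roughly per_elb_rps SSL req/sec."""
    return math.ceil(target_rps / per_elb_rps)

# At the ~3.3K SSL req/sec per-ELB plateau observed here, holding the
# 16.5K req/sec peak would take about 5 ELBs.
print(elbs_needed(16_500, 3_300))
```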
(Assignee)

Comment 2

4 years ago
OK new stack deployed. 

- tiles.stage.mozaws.net now points at 3 ELBs (round-robin DNS). 
- there are 3x c3.large app servers running 
- autoscaling group adds servers into all 3 ELBs

Load test away! :)
(Reporter)

Comment 3

4 years ago
Beginning load test with 30 c3.larges at concurrency 10 (low), targeting 18K RPS for 60 minutes.
(Reporter)

Comment 4

4 years ago
Command line:

./bees attack --use-siege -w 90m -c 360 -u 'https://tiles.stage.mozaws.net/v2/links/click POST {"locale":"en-US"}'

[Note that this is concurrency 12 (360/30) rather than concurrency 10 above.]

Test output:

INFO:root:27 of 30 clients succeeded.
Concurrency Level:      322
Complete requests:      79767134
Failed requests:        1193
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    14772.54 [#/sec] (mean)
Time per request:       20.000 [ms] (mean)
50% response time:      20 [ms] (mean)
75% response time:      20 [ms] (mean)
90% response time:      38 [ms] (mean)
95% response time:      53 [ms] (mean)
99% response time:      110 [ms] (mean)
INFO:root:The swarm is awaiting new orders.

That looks pretty close to the target.


Next up: 30 wasps, concurrency 10, 4 hours.  If that succeeds, I'll set up a 6-hour test overnight.
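As a sanity check on results like the one above, throughput, concurrency, and latency are tied together by Little's law. A rough model (it ignores client-side overhead and think time, so it gives an upper bound):

```python
def expected_rps(concurrency, mean_latency_s):
    """Little's law: throughput ~= requests in flight / mean time per request."""
    return concurrency / mean_latency_s

# 322 concurrent requests at ~20 ms each caps out around 16.1K req/sec,
# in the same ballpark as the measured 14.7K.
print(expected_rps(322, 0.020))
```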
(Reporter)

Comment 5

4 years ago
After multiple attempts with 30, 40, and even 50 c3.large wasps, I am unable to get more than about 14K req/sec sustained for more than about 10 minutes.  I'm going to need someone on the server end to look into this and figure out where the bottleneck is.
Flags: needinfo?(oyiptong)
Flags: needinfo?(bwong)
(Reporter)

Comment 6

4 years ago
30 bees, 4 hours:

INFO:root:29 of 30 clients succeeded.
Concurrency Level:      228
Complete requests:      207624303
Failed requests:        767
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    14418.54 [#/sec] (mean)
Time per request:       20.000 [ms] (mean)
50% response time:      10 [ms] (mean)
75% response time:      20 [ms] (mean)
90% response time:      20 [ms] (mean)
95% response time:      30 [ms] (mean)
99% response time:      57 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
(Reporter)

Comment 7

4 years ago
40 bees, 4 hours:

INFO:root:39 of 40 clients succeeded.
Concurrency Level:      307
Complete requests:      243103162
Failed requests:        1922
Non-2xx responses:      0
Total Transferred:      167772 bytes
Requests per second:    16882.78 [#/sec] (mean)
Time per request:       20.000 [ms] (mean)
50% response time:      20 [ms] (mean)
75% response time:      20 [ms] (mean)
90% response time:      30 [ms] (mean)
95% response time:      30 [ms] (mean)
99% response time:      70 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
(Reporter)

Comment 8

4 years ago
50 bees, 4 hours:

INFO:root:50 of 50 clients succeeded.
Concurrency Level:      397
Complete requests:      177875301
Failed requests:        372
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    12352.83 [#/sec] (mean)
Time per request:       30.000 [ms] (mean)
50% response time:      24 [ms] (mean)
75% response time:      40 [ms] (mean)
90% response time:      59 [ms] (mean)
95% response time:      74 [ms] (mean)
99% response time:      107 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
Comment 9

4 years ago
Is this with the multi-az load balancing or the multi-ELB to one application server LB?
Flags: needinfo?(oyiptong)
(Reporter)

Comment 10

4 years ago
Both multi-az and multi-elb, as far as I know.  Ben?
(Assignee)

Comment 11

4 years ago
Hmm... going with my gut here that: 

- we need MOAR ELBs, let's try with 5, suspect each ELB is good/safe for 6K to 8K SSL r/s
- we need to set the max ASG size to 32 servers
- since we're more CPU bound than network bound, leaving network concurrency on the EC2 onyx boxes as is

If we can't break through the 20K req/sec ceiling with 5 ELBs... then it's gotta be something else. 
Looking at the stackdriver graphs on stage: https://app.stackdriver.com/groups/11795/stage-tiles/webhead it seems about 19K/s is a clear ceiling. 

So next step: MOAR ELBS!
Flags: needinfo?(bwong)
(Reporter)

Comment 12

4 years ago
OK.  Go ahead and make that happen, ping me on IRC when it's done and I'll try again.
(Assignee)

Comment 13

4 years ago
OK we're at 6 ELBs now. Spent a bit of time updating the deployment code so it's easy for us to scale up ELBs to as many as we'd like.
(Reporter)

Comment 14

4 years ago
This is not particularly encouraging:

20 bees, 4 hours:

INFO:root:20 of 20 clients succeeded.
Concurrency Level:      158
Complete requests:      97337335
Failed requests:        2572
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    6888.86 [#/sec] (mean)
Time per request:       20.051 [ms] (mean)
50% response time:      20 [ms] (mean)
75% response time:      25 [ms] (mean)
90% response time:      40 [ms] (mean)
95% response time:      49 [ms] (mean)
99% response time:      88 [ms] (mean)
INFO:root:The swarm is awaiting new orders.

Next up: 30 bees/4 hours, to see if we have linear scale.  I'm suspecting not.
(Assignee)

Comment 15

4 years ago
FWIW: The ELB graphs show a lot of difference between the traffic levels of each ELB. This is likely due to the bees all getting similar results from DNS. Let's try with 100+ bees, which should give us more balanced traffic across all the ELBs.
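A toy simulation illustrates why a small swarm lands unevenly on round-robin DNS (the client and ELB counts here are chosen for illustration, not measured): each bee tends to stick with the first A record it resolves, so with few bees some ELBs collect several clients while others get almost none.

```python
import random
from collections import Counter

def simulate_dns_stickiness(n_clients, n_elbs, seed=42):
    """Each client picks one ELB at random (modelling a cached DNS
    answer) and then sends all of its traffic there."""
    rng = random.Random(seed)
    return Counter(rng.randrange(n_elbs) for _ in range(n_clients))

few = simulate_dns_stickiness(20, 6)    # 20 bees over 6 ELBs: typically uneven
many = simulate_dns_stickiness(100, 6)  # 100 bees: much closer to balanced
```

With more clients, the per-ELB counts converge toward equal shares, which is the reasoning behind trying 100+ bees.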
(Reporter)

Comment 16

4 years ago
Launched 100 t2.medium bees at 22:52 UTC (14:52 Pacific).  The graphs look fundamentally similar to me.
(Assignee)

Comment 17

4 years ago
Yes, it looks about the same to me as well. The load is still very spiky across all the ELBs. 
I attached a siege URL file. Using a URL file should let us avoid the DNS-based round robin, and hopefully the traffic will be less spiky. 

Just copy and paste this into a text file and run it like: 

$ bees attack --use-siege --url-file=urllist.txt

(copy/paste the text below into urllist.txt)


https://dualstack.tiles-stage-4-ELB0-1SBN8MHTMEL7R-1286036649.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB1-1CLM4R3TFYG9A-2014429211.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB2-1JHYFZ5EQUIUF-661441503.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB3-34KHIC5PJG3Q-268717506.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB4-1X5G3N2YSOCZG-965305875.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB5-1K0FEUB1PS2O8-242515712.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}

# GET / fetching calls
https://dualstack.tiles-stage-4-ELB0-1SBN8MHTMEL7R-1286036649.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB1-1CLM4R3TFYG9A-2014429211.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB2-1JHYFZ5EQUIUF-661441503.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB3-34KHIC5PJG3Q-268717506.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB4-1X5G3N2YSOCZG-965305875.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB5-1K0FEUB1PS2O8-242515712.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
Product: Mozilla Services → Content Services
(Reporter)

Updated

4 years ago
No longer blocks: 1093202
(Reporter)

Comment 18

3 years ago
How are we doing with actual load vs. tested load?  Do we need to do more testing to get more headroom?
Flags: needinfo?(oyiptong)
Flags: needinfo?(bwong)
(Assignee)

Comment 19

3 years ago
We haven't gone above 9K req/sec. I'm not worried about it.
Flags: needinfo?(bwong)

Comment 20

3 years ago
We will increase fetch frequency at some point and it would be good to estimate by how much our load will increase and test if necessary.

Can we find out what the percentage of the requests are fetch vs pings?
If we don't have that information off-hand, we can compute it using disco.
Flags: needinfo?(oyiptong)
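If the split isn't already in statsd, it can be computed from access-log request paths. A minimal sketch, assuming logs yield one URL path per request (the path patterns come from the endpoints exercised in this bug; the sample data is hypothetical):

```python
from collections import Counter

def fetch_share(paths):
    """Fraction of requests that are fetches rather than click/view
    pings, classified by URL path."""
    kinds = Counter("fetch" if "/links/fetch/" in p else "ping" for p in paths)
    total = sum(kinds.values())
    return kinds["fetch"] / total if total else 0.0

# Hypothetical sample: 22 fetches out of 100 requests.
sample = ["/v2/links/fetch/en-US"] * 22 + ["/v2/links/click"] * 78
print(fetch_share(sample))
```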
(Assignee)

Comment 21

3 years ago
oyiptong: do we have statsd metrics for fetches and pings? If not, we should. :)
Flags: needinfo?(oyiptong)
(Assignee)

Comment 22

3 years ago
Never mind. Looks like fetches are about 22% of all requests.
Flags: needinfo?(oyiptong)
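With the fetch share known, the load increase from a higher fetch frequency can be estimated directly. A first-order model that assumes ping volume stays constant and only fetch traffic scales:

```python
def load_multiplier(fetch_share, fetch_freq_factor):
    """Overall request-rate multiplier when only the fetch portion
    of traffic scales by fetch_freq_factor."""
    return (1 - fetch_share) + fetch_share * fetch_freq_factor

# At ~22% fetches, doubling the fetch frequency grows total load by ~22%.
print(load_multiplier(0.22, 2))
```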
(Reporter)

Comment 23

3 years ago
Resolving this WONTFIX (and taking myself off of QA contact) as this work is unlikely to be needed.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
QA Contact: kthiessen → nobody
Resolution: --- → WONTFIX