Closed Bug 1093204 Opened 10 years ago Closed 9 years ago

[Tiles][Back-end] Observed load ceiling between 10K and 16.5K req/sec

Categories

(Content Services Graveyard :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: kthiessen, Assigned: mostlygeek)

Details

Links to graphs to be put in a comment further down.

I've been able to get 10K RPS steady-state, but with 30 c3.large load testers running siege we reach a 16.5K RPS peak, which does not sustain and then falls back to a <10K RPS steady state.

I'd like other members of the team to help me understand why.
Blocks: 1093202
Looking at the EC2 data: 

- we did not scale to the max instances of our autoscaling group (so plenty there) 
- ELBs started exhibiting backend issues / errors
- ELBs are just EC2 boxes; I believe we get 3 ELB nodes (1 per Availability Zone), so ~3.3K SSL req/sec per node (10K across 3) is pretty fast for a single EC2 server (even a big one) 

So some things we can try: 

- create multiple ELBs and use R53 to round robin between them (rough aws-cli sketch below)
  - suspect each load balancer will max out at 10K SSL req/sec 
  - we can make our EC2 servers part of multiple ELBs 
- not sure how this will affect the ASG health checks which depend on ELB data, but we'll see :)
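
Not our actual deployment code, just a minimal aws-cli sketch of those three bullets; every name, zone ID, and certificate ARN below is a placeholder:

# create an additional classic ELB (HTTPS in, HTTP to the backends)
aws elb create-load-balancer \
    --load-balancer-name tiles-stage-elb-1 \
    --listeners "Protocol=HTTPS,LoadBalancerPort=443,InstanceProtocol=HTTP,InstancePort=80,SSLCertificateId=arn:aws:iam::123456789012:server-certificate/tiles-stage" \
    --availability-zones us-east-1a us-east-1b us-east-1c

# "round robin" via Route 53: one weighted CNAME per ELB, same name, equal weights
aws route53 change-resource-record-sets --hosted-zone-id ZPLACEHOLDER --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "tiles.stage.mozaws.net.",
      "Type": "CNAME",
      "SetIdentifier": "elb-1",
      "Weight": 1,
      "TTL": 60,
      "ResourceRecords": [{"Value": "tiles-stage-elb-1-1234567890.us-east-1.elb.amazonaws.com"}]
    }
  }]
}'
# (repeat the UPSERT with a distinct SetIdentifier for each ELB)

# make the autoscaling group register its instances with every ELB
aws autoscaling attach-load-balancers \
    --auto-scaling-group-name tiles-stage-asg \
    --load-balancer-names tiles-stage-elb-0 tiles-stage-elb-1 tiles-stage-elb-2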
OK new stack deployed. 

- tiles.stage.mozaws.net now points at 3 ELBS (round robin DNS). 
- there are 3x c3.large app servers running 
- autoscaling group adds servers into all 3 ELBs

Load test away! :)
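
To sanity-check the DNS rotation from any of the load testers (plain dig, nothing stack-specific):

# run each a few times; more than one ELB hostname should show up across queries
dig +short CNAME tiles.stage.mozaws.net
dig +short tiles.stage.mozaws.net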
Beginning load test with 30 c3.larges at concurrency 10 (low), targeting 18K RPS for 60 minutes.
Command line:

./bees attack --use-siege -w 90m -c 360 -u 'https://tiles.stage.mozaws.net/v2/links/click POST {"locale":"en-US"}'

[Note that this is concurrency 12 (360/30) rather than concurrency 10 above.]
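For reference, each bee presumably ends up running something like the stock siege command below (assuming --use-siege maps -c 360 to 12 workers per bee and -w 90m to siege's -t; the flags here are plain siege, not bees, and the content-type is a guess for the JSON POST body):

siege -c 12 -t 90M \
      --content-type "application/json" \
      'https://tiles.stage.mozaws.net/v2/links/click POST {"locale":"en-US"}'
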

Test output:

INFO:root:27 of 30 clients succeeded.
Concurrency Level:      322
Complete requests:      79767134
Failed requests:        1193
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    14772.54 [#/sec] (mean)
Time per request:       20.000 [ms] (mean)
50% response time:      20 [ms] (mean)
75% response time:      20 [ms] (mean)
90% response time:      38 [ms] (mean)
95% response time:      53 [ms] (mean)
99% response time:      110 [ms] (mean)
INFO:root:The swarm is awaiting new orders.

That looks pretty close to the target.


Next up: 30 wasps, concurrency 10, 4 hours.  If that succeeds, I'll start a 6-hour test overnight.
After multiple attempts with 30, 40, and even 50 c3.large wasps, I am unable to sustain more than about 14K req/sec for longer than about 10 minutes.  I'm going to need someone on the server end to look into this and figure out where the bottleneck is.
Flags: needinfo?(oyiptong)
Flags: needinfo?(bwong)
30 bees, 4 hours:

INFO:root:29 of 30 clients succeeded.
Concurrency Level:      228
Complete requests:      207624303
Failed requests:        767
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    14418.54 [#/sec] (mean)
Time per request:       20.000 [ms] (mean)
50% response time:      10 [ms] (mean)
75% response time:      20 [ms] (mean)
90% response time:      20 [ms] (mean)
95% response time:      30 [ms] (mean)
99% response time:      57 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
40 bees, 4 hours:

INFO:root:39 of 40 clients succeeded.
Concurrency Level:      307
Complete requests:      243103162
Failed requests:        1922
Non-2xx responses:      0
Total Transferred:      167772 bytes
Requests per second:    16882.78 [#/sec] (mean)
Time per request:       20.000 [ms] (mean)
50% response time:      20 [ms] (mean)
75% response time:      20 [ms] (mean)
90% response time:      30 [ms] (mean)
95% response time:      30 [ms] (mean)
99% response time:      70 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
50 bees, 4 hours:

INFO:root:50 of 50 clients succeeded.
Concurrency Level:      397
Complete requests:      177875301
Failed requests:        372
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    12352.83 [#/sec] (mean)
Time per request:       30.000 [ms] (mean)
50% response time:      24 [ms] (mean)
75% response time:      40 [ms] (mean)
90% response time:      59 [ms] (mean)
95% response time:      74 [ms] (mean)
99% response time:      107 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
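Rough sanity check on those runs, treating throughput as roughly concurrency / mean response time: 307 / 0.020 s ≈ 15K req/sec and 397 / 0.030 s ≈ 13K req/sec, which is about what siege reports. Going from 40 to 50 bees raised concurrency by ~30% but mean latency by ~50%, so throughput actually fell; that pattern points at a server/ELB-side ceiling rather than a shortage of load generators.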
Is this with the multi-AZ load balancing, or the multi-ELB-to-one-application-server setup?
Flags: needinfo?(oyiptong)
Both multi-AZ and multi-ELB, as far as I know.  Ben?
Hmm... going with my gut here that: 

- we need MOAR ELBs, let's try with 5; suspect each ELB is good/safe for 6K to 8K SSL req/sec
- we need to set the max ASG size to 32 servers (one-liner below)
- since we're more CPU bound than network bound, leaving network concurrency on the EC2 onyx boxes as is

If we can't break through the 20K req/sec ceiling with 5 ELBs... then it's gotta be something else. 
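
For the ASG part of that list, the max-size bump is a one-liner (group name is a placeholder):

aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name tiles-stage-asg \
    --max-size 32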
Looking at the stackdriver graphs on stage (https://app.stackdriver.com/groups/11795/stage-tiles/webhead), it seems about 19K req/sec is a clear ceiling. 

So next step: MOAR ELBS!
Flags: needinfo?(bwong)
OK.  Go ahead and make that happen, ping me on IRC when it's done and I'll try again.
OK, we're at 6 ELBs now. Spent a bit of time updating the deployment code so it's easy for us to scale up to as many ELBs as we'd like.
This is not particularly encouraging:

20 bees, 4 hours:

INFO:root:20 of 20 clients succeeded.
Concurrency Level:      158
Complete requests:      97337335
Failed requests:        2572
Non-2xx responses:      0
Total Transferred:      0 bytes
Requests per second:    6888.86 [#/sec] (mean)
Time per request:       20.051 [ms] (mean)
50% response time:      20 [ms] (mean)
75% response time:      25 [ms] (mean)
90% response time:      40 [ms] (mean)
95% response time:      49 [ms] (mean)
99% response time:      88 [ms] (mean)
INFO:root:The swarm is awaiting new orders.

Next up: 30 bees/4 hours, to see if we have linear scale.  I'm suspecting not.
FWIW: the ELB graphs show a lot of difference between the traffic levels of each ELB. This is likely due to the bees all getting similar results from DNS. Let's try with 100+ bees, which should give us more balanced traffic across all ELBs.
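
If we want numbers instead of eyeballing graphs, per-ELB request counts are in CloudWatch; something like this (timestamps are placeholders):

for lb in $(aws elb describe-load-balancers \
              --query 'LoadBalancerDescriptions[].LoadBalancerName' --output text); do
  echo "== $lb"
  aws cloudwatch get-metric-statistics \
      --namespace AWS/ELB --metric-name RequestCount \
      --dimensions Name=LoadBalancerName,Value="$lb" \
      --statistics Sum --period 300 \
      --start-time 2014-11-12T22:00:00Z --end-time 2014-11-12T23:00:00Z
done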
Launched 100 t2.medium bees at 22:52 UTC (14:52 Pacific).  The graphs look fundamentally similar to me.
Yes, it looks about the same to me as well. The load is still very spiky across all the ELBs. 
I attached a siege URL file. By using a URL file we should be able to avoid the DNS-based round robin, and hopefully the traffic will be less spiky. 

Just copy and paste this into a text file and run it like: 

$ bees attack --use-siege --url-file=urllist.txt

(copy/paste the text below into urllist.txt)


https://dualstack.tiles-stage-4-ELB0-1SBN8MHTMEL7R-1286036649.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB1-1CLM4R3TFYG9A-2014429211.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB2-1JHYFZ5EQUIUF-661441503.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB3-34KHIC5PJG3Q-268717506.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB4-1X5G3N2YSOCZG-965305875.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB5-1K0FEUB1PS2O8-242515712.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}

# GET / fetching calls
https://dualstack.tiles-stage-4-ELB0-1SBN8MHTMEL7R-1286036649.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB1-1CLM4R3TFYG9A-2014429211.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB2-1JHYFZ5EQUIUF-661441503.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB3-34KHIC5PJG3Q-268717506.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB4-1X5G3N2YSOCZG-965305875.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB5-1K0FEUB1PS2O8-242515712.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
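
One siege detail worth noting with a URL file: by default siege walks the file in order, while -i/--internet makes each simulated user pick a random URL, which should spread load across the six ELBs more evenly. Whether that flag can be passed through the bees --use-siege wrapper is an assumption on my part; with plain siege it would look like:

siege -c 12 -t 4H -i -f urllist.txt --content-type "application/json"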
Product: Mozilla Services → Content Services
No longer blocks: 1093202
How are we doing with actual load vs. tested load?  Do we need to do more testing to get more headroom?
Flags: needinfo?(oyiptong)
Flags: needinfo?(bwong)
We haven't gone above 9K req/sec. I'm not worried about it.
Flags: needinfo?(bwong)
We will increase fetch frequency at some point and it would be good to estimate by how much our load will increase and test if necessary.

Can we find out what percentage of the requests are fetches vs. pings?
If we don't have that information off-hand, we can compute it using disco.
Flags: needinfo?(oyiptong)
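Not the disco job, but a crude first cut straight off a webhead access log would also answer it (the log path is a guess; the two endpoints are the ones exercised above):

total=$(grep -c '/v2/links/' /var/log/nginx/access.log)
fetches=$(grep -c '/v2/links/fetch/' /var/log/nginx/access.log)
awk "BEGIN {printf \"fetch share: %.1f%%\n\", 100 * $fetches / $total}"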
oyiptong: do we have statsd metrics for fetches and pings? If not, we should. :)
Flags: needinfo?(oyiptong)
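If they don't exist yet: the statsd wire format is trivial, so the app (or even a quick test from a webhead) can emit counters like the lines below; the host, port, and metric names here are made up:

echo "tiles.links.fetch:1|c" | nc -u -w1 statsd.example.internal 8125
echo "tiles.links.click:1|c" | nc -u -w1 statsd.example.internal 8125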
Never mind. Looks like fetches are about 22% of all requests.
Flags: needinfo?(oyiptong)
Resolving this WONTFIX (and taking myself off of QA contact) as this work is unlikely to be needed.
Status: NEW → RESOLVED
Closed: 9 years ago
QA Contact: kthiessen → nobody
Resolution: --- → WONTFIX