Closed Bug 1093204 Opened 10 years ago Closed 9 years ago
[Tiles][Back-end] Observed load ceiling between 10K and 16.5K req/sec
Categories: Content Services Graveyard :: General, defect
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: kthiessen, Assigned: mostlygeek
Links to graphs will be put in a comment further down. I've been able to get 10K RPS steady-state, but with 30 c3.large load testers running siege, we reach a 16.5K RPS peak, which does not sustain, and then fall back down to a <10K RPS steady state. I'd like other members of the team to help me understand why.
Assignee
Comment 1 • 10 years ago
Looking at the EC2 data:
- we did not scale to the max instances of our autoscaling group (so plenty of headroom there)
- the ELBs started exhibiting backend issues / errors
- ELBs are just EC2 boxes; I believe we get 3 ELB servers (1 per Availability Zone), so ~3.3K SSL req/sec is pretty fast for a single EC2 server (even a big one)

So some things we can try:
- create multiple ELBs and use R53 to round-robin between them -- I suspect each load balancer will max out around 10K SSL req/sec
- make our EC2 servers part of multiple ELBs -- not sure how this will affect the ASG health checks, which depend on ELB data, but we'll see :)
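For the "register each app server with multiple ELBs" idea, a minimal boto3 sketch might look like the following. (This is just an illustration, not our deployment code; the load balancer names and instance ID are hypothetical placeholders.)

import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Hypothetical names/IDs, for illustration only.
load_balancers = ["tiles-stage-elb-0", "tiles-stage-elb-1", "tiles-stage-elb-2"]
instances = [{"InstanceId": "i-0123456789abcdef0"}]

# Register the same app servers with every classic ELB, so that
# R53 round-robin DNS spreads SSL termination across all of them.
for name in load_balancers:
    elb.register_instances_with_load_balancer(
        LoadBalancerName=name,
        Instances=instances,
    )

The open question about ASG health checks still stands; this only covers the registration side.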
Assignee
Comment 2 • 10 years ago
OK, new stack deployed:
- tiles.stage.mozaws.net now points at 3 ELBs (round-robin DNS)
- there are 3x c3.large app servers running
- the autoscaling group adds servers into all 3 ELBs

Load test away! :)
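One quick sanity check from a load-test client is to resolve the name a few times and confirm several A records come back from the round-robin DNS. A small Python check (the hostname is the one above; nothing else about the setup is assumed):

import socket

# Resolve the stage hostname and print every A record returned.
# With round-robin DNS across 3 ELBs we'd expect to see multiple IPs.
name, aliases, addrs = socket.gethostbyname_ex("tiles.stage.mozaws.net")
print(name, aliases)
for ip in addrs:
    print(ip)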
Reporter
Comment 3 • 10 years ago
Beginning load test with 30 c3.larges at concurrency 10 (low), targeting 18K RPS for 60 minutes.
Reporter
Comment 4 • 10 years ago
Command line:

./bees attack --use-siege -w 90m -c 360 -u 'https://tiles.stage.mozaws.net/v2/links/click POST {"locale":"en-US"}'

[Note that this is concurrency 12 (360/30) rather than concurrency 10 above.]

Test output:

INFO:root:27 of 30 clients succeeded.
Concurrency Level: 322
Complete requests: 79767134
Failed requests: 1193
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 14772.54 [#/sec] (mean)
Time per request: 20.000 [ms] (mean)
50% response time: 20 [ms] (mean)
75% response time: 20 [ms] (mean)
90% response time: 38 [ms] (mean)
95% response time: 53 [ms] (mean)
99% response time: 110 [ms] (mean)
INFO:root:The swarm is awaiting new orders.

That looks pretty close to the target. Next up: 30 wasps, concurrency 10, 4 hours. If that succeeds ok, I'll set a 6-hour test overnight.
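As a rough sanity check, sustained throughput should be close to concurrency divided by mean latency (Little's law). Plugging in the numbers from this run (a sketch, not part of the test output):

# Back-of-the-envelope: RPS ~= concurrency / mean latency (Little's law).
concurrency = 322        # effective concurrency reported above
mean_latency_s = 0.020   # 20 ms mean time per request
print(concurrency / mean_latency_s)   # ~16100, same ballpark as the 14772 measured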
Reporter
Comment 5 • 10 years ago
After multiple attempts with 30, 40, and even 50 c3.large wasps, I am unable to get more than about 14K req/sec sustained for more than about 10 minutes. I'm going to need someone on the server end to look into this and figure out where the bottleneck is.
Flags: needinfo?(oyiptong)
Flags: needinfo?(bwong)
Reporter
Comment 6 • 10 years ago
30 bees, 4 hours:

INFO:root:29 of 30 clients succeeded.
Concurrency Level: 228
Complete requests: 207624303
Failed requests: 767
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 14418.54 [#/sec] (mean)
Time per request: 20.000 [ms] (mean)
50% response time: 10 [ms] (mean)
75% response time: 20 [ms] (mean)
90% response time: 20 [ms] (mean)
95% response time: 30 [ms] (mean)
99% response time: 57 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
Reporter
Comment 7 • 10 years ago
40 bees, 4 hours:

INFO:root:39 of 40 clients succeeded.
Concurrency Level: 307
Complete requests: 243103162
Failed requests: 1922
Non-2xx responses: 0
Total Transferred: 167772 bytes
Requests per second: 16882.78 [#/sec] (mean)
Time per request: 20.000 [ms] (mean)
50% response time: 20 [ms] (mean)
75% response time: 20 [ms] (mean)
90% response time: 30 [ms] (mean)
95% response time: 30 [ms] (mean)
99% response time: 70 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
Reporter
Comment 8 • 10 years ago
50 bees, 4 hours:

INFO:root:50 of 50 clients succeeded.
Concurrency Level: 397
Complete requests: 177875301
Failed requests: 372
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 12352.83 [#/sec] (mean)
Time per request: 30.000 [ms] (mean)
50% response time: 24 [ms] (mean)
75% response time: 40 [ms] (mean)
90% response time: 59 [ms] (mean)
95% response time: 74 [ms] (mean)
99% response time: 107 [ms] (mean)
INFO:root:The swarm is awaiting new orders.
Comment 9 • 10 years ago
Is this with the multi-AZ load balancing, or the multi-ELB-to-one-set-of-application-servers setup?
Flags: needinfo?(oyiptong)
Reporter
Comment 10 • 10 years ago
Both multi-AZ and multi-ELB, as far as I know. Ben?
Assignee
Comment 11 • 10 years ago
Hmm... going with my gut here:
- we need MOAR ELBs -- let's try with 5; I suspect each ELB is good/safe for 6K to 8K SSL req/sec
- we need to set the max ASG size to 32 servers
- since we're more CPU bound than network bound, leave the network concurrency on the EC2 onyx boxes as is

If we can't break through the 20K req/sec ceiling with 5 ELBs... then it's gotta be something else. Looking at the stackdriver graphs on stage (https://app.stackdriver.com/groups/11795/stage-tiles/webhead), it seems about 19K/s is a clear ceiling. So next step: MOAR ELBS!
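For reference, bumping the ASG ceiling is a single Auto Scaling API call; a hedged boto3 sketch (the group name is a placeholder, not our real group):

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Raise the group's ceiling so the load test isn't capped by the ASG itself.
# "tiles-stage-asg" is a hypothetical name for illustration.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="tiles-stage-asg",
    MaxSize=32,
)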
Flags: needinfo?(bwong)
Reporter
Comment 12 • 10 years ago
OK. Go ahead and make that happen; ping me on IRC when it's done and I'll try again.
Assignee
Comment 13 • 10 years ago
OK, we're at 6 ELBs now. Spent a bit of time updating the deployment code so it's easy for us to scale up to as many ELBs as we'd like.
Reporter
Comment 14 • 10 years ago
This is not particularly encouraging.

20 bees, 4 hours:

INFO:root:20 of 20 clients succeeded.
Concurrency Level: 158
Complete requests: 97337335
Failed requests: 2572
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 6888.86 [#/sec] (mean)
Time per request: 20.051 [ms] (mean)
50% response time: 20 [ms] (mean)
75% response time: 25 [ms] (mean)
90% response time: 40 [ms] (mean)
95% response time: 49 [ms] (mean)
99% response time: 88 [ms] (mean)
INFO:root:The swarm is awaiting new orders.

Next up: 30 bees / 4 hours, to see if we scale linearly. I suspect not.
Assignee
Comment 15 • 10 years ago
FWIW: the ELB graphs show a lot of difference between the traffic levels of the individual ELBs. This is likely due to the bees all getting similar results from DNS. Let's try with 100+ bees, which should give us more balanced traffic across all ELBs.
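To see why more bees should help: if each bee effectively pins itself to one ELB via its single DNS lookup, the per-ELB share is a binomial draw, and the relative imbalance shrinks as the number of clients grows. A toy simulation (purely illustrative, not measured data):

import random

def spread(clients, elbs=6, trials=200):
    # Average ratio of busiest to least-busy ELB when each client pins to one ELB.
    ratios = []
    for _ in range(trials):
        counts = [0] * elbs
        for _ in range(clients):
            counts[random.randrange(elbs)] += 1
        ratios.append(max(counts) / float(max(min(counts), 1)))
    return sum(ratios) / trials

for n in (20, 30, 100):
    print(n, round(spread(n), 2))   # imbalance drops as the bee count grows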
Reporter
Comment 16 • 10 years ago
Launched 100 t2.medium bees at 22:52 UTC (14:52 Pacific). The graphs look fundamentally similar to me.
Assignee
Comment 17 • 10 years ago
Yes, it looks about the same to me as well. The load is still very spiky across all the ELBs.

I attached a siege URL file. By using a URL file we should be able to avoid the DNS-based round robin, and hopefully the traffic will be less spiky. Just copy and paste this into a text file and run it like:

$ bees attack --use-siege --url-file=urllist.txt

(copy/paste the text below into urllist.txt)

https://dualstack.tiles-stage-4-ELB0-1SBN8MHTMEL7R-1286036649.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB1-1CLM4R3TFYG9A-2014429211.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB2-1JHYFZ5EQUIUF-661441503.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB3-34KHIC5PJG3Q-268717506.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB4-1X5G3N2YSOCZG-965305875.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}
https://dualstack.tiles-stage-4-ELB5-1K0FEUB1PS2O8-242515712.us-east-1.elb.amazonaws.com/v2/links/click POST {"locale":"en-US"}

# GET / fetching calls
https://dualstack.tiles-stage-4-ELB0-1SBN8MHTMEL7R-1286036649.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB1-1CLM4R3TFYG9A-2014429211.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB2-1JHYFZ5EQUIUF-661441503.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB3-34KHIC5PJG3Q-268717506.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB4-1X5G3N2YSOCZG-965305875.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
https://dualstack.tiles-stage-4-ELB5-1K0FEUB1PS2O8-242515712.us-east-1.elb.amazonaws.com/v2/links/fetch/en-US
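If the ELB set changes again, the URL file can be regenerated instead of hand-edited. A small sketch (the hostnames shown are from the list above; the output filename is arbitrary):

# Regenerate urllist.txt from the current set of ELB hostnames.
hosts = [
    "dualstack.tiles-stage-4-ELB0-1SBN8MHTMEL7R-1286036649.us-east-1.elb.amazonaws.com",
    "dualstack.tiles-stage-4-ELB1-1CLM4R3TFYG9A-2014429211.us-east-1.elb.amazonaws.com",
    # ... remaining ELB hostnames from the list above
]
click_body = '{"locale":"en-US"}'
with open("urllist.txt", "w") as f:
    for h in hosts:
        f.write("https://%s/v2/links/click POST %s\n" % (h, click_body))
    f.write("# GET / fetching calls\n")
    for h in hosts:
        f.write("https://%s/v2/links/fetch/en-US\n" % h)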
Updated • 10 years ago
Product: Mozilla Services → Content Services
Reporter
Comment 18 • 9 years ago
How are we doing with actual load vs. tested load? Do we need to do more testing to get more headroom?
Flags: needinfo?(oyiptong)
Flags: needinfo?(bwong)
Assignee
Comment 19 • 9 years ago
We haven't gone above 9K req/sec. I'm not worried about it.
Flags: needinfo?(bwong)
Comment 20 • 9 years ago
We will increase fetch frequency at some point, and it would be good to estimate by how much our load will increase and test if necessary. Can we find out what percentage of requests are fetches vs. pings? If we don't have that information off-hand, we can compute it using disco.
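If we end up counting from raw access logs instead of disco, a minimal sketch would be something like the following (the log filename and the field position of the request path are assumptions, not our actual log format):

import collections

counts = collections.Counter()
with open("access.log") as f:          # hypothetical log file
    for line in f:
        parts = line.split()
        if len(parts) <= 6:
            continue
        path = parts[6]                # assumes a common/combined log format
        if path.startswith("/v2/links/fetch"):
            counts["fetch"] += 1
        elif path.startswith("/v2/links/"):
            counts["ping"] += 1

total = sum(counts.values()) or 1
for kind, n in counts.items():
    print(kind, n, "%.1f%%" % (100.0 * n / total))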
Flags: needinfo?(oyiptong)
Assignee
Comment 21 • 9 years ago
oyiptong: do we have statsd metrics for fetches and pings? If not, we should. :)
Flags: needinfo?(oyiptong)
Assignee
Comment 22 • 9 years ago
Never mind. Looks like fetches are about 22% of all requests.
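To tie that back to comment 20: if fetch frequency is multiplied by some factor f, total load should scale by roughly 1 + 0.22 * (f - 1). A trivial check of that arithmetic:

# Estimate total load if fetch frequency is multiplied by f,
# given fetches are ~22% of today's traffic.
fetch_share = 0.22
for f in (2, 3, 4):
    print(f, round(1 + fetch_share * (f - 1), 2))   # 2x fetches -> ~1.22x total load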
Flags: needinfo?(oyiptong)
Reporter
Comment 23 • 9 years ago
Resolving this WONTFIX (and taking myself off of QA contact) as this work is unlikely to be needed.
Status: NEW → RESOLVED
Closed: 9 years ago
QA Contact: kthiessen → nobody
Resolution: --- → WONTFIX