tune Bedrock mod_wsgi settings for faster startup

Status: RESOLVED FIXED
Product: Infrastructure & Operations Graveyard
Component: WebOps: Product Delivery
Reporter: jakem
Assignee: jakem
Opened: 4 years ago · Last modified: 2 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/572]

Description · jakem (Assignee) · 4 years ago
I've done some testing on this, and come to some interesting conclusions:


1) As expected, WSGIImportScript has no effect on startup times when both process-group and application-group are specified in the WSGIScriptAlias declaration, so there's no point in adding it. Conversely, removing those clauses from WSGIScriptAlias does show a small increase in loading time... so they do have a bit of value. (A sketch of the directives in question follows this list.)

2) The number of processes drastically affects the time it takes to start operating normally... more so than anything else.

3) *Fewer* processes seem able to handle more traffic in bedrock's case!
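
For reference, the directives in question look roughly like this. Paths, option values, and the process-group name are placeholders, not our actual config; the point is that once WSGIScriptAlias carries both process-group and application-group, mod_wsgi preloads the script at process startup, and WSGIImportScript adds nothing:

    # Placeholder paths and names, for illustration only.
    WSGIDaemonProcess bedrock processes=8 threads=1

    # With both process-group and application-group specified here, the
    # WSGI script is preloaded when each daemon process starts:
    WSGIScriptAlias / /data/bedrock/wsgi/bedrock.wsgi \
        process-group=bedrock application-group=%{GLOBAL}

    # Redundant given the above; preloading already happens:
    # WSGIImportScript /data/bedrock/wsgi/bedrock.wsgi \
    #     process-group=bedrock application-group=%{GLOBAL}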



I worked up a simple curl test that hits the homepage 50 times in parallel, ran it against a specific node right after restarting Apache, and timed how long it took all 50 requests to complete (a rough reconstruction of the test follows the numbers):

processes=16 - 24s (current)
processes=8  - 13s
processes=4  - 7s
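
For the record, a rough reconstruction of that test in Python (the hostname is a placeholder for the node under test, and it assumes curl is installed):

    # Rough reconstruction: fire 50 parallel curl requests at a freshly
    # restarted node and time how long all of them take to complete.
    # The URL is a placeholder, not a real node name.
    import subprocess
    import time

    URL = "http://node.example.com/"
    N = 50

    start = time.time()
    procs = [subprocess.Popen(["curl", "-s", "-o", "/dev/null", URL])
             for _ in range(N)]
    for p in procs:
        p.wait()
    print("%d requests completed in %.1fs" % (N, time.time() - start))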

I then ran benchmarks (httperf and ab) to verify capacity with 4, 8, and 16 processes. The results are pretty compelling... fewer processes lead to more consistent performance, and higher performance under heavier load. Conceptually, this makes sense... having more procs than CPU cores is likely to increase contention without adding anything. Here's a small selection of the results (I can provide more if needed):

16 processes:
Reply rate [replies/s]: min 78.2 avg 91.8 max 100.0 stddev 5.4 (21 samples)
Reply time [ms]: response 3725.8 transfer 14.7
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.4      0       4
Processing:    44  124  35.6    120     525
Waiting:       42  109  34.2    106     523
Total:         45  125  35.6    121     525

8 processes:
Reply rate [replies/s]: min 95.0 avg 99.4 max 103.8 stddev 2.7 (20 samples)
Reply time [ms]: response 165.4 transfer 6.1
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    24  118  27.1    115     398
Waiting:       22  112  26.8    109     396
Total:         25  118  27.1    115     399

4 processes:
Reply rate [replies/s]: min 97.2 avg 100.0 max 102.8 stddev 0.9 (20 samples)
Reply time [ms]: response 26.0 transfer 3.6
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    39  114  17.3    110     273
Waiting:       32  108  17.1    105     271
Total:         40  114  17.3    111     273


(suggestions in next comment)

Updated · 4 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/572]

Comment 1 · jakem (Assignee) · 4 years ago
I suggest:

1) We switch from 16 processes to 8. If New Relic shows no negatives, we should consider going even further, down to 4; testing suggests that would be better still. In PHX1, the Bedrock nodes are all 2-core VMs, so 4 is probably getting close to the sweet spot, if CPU is indeed the primary limiting factor.

My only concern with this is any "complicated" pages that involve disk or memory access, and are thus bound by I/O or memory rather than CPU. That's why I'm suggesting going to 8 first and checking for any negative impact.


2) We consider adding a simple loop into the Chief deploy scripts to "prime" the bedrock workers. Something like this, though I think we can do better:
https://github.com/mozilla/kitsune/blob/master/scripts/update/deploy.py#L92
This is a hack, but one that seems worthwhile. I'd rather prime the system up front by slamming it with curl than have our users suffer for much longer while all the workers eventually get up to speed through normal traffic.
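
Roughly what I have in mind, as a sketch (node names, the Host header, and the worker count are placeholders; the real version would hook into the Chief scripts):

    # Sketch of a deploy-time priming loop: hit each node once per
    # mod_wsgi worker so that (approximately) every worker process
    # imports the application before real traffic arrives.
    # Node names and the Host header are placeholders.
    import subprocess

    WEB_NODES = ["node1.example.com", "node2.example.com"]
    WORKERS_PER_NODE = 8  # should match the mod_wsgi processes= value

    for node in WEB_NODES:
        procs = [subprocess.Popen(["curl", "-s", "-o", "/dev/null",
                                   "-H", "Host: www.mozilla.org",
                                   "http://%s/" % node])
                 for _ in range(WORKERS_PER_NODE)]
        for p in procs:
            p.wait()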


3) We remove the unnecessary "WSGIProcessGroup" directive... we already specify this in WSGIScriptAlias, no need to add another line. This is just cleanup.
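
Sketched with the same placeholder paths and names as above, the cleanup is just dropping the duplicate line:

    # Before: the process group is named twice.
    WSGIProcessGroup bedrock
    WSGIScriptAlias / /data/bedrock/wsgi/bedrock.wsgi \
        process-group=bedrock application-group=%{GLOBAL}

    # After: WSGIScriptAlias alone carries the process group.
    WSGIScriptAlias / /data/bedrock/wsgi/bedrock.wsgi \
        process-group=bedrock application-group=%{GLOBAL}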



We can also explore the idea of doing fewer nodes at once (in the extreme, serial deployment instead of parallel), though I'm not convinced this will buy us anything unless it's also tied in with the load balancer removing and re-adding nodes as it goes. If we want to try this, I think it's worth a separate bug as that will involve a bit of research and some API coding work... it's not something we have boilerplate for yet.

Comment 2 · 4 years ago
(In reply to Jake Maul [:jakem] from comment #1)
> I suggest:
> 
> 1) We switch from 16 processes to 8. If New Relic shows no negatives, we
> should consider going even further, down to 4; testing suggests that would
> be better still. In PHX1, the Bedrock nodes are all 2-core VMs, so 4 is
> probably getting close to the sweet spot, if CPU is indeed the primary
> limiting factor.

+1. Setting the number of procs to twice the number of cores generally gives
the best performance in my experience as well. Do we have quad cores in the
SCL3 nodes? If so, can we do 4 procs in PHX1 and 8 in SCL3?

> My only concern with this is any "complicated" pages that involve disk or
> memory access, and are thus bound by I/O or memory rather than CPU. That's
> why I'm suggesting going to 8 first and checking for any negative impact.
> 

+1 for being conservative with the rate of change :)

> 2) We consider adding a simple loop into the Chief deploy scripts to "prime"
> the bedrock workers. Something like this, though I think we can do better:
> https://github.com/mozilla/kitsune/blob/master/scripts/update/deploy.py#L92
> This is a hack, but one that seems worthwhile. I'd rather prime the system
> up front by slamming it with curl than have our users suffer for much
> longer while all the workers eventually get up to speed through normal
> traffic.

If you don't mind, I'd like to discuss other options to see if we can come up
with a cleaner solution before pursuing this one.

 
> 3) We remove the unnecessary "WSGIProcessGroup" directive... we already
> specify this in WSGIScriptAlias, no need to add another line. This is just
> cleanup.

+1

> We can also explore the idea of doing fewer nodes at once (in the extreme,
> serial deployment instead of parallel), though I'm not convinced this will
> buy us anything unless it's also tied in with the load balancer removing and
> re-adding nodes as it goes. If we want to try this, I think it's worth a
> separate bug as that will involve a bit of research and some API coding
> work... it's not something we have boilerplate for yet.

Yes, I'd very much like to research and discuss this option further and would
be happy to do so in a separate bug. 


Thanks :jakem!

Updated · 4 years ago
Assignee: server-ops-webops → nmaul

Comment 3 · jakem (Assignee) · 4 years ago
Sadly, we can't currently (easily) do separate Apache configs for SCL3 and PHX1. SCL3 nodes have 8 cores each... they're Seamicro Xeon nodes. FWIW, the general trend has been that we trust that type of node less and less, and we P2V them if they ever get flaky... after which they usually have fewer cores. So I wouldn't get too attached to them.

I've committed the Puppet change to drop from 16 procs to 8.
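
The effective change, sketched with the placeholder daemon-process name from earlier (the real values live in Puppet):

    # Previously processes=16:
    WSGIDaemonProcess bedrock processes=8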


We'll wait a while and see how this works out for us. If all is well and deploys are indeed faster as they should be, we may try going down to 4 procs.
Comment 4 · jakem (Assignee) · 4 years ago
As far as I can tell we've had zero problems with 8 procs. Shall we drop down to 4 as planned? Per comment 0, this should further reduce the startup time from 13-14s to 6-7s, as well as providing more consistent (and slightly faster) performance.
Flags: needinfo?(jmize)

Comment 5 · 4 years ago
(In reply to Jake Maul [:jakem] from comment #4)
> As far as I can tell we've had zero problems with 8 procs. Shall we drop
> down to 4 as planned? Per comment 0, this should further reduce the startup
> time from 13-14s to 6-7s, as well as providing more consistent (and slightly
> faster) performance.

jakem: how sure are you that the issues we were seeing in bug 1044749 weren't exacerbated by dropping to 8 procs? Also, while the reduced startup time is nice, I'm not at all sure the performance gains would be universal: did you do all your benchmarking in PHX1, where we have only 2-core VMs, or did you also test in SCL3, where we have more cores per node?
Flags: needinfo?(jmize)
Comment 6 · jakem (Assignee) · 4 years ago
We're going to stop here.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard