Bug 1283180 (Closed) - Opened 8 years ago - Closed 8 years ago

Increase production capacity for support.mozilla.org

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgmize, Assigned: rwatson)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3162])

As discussed with :w0ts0n and :fox2mike in IRC, we are seeing more slowdowns and errors on sumo as traffic increases, and the current resources are insufficient to handle the load. We are hitting max CPU, so increasing the CPU count on the existing VMs and/or increasing the number of web nodes should help with this.
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3162]
How does +2 CPU per node sound?

The VM team has agreed to this change, and I believe this is the simplest way to increase capacity without a lot of work (vs. setting up new nodes).

Is that acceptable to you?
(In reply to Ryan Watson [:w0ts0n] from comment #1)
> How does +2 CPU per node sound?

works for me, thanks :)
As we continue to look at CPU utilization on these hosts, we're seeing things that make us question whether a simple CPU increase is going to actually *fix* things.  

Over the last month, we see a baseline of about 30-40% CPU utilization - and then three events (one on June 25th and two on June 28th) that show a sudden leap to 100% CPU.

And I do mean *sudden*.  The concern here is that this behavior is often caused by a misconfigured or erroring piece of software suddenly taking ANY and ALL resources it can get.  Adding 2 CPUs to that will simply allow it to eat those as well - with no improvement to user experience.  I've heard rumors that this service is moribund, possibly going away in Q4 - so I understand not wanting to spend real time on it - but if you have notes on why it's having these spikes, we'd love data to back up resource requests.

Note that we're still on board with more CPUs; I just want to make sure we're prepared if/when that doesn't solve the problem.  (If it does, YAY!)
Assignee: server-ops-webops → rwatson
(In reply to Chris Knowles [:cknowles] from comment #3)
> As we continue to look at CPU utilization on these hosts, we're seeing
> things that make us question whether a simple CPU increase is going to
> actually *fix* things.  
> 
> Over the last month, we see a baseline of about 30-40% CPU utilization -
> and then three events (one on June 25th and two on June 28th) that show a
> sudden leap to 100% CPU.
> 
> And I do mean *sudden*.  The concern here is that this behavior is often
> caused by a misconfigured or erroring piece of software suddenly taking
> ANY and ALL resources it can get.  Adding 2 CPUs to that will simply allow
> it to eat those as well - with no improvement to user experience.  I've
> heard rumors that this service is moribund, possibly going away in Q4 - so
> I understand not wanting to spend real time on it - but if you have notes
> on why it's having these spikes, we'd love data to back up resource requests.
> 
> Note that we're still on board with more CPUs; I just want to make sure
> we're prepared if/when that doesn't solve the problem.  (If it does, YAY!)

The only data I have is that Apache started eating up all the available CPU at the same time as a significant throughput increase. It's certainly possible that Apache config and/or other changes could let the site handle higher throughput without increasing the virtual resources dedicated to it, and I don't want to discourage anyone from taking a deeper look if they have the time and inclination. My understanding, though, is that the person-time involved would be harder to justify and more expensive than simply throwing more CPUs at the problem in the hope that it doesn't come back.
One Apache config change I would recommend is to double the number of Apache worker processes after doubling the number of CPUs - this may already have been part of your plans, but I thought it best to make it explicit here.
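
To make that concrete, here's roughly the kind of change I have in mind - a sketch only, assuming the prefork MPM, with purely illustrative numbers (the real values should come from the current config and the memory headroom on the web nodes):

  # Illustrative only - not the current production values.
  # (On Apache 2.2 the MaxRequestWorkers directive is spelled MaxClients.)
  <IfModule mpm_prefork_module>
      StartServers           10
      MinSpareServers        10
      MaxSpareServers        20
      ServerLimit           160   # must be >= MaxRequestWorkers
      MaxRequestWorkers     160   # e.g. double the current worker count
  </IfModule>

One caveat: each additional worker carries its own memory footprint, so it's worth confirming there is RAM headroom for a larger pool before bumping these.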
So, yes, people time is hard to come by, no doubt - the concern here is that if it is a worker process going nuts, then NO number of CPUs will keep you out of harm's way.

What I took from conversations with :w0ts0n on the 28th, during the second spike that day (which, thankfully, hadn't recurred as of this morning), was that on support1.webapp (and here, correct me if I'm wrong, w0ts0n) there were 45 requests and 81 idle workers - which doesn't sound like it was using all the worker processes available to it.  (That might just mean there are too many workers for the available CPU, but it might point to other concerns too.)
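
(For reference, busy/idle worker counts like those typically come from Apache's scoreboard via mod_status; if anyone wants to pull the same numbers from the other nodes, a minimal sketch - assuming mod_status is loaded, and with an illustrative location/access rule rather than whatever is actually deployed - looks like:

  ExtendedStatus On
  <Location /server-status>
      SetHandler server-status
      Require local      # Apache 2.4 syntax; 2.2 uses Order/Allow from
  </Location>

and then http://localhost/server-status?auto reports BusyWorkers / IdleWorkers in machine-readable form.)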

Again, we're going to go forward with the +2 CPU, but anything over 4 CPUs (where you are now) doesn't increase the abilities of the VM in the way you'd expect - it increases the overhead of the VM getting its slice of time to run on the hardware... so we can't throw CPUs at the problem indefinitely.
Alright, support{1-4}.webapp.phx1 have had their CPU count increased to 6, and all VM tracking sheets have been updated.
Work completed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard