Closed Bug 901547 Opened 11 years ago Closed 11 years ago

Make custom % SWAP alert for database systems we can assign temporarily.

Categories

(mozilla.org Graveyard :: Server Operations, task, P2)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bjohnson, Assigned: ericz)

References

Details

Swap alerted on buildbot1 today.

Since we can't restart the service or fail it over until an approved window (both cause a small "blip" of downtime that CAB has deemed unacceptable), we need a way to keep monitoring these no-downtime systems once their swap alert goes off, without either a) adjusting the swap alert for every host in the hostgroup, or b) ignoring the alert entirely by downtiming it (if swap usage gets worse, that's trouble).

The idea is to create a new service for a custom swap check where we can set the threshold in the definition and manually attach it to hosts that have swapped but can't have their service restarted immediately to clear it.

In these cases, we'll downtime the original swap alert, apply the custom swap alert, and create a ticket for CAB approval for the proper maintenance.
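
Roughly, the shape we have in mind is a single extra swap service whose thresholds we control, attached to a hostgroup that we add the affected hosts to by hand. A minimal sketch; the entry name, hostgroup name, and threshold values below are placeholders, not real config:

        # Placeholder sketch only -- names and threshold values are illustrative.
        # One extra swap service with its own thresholds, applied via a
        # manually-populated hostgroup rather than to the whole fleet.
        'check_swap_custom' => {
            check_command => 'check_swap!25%!15%',      # example thresholds only
            hostgroups    => 'swap-custom-exception',   # hosts added by hand
        },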
Blocks: 901558
Assignee: infra → server-ops
Component: Infrastructure: Monitoring → Server Operations
Product: Infrastructure & Operations → mozilla.org
QA Contact: jdow → shyam
Priority: P1 → P2
Applying a custom swap alert and customizing the value on that alert is not exactly trivial with our Nagios configs.  It touches at least a couple of files.  I can show you how to do it, but this doesn't seem like a slam dunk to me.  That said, I'm not aware of a well-worn path for these kinds of medium-term issues, so unless someone else has a better idea, I guess we'll give this a try.
(In reply to Brandon Johnson [:cyborgshadow] from comment #0)

> The idea is to create a new service for a custom swap check where we can
> set the threshold in the definition and manually attach it to hosts that
> have swapped but can't have their service restarted immediately to clear it.
> 
> In these cases, we'll downtime the original swap alert, apply the custom
> swap alert, and create a ticket for CAB approval for the proper maintenance.

Can we just downtime or ack these alerts instead? If you're already aware of swap being used and know that it needs action later, it seems like a lot of work to downtime one alert and add another.

Surely we can just ack and move on, rather than adding more work here?
The idea is to use the proper alert threshold for the proper machine. For example, if a backup or dev server routinely uses 70% or 80% of swap, that's OK. If it happens every other day, acking is a very "noisy" solution because everyone will still get paged.

Further, it's not OK if it uses 100% of swap, so acking is not really appropriate. What's appropriate is being able to alert at the level of swap usage that is actually a concern, per machine.

We do downtime the alerts when we have a known solution, e.g. downtiming buildbot for the few weeks before its maintenance window. But dev in phx and stage in scl3 page occasionally for swap, about once every 2-3 weeks, and it's really annoying to be woken up at 3 am for a non-issue because dev/stage is using 75% of swap.
:sheeri, you're making it sound like you want permanent levels for each machine or group of machines, is that right?  If you provide a list of swap warning and critical pairs, as well as which machines they apply to, I can get them set up in Nagios.
Assignee: server-ops → eziegenhorn
These 9 machines are the bulk of the noisy pages:

generic 1/2/3 - warning at 75%, critical at 85%

stage1/2 in scl3, dev1/2 in phx and scl3 - warning at 80%, critical at 90%
Oh, and backup1/2/3/4 in phx and backup3/4/5 in scl3 should also have warning at 80% and critical at 90%. So that's 16 machines total that should have different swap thresholds than the default.
Eric,

If the numbers vary, we should just make the check generic and pass the warning/critical values per host. But this might need some futzing around with the check or some re-writing :| 

Or do a different swap check that can have params (so we don't have to change all the million hosts we have :))
This may be getting away from the original intent here. What Shyam mentioned in comment 7 is exactly what we want.

The purpose of this check:
- SWAP has already alerted on a database node.
- We can't restart the service on the node to clear it because it requires CAB approval.
- We don't want the current swap alert to keep alerting; instead we want to alert on a new (and possibly changing) value of SWAP free.
I checked the swap check and it already has specifiable thresholds:

The generic one we apply to all servers looks like:
            check_command => 'check_swap!50%!25%',

So to customize it for a particular server, you make (or find) a hostgroup that contains only that server and make a copy of check_swap_generic in services.pp with the new swap values, applied to that hostgroup.  This is essentially what I proposed, and as I said, it's a bit involved.
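
For illustration, such a copy might look roughly like this. This is a sketch only: the entry and hostgroup names are hypothetical and the exact services.pp layout may differ; the check_command follows the existing check_swap!warning%!critical% pattern, where both values are percent of swap free:

        # Hypothetical sketch -- names are illustrative, layout may differ.
        # A copy of the generic swap service with relaxed thresholds,
        # scoped to a hostgroup containing only the affected server.
        'check_swap_buildbot1' => {
            check_command => 'check_swap!25%!15%',        # warn <25% free, crit <15% free
            hostgroups    => 'buildbot1-swap-exception',  # single-host hostgroup
        },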
We're OK with that proposal. I know it's involved, but do you have an idea of when you can get to it?
FWIW, while the theoretical ideal is a check where we can change the thresholds per-machine, there's no need to do the work required to refactor that. In practice, we really only need a few different levels and having these will be sufficient:

- the normal: warn at 50% free, crit at 25% free
- a "medium": warn at 25% free, crit at 15% free
- a "high usage": warn at 20% free, crit at 10% free
Created the db-swap-medium and db-swap-high hostgroups matching those thresholds in comment 11.  I applied them to the 16 machines in comment 5 and comment 6.

You probably already know this, but to be very explicit and clear: in the future, if you want to add e.g. the 25%/15% free swap "medium" threshold to a box, change it from this example:

puppet/trunk/modules/nagios/manifests/hosts/scl3.pp:

        'webdev1.db.scl3.mozilla.com' => {
            parents => 'boa-a1.r101-5.console.scl3.mozilla.com',
            hostgroups => [
                'hp-servers',
                'mysql2-puppetized-servers',
                'mysql-masters-nopage',
                'mysql-checksum',
                'mysql-rw'
            ]
        },

to this by just adding the db-swap-medium hostgroup and making sure everything is comma-separated:

        'webdev1.db.scl3.mozilla.com' => {
            parents => 'boa-a1.r101-5.console.scl3.mozilla.com',
            hostgroups => [
                'hp-servers',
                'mysql2-puppetized-servers',
                'mysql-masters-nopage',
                'mysql-checksum',
                'mysql-rw',
                'db-swap-medium',   # the newly added hostgroup
            ]
        },

The part most likely to go wrong is missing the comma after the next-to-last hostgroup, 'mysql-rw' in this example.
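
For completeness, the service side that pairs with those hostgroups would look roughly like this. Again a sketch only: the entry names and surrounding services.pp layout are illustrative; the thresholds are the ones from comment 11, written in the check_swap!warning%!critical% form shown earlier (percent of swap free):

        # Sketch only -- entry names and layout are illustrative.
        'check_swap_db_medium' => {
            check_command => 'check_swap!25%!15%',   # warn <25% free, crit <15% free
            hostgroups    => 'db-swap-medium',
        },
        'check_swap_db_high' => {
            check_command => 'check_swap!20%!10%',   # warn <20% free, crit <10% free
            hostgroups    => 'db-swap-high',
        },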
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard