Closed Bug 901547 Opened 11 years ago Closed 11 years ago

Make custom % SWAP alert for database systems we can assign temporarily.

Categories

(mozilla.org Graveyard :: Server Operations, task, P2)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bjohnson, Assigned: ericz)

References

Details

Swap alerted on buildbot1 today.

Since we can't restart the service or fail it over until an approved window (both cause a small "blip" of downtime that CAB has deemed unacceptable), we need a way to keep monitoring these no-downtime systems once their swap alert goes off, without either a) adjusting the swap alert for every host in the hostgroup, or b) ignoring the alert entirely by downtiming it (if swap usage gets worse, that's trouble).

The idea is to create a new service for a custom swap check where we can set the threshold in the definition and manually attach it to hosts that have swapped but can't have their service restarted immediately to clear it.

In these cases, we'll downtime the original swap alert, apply the custom swap alert, and create a ticket for CAB approval for the proper maintenance.
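
Roughly, the shape we have in mind is a single extra swap service whose thresholds we control, attached to a hostgroup that we add the affected hosts to by hand. A minimal sketch; the entry name, hostgroup name, and threshold values below are placeholders, not real config:

        # Placeholder sketch only -- names and threshold values are illustrative.
        # One extra swap service with its own thresholds, applied via a
        # manually-populated hostgroup rather than to the whole fleet.
        'check_swap_custom' => {
            check_command => 'check_swap!25%!15%',      # example thresholds only
            hostgroups    => 'swap-custom-exception',   # hosts added by hand
        },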
Blocks: 901558
Assignee: infra → server-ops
Component: Infrastructure: Monitoring → Server Operations
Product: Infrastructure & Operations → mozilla.org
QA Contact: jdow → shyam
Priority: P1 → P2
Applying a custom swap alert and customizing the value on that alert is not exactly trivial with our Nagios configs.  It touches at least a couple of files.  I can show you how to do it, but this doesn't seem like a slam dunk to me.  That said, I'm not aware of a well-worn path for these kinds of medium-term issues, so unless someone else has a better idea, I guess we'll give this a try.
(In reply to Brandon Johnson [:cyborgshadow] from comment #0)

> The idea is to create a new service for a custom swap check where we can
> set the threshold in the definition and manually attach it to hosts that
> have swapped but can't have their service restarted immediately to clear it.
> 
> In these cases, we'll downtime the original swap alert, apply the custom
> swap alert, and create a ticket for CAB approval for the proper maintenance.

Can we just downtime or ack these alerts instead? If you're already aware of swap being used and know that it needs action later, it seems like a lot of work to downtime one alert and add another.

Surely we can just ack and move on, rather than adding more work here?
The idea is to use the proper alert threshold for the proper machine. For example, if a backup or dev server routinely uses 70% or 80% of swap, that's OK. If it happens every other day, acking is a very "noisy" solution because everyone will still get paged.

Further, it's not OK if it uses 100% of swap, so acking is not really appropriate. What's appropriate is being able to alert at the level of swap usage that is actually a concern, per machine.

We do downtime the alerts when we have a known solution, e.g. downtiming buildbot for the few weeks before its maintenance window. But dev in phx and stage in scl3 page occasionally for swap, about once every 2-3 weeks, and it's really annoying to be woken up at 3 am for a non-issue because dev/stage is using 75% of swap.
:sheeri, you're making it sound like you want permanent levels for each machine or group of machines, is that right?  If you provide a list of swap warning and critical pairs, as well as which machines they apply to, I can get them set up in Nagios.
Assignee: server-ops → eziegenhorn
These 9 machines are the bulk of the noisy pages:

generic 1/2/3 - warning at 75%, critical at 85%

stage1/2 in scl3, dev1/2 in phx and scl3 - warning at 80%, critical at 90%
Oh, and backup1/2/3/4 in phx and backup3/4/5 in scl3 should also have warning at 80% and critical at 90%. So that's 16 machines total that should have different swap thresholds than the default.
Eric,

If the numbers vary, we should just make the check generic and pass the warning/critical values per host. But this might need some futzing around with the check or some re-writing :| 

Or do a different swap check that can have params (so we don't have to change all the million hosts we have :))
This may be getting away from the original intent here. What Shyam mentioned in comment 7 is exactly what we want.

The purpose of this check:
- SWAP has already alerted on a database node.
- We can't restart the service on the node to clear it because it requires CAB approval.
- We don't want the current swap alert to keep alerting; instead we want to alert on a new (and possibly changing) value of SWAP free.
I checked the swap check and it already has specifiable thresholds:

The generic one we apply to all servers looks like:
            check_command => 'check_swap!50%!25%',

So to customize it for a particular server, you make (or find) a hostgroup that contains only that server and make a copy of check_swap_generic in services.pp with the new swap values, applied to that hostgroup.  This is essentially what I proposed, and as I said, it's a bit involved.
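
For illustration, such a copy might look roughly like this. This is a sketch only: the entry and hostgroup names are hypothetical and the exact services.pp layout may differ; the check_command follows the existing check_swap!warning%!critical% pattern, where both values are percent of swap free:

        # Hypothetical sketch -- names are illustrative, layout may differ.
        # A copy of the generic swap service with relaxed thresholds,
        # scoped to a hostgroup containing only the affected server.
        'check_swap_buildbot1' => {
            check_command => 'check_swap!25%!15%',        # warn <25% free, crit <15% free
            hostgroups    => 'buildbot1-swap-exception',  # single-host hostgroup
        },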
We're OK with that proposal. I know it's involved, but do you have an idea of when you can get to it?
FWIW, while the theoretical ideal is a check where we can change the thresholds per-machine, there's no need to do the work required to refactor that. In practice, we really only need a few different levels and having these will be sufficient:

- the normal: warn at 50% free, crit at 25% free
- a "medium": warn at 25% free, crit at 15% free
- a "high usage": warn at 20% free, crit at 10% free
Created the db-swap-medium and db-swap-high hostgroups matching those thresholds in comment 11.  I applied them to the 16 machines in comment 5 and comment 6.

You probably already know this, but to be very explicit and clear: in the future, if you want to add e.g. the 25%/15% free swap "medium" threshold to a box, change it from this example:

puppet/trunk/modules/nagios/manifests/hosts/scl3.pp:

        'webdev1.db.scl3.mozilla.com' => {
            parents => 'boa-a1.r101-5.console.scl3.mozilla.com',
            hostgroups => [
                'hp-servers',
                'mysql2-puppetized-servers',
                'mysql-masters-nopage',
                'mysql-checksum',
                'mysql-rw'
            ]
        },

to this by just adding the db-swap-medium hostgroup and making sure everything is comma-separated:

        'webdev1.db.scl3.mozilla.com' => {
            parents => 'boa-a1.r101-5.console.scl3.mozilla.com',
            hostgroups => [
                'hp-servers',
                'mysql2-puppetized-servers',
                'mysql-masters-nopage',
                'mysql-checksum',
                'mysql-rw',
                'db-swap-medium',   # the newly added hostgroup
            ]
        },

The part most likely to go wrong is missing the comma after the next-to-last hostgroup, 'mysql-rw' in this example.
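
For completeness, the service side that pairs with those hostgroups would look roughly like this. Again a sketch only: the entry names and surrounding services.pp layout are illustrative; the thresholds are the ones from comment 11, written in the check_swap!warning%!critical% form shown earlier (percent of swap free):

        # Sketch only -- entry names and layout are illustrative.
        'check_swap_db_medium' => {
            check_command => 'check_swap!25%!15%',   # warn <25% free, crit <15% free
            hostgroups    => 'db-swap-medium',
        },
        'check_swap_db_high' => {
            check_command => 'check_swap!20%!10%',   # warn <20% free, crit <10% free
            hostgroups    => 'db-swap-high',
        },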
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard