Please stop hardware alerts from paging DBAs

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
RESOLVED FIXED
6 years ago
3 years ago

People

(Reporter: sheeri, Assigned: ashish)

(Reporter)

Description

6 years ago
Can we please get the HP log check to only e-mail us (infra-dbnotices should be e-mailed) in case of warnings on database machines such as b1-db1.db.scl3.mozilla.com? A database machine here means any machine that pages the DBA team.

Note well - the WARNING should e-mail, the CRITICAL should still page.

As evidence, pages I received on Sunday April 29th:
3:08 am - HP log on b1-db1.db.scl3.mozilla.com: Service Check Timed Out
3:18 am - HP log on b1-db1.db.scl3.mozilla.com: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
4:13 am - HP log on b1-db1.db.scl3.mozilla.com: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
5:18 am - HP log on b1-db1.db.scl3.mozilla.com: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
6:18 am - HP log on b1-db1.db.scl3.mozilla.com: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
7:18 am - HP log on b1-db1.db.scl3.mozilla.com: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
8:22 am - HP log on b1-db1.db.scl3.mozilla.com: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
9:22 am - HP log on b1-db1.db.scl3.mozilla.com: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
9:22 am Aj acknowledged
Assignee: server-ops → rtucker
There is no way that I know of to alter this behavior for just these hosts as the checks are maintained via puppet.

CCing jabba

Comment 2

6 years ago
Could probably change the hplog_mysql service definition to just page oncall and e-mail dbnotices. If you want the warning to e-mail and critical to page, you'll need to create two new definitions, one with notification options "w" and one with options "c", with different contactgroups for each, and apply them both, I think. Alternatively, one could perhaps change the notification options for the infra@db-notices account and the dbadmins account to only accept "w" and "c" respectively, but I think that is global for all checks. Or just remove the hplog_mysql service, use the regular hplog service, and let oncall deal with it?
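For illustration, the two-definition approach described above might look roughly like the following in Nagios object configuration. This is only a sketch under assumptions: the template generic-service, the host group mysql-servers, the command check_hplog, and the service descriptions are hypothetical placeholders, and treating infra-dbnotices and dbadmins as contact group names is assumed rather than taken from the actual puppet-managed config.

```
# Hypothetical sketch: every name except notification_options values is a placeholder.

# Definition 1: WARNING states only, routed to the e-mail contact group.
define service {
    use                     generic-service            ; assumed template name
    hostgroup_name          mysql-servers              ; assumed host group
    service_description     HP Log (warning, e-mail)   ; assumed description
    check_command           check_hplog                ; assumed command name
    notification_options    w                          ; notify on WARNING only
    contact_groups          infra-dbnotices            ; assumed e-mail-only group
}

# Definition 2: CRITICAL states only, routed to the paging contact group.
define service {
    use                     generic-service
    hostgroup_name          mysql-servers
    service_description     HP Log (critical, page)
    check_command           check_hplog
    notification_options    c                          ; notify on CRITICAL only
    contact_groups          dbadmins                   ; assumed paging group
}
```

With a split like this the same check runs twice per host, and neither definition sends recovery notifications unless "r" is added to its notification_options.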
sheeri?
My vote would be to remove the hplog_mysql check until the issue with these drives is resolved.
(Reporter)

Comment 4

6 years ago
Jabba's suggestion to e-mail warnings and page for criticals is what I'd go for. The problem with disabling the check is that warnings are false positives, but criticals are real. That's why I haven't just downtime'd the check itself.
sheeri,
What is the requirement for you getting alerted on a hardware warning? The SREs will get paged regardless.
(Reporter)

Comment 6

6 years ago
The requirement is that the check go CRITICAL. If the check goes WARNING then e-mail is enough.
sheeri,
Three options, your choice:
1. disable the specific dbadmins check and leave the check that alerts sysadmins in place until the drive issue gets fixed.
2. leave it as is
3. I can punt this back to the SREs to take care of. Dumitru assigned this directly to me for some unknown reason. SREs handle nagios now.

I do not have the cycles right now to set up a duplicate check and split out the contact groups and contact options just for the alerts going to you. As long as the SREs get the alerts we're protected; you getting them is just a bonus.
(Reporter)

Comment 8

6 years ago
Punt to the SREs, please.

Updated

6 years ago
Assignee: rtucker → server-ops

Comment 9

6 years ago
Please stop paging the DBAs about hardware. Make this go only to on-call.
(Assignee)

Comment 10

6 years ago
For now, I've removed dbadmins from the alerts for hplog, hp health and hp raid in scl3. They will now all page the oncall vs. irc-only.

:sheeri - as of now, all checks for databases alert the DBA too. Here is a list of all checks, please list only the ones that should page the DBA:

b1-db1.db.scl3.mozilla.com:DB Disk - All is OK - DISK OK - free space: / 111374 MB (52% inode=99%): /dev/shm 12007 MB (100% inode=99%): /boot 40 MB (43% inode=99%):
b1-db1.db.scl3.mozilla.com:DB Load is OK - OK - load average: 0.28, 0.34, 0.35
b1-db1.db.scl3.mozilla.com:DB SSH is OK - SSH OK - OpenSSH_5.3 (protocol 2.0)
b1-db1.db.scl3.mozilla.com:DB Swap is OK - SWAP OK - 97% free (1984 MB out of 2047 MB)
b1-db1.db.scl3.mozilla.com:DB Time Sync is OK - NTP OK: Offset 0.001994967461 secs
b1-db1.db.scl3.mozilla.com:HP Health is OK - OK - System: 'proliant bl460c g7', S/N: 'MXQ20209L9', ROM: 'I27 05/05/2011', hardware working fine
b1-db1.db.scl3.mozilla.com:HP Log is WARNING - WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
b1-db1.db.scl3.mozilla.com:HP RAID is OK - RAID OK:  Smart Array P410i in Slot 0 (Embedded) array A logicaldrive 1 (223.5 GB, RAID 1, OK) [Controller Status: OK Cache Status: OK Battery/Capacitor Status: OK]
b1-db1.db.scl3.mozilla.com:MySQL is OK - Uptime: 1432736  Threads: 626  Questions: 631741513  Slow queries: 1860  Opens: 7244  Flush tables: 3 Open tables: 1024  Queries per second avg: 440.933
b1-db1.db.scl3.mozilla.com:PING is OK - PING OK - Packet loss = 0%, RTA = 0.83 ms
Status: NEW → ASSIGNED
(Assignee)

Updated

6 years ago
Summary: Please stop the HP log warning from paging → Please stop hardware alerts from paging DBAs
(Reporter)

Comment 11

6 years ago
SREs will still get paged, so I'm going to note that the check should still be separated out so warnings e-mail and criticals page (or maybe warnings page during business hours?). You're still paging people for a false positive until you do that.

As for my own little world, here's what the dbadmins (both myself and mpressman) should be paged on:


b1-db1.db.scl3.mozilla.com:DB Disk - All is OK - DISK OK - free space: / 111374 MB (52% inode=99%): /dev/shm 12007 MB (100% inode=99%): /boot 40 MB (43% inode=99%):
b1-db1.db.scl3.mozilla.com:DB Load is OK - OK - load average: 0.28, 0.34, 0.35
b1-db1.db.scl3.mozilla.com:DB SSH is OK - SSH OK - OpenSSH_5.3 (protocol 2.0)
b1-db1.db.scl3.mozilla.com:DB Swap is OK - SWAP OK - 97% free (1984 MB out of 2047 MB)
b1-db1.db.scl3.mozilla.com:DB Time Sync is OK - NTP OK: Offset 0.001994967461 secs
b1-db1.db.scl3.mozilla.com:MySQL is OK - Uptime: 1432736  Threads: 626  Questions: 631741513  Slow queries: 1860  Opens: 7244  Flush tables: 3 Open tables: 1024  Queries per second avg: 440.933
b1-db1.db.scl3.mozilla.com:PING is OK - PING OK - Packet loss = 0%, RTA = 0.83 ms

So basically not paging us on the HP stuff is perfect (and could be global).
(Assignee)

Updated

6 years ago
Assignee: server-ops → ashish
(Assignee)

Comment 12

6 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #11)
> So basically not paging us on the HP stuff is perfect (and could be
> global).

All is done here, then. DBAs will no longer get paged for any HP stuff. The behaviour for the remaining checks stays as-is.
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard