Closed Bug 778797 Opened 8 years ago Closed 8 years ago

Nagios checks for socorro staging

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

x86
macOS
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bburton, Assigned: ashish)

Due to ongoing issues with socorro staging, we want to add Nagios checks to try and correlate failures we're seeing from application services and Zeus monitors.

Postgres

Add the checks currently defined for tp-socorro01-master01.phx1.mozilla.com to the following:

    socorro1.stage.db.phx1 - https://inventory.mozilla.org/en-US/systems/show/1550/

Processors

Add the checks currently defined for sp-processor01.phx1.mozilla.com to the following:

    socorro-processor1.stage.metrics.phx1 - https://inventory.mozilla.org/en-US/systems/show/3921/
    socorro-processor2.stage.metrics.phx1 - https://inventory.mozilla.org/en-US/systems/show/3923/

HBase

Add the checks currently defined for hp-node01.phx1.mozilla.com to the following:

    10.8.100.62:9090 - hp-node62.phx1 - https://inventory.mozilla.org/en-US/systems/show/2568/
    10.8.100.63:9090 - hp-node63.phx1 - https://inventory.mozilla.org/en-US/systems/show/2569/
    +10.8.100.64:9090 - hp-node64.phx1 - https://inventory.mozilla.org/en-US/systems/show/2570/
    10.8.100.65:9090 - hp-node65.phx1 - https://inventory.mozilla.org/en-US/systems/show/2571/
    10.8.100.66:9090 - hp-node66.phx1 - https://inventory.mozilla.org/en-US/systems/show/2572/
    10.8.100.67:9090 - hp-node67.phx1 - https://inventory.mozilla.org/en-US/systems/show/2573/
    +10.8.100.68:9090 - hp-node68.phx1 - https://inventory.mozilla.org/en-US/systems/show/2574/
    10.8.100.69:9090 - hp-node69.phx1 - https://inventory.mozilla.org/en-US/systems/show/2575/
Blocks: 771218
Assignee: server-ops → mburns
Postgres:
    socorro1.stage.db.phx1 has the additional group "postgres-pgdata", which I've left in place. I added the missing groups that tp-socorro01-master01.phx1 had ("pg-servers", "pgbouncer-servers", "pgmaster-servers").


Processor:
    socorro-processor{1,2}.stage.metrics.phx1 already match sp-processor01.phx1


HBase:
    Added the "thrift-nodes" group to all hp-node6{2-9}.phx1.

    Additionally, hp-node6{5,6,7}.phx1 are part of the "elasticsearch" group; I added "hadoop-nodes" to them.
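
The group comparison described above for the Postgres host amounts to a set difference between the reference host's groups and the staging host's groups. A minimal sketch follows; the group names come from the comment, but treating them as each host's complete membership is an assumption made for illustration, and `group_diff` is a hypothetical helper, not part of Nagios or Puppet.

```python
def group_diff(reference, target):
    """Return (missing, extra) host groups of target relative to reference."""
    reference, target = set(reference), set(target)
    return sorted(reference - target), sorted(target - reference)


# Group names are taken from the comment above. Treating these lists as the
# complete membership of each host is an assumption for illustration only.
prod_groups = ["pg-servers", "pgbouncer-servers", "pgmaster-servers"]
stage_groups = ["postgres-pgdata"]

missing, extra = group_diff(prod_groups, stage_groups)
print("add to stage:", missing)
print("stage-only (left in place):", extra)
```

Groups in `missing` are the ones to add to the staging host; groups in `extra` exist only on staging and can be left alone.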
svn ci -m "added a socorro1.stage.db.phx1 check for check_postgres_replicate_row"
Sending        nagios/manifests/mozilla/checkcommands.pp
Transmitting file data .
Committed revision 44063.

This should make all hosts equal in Nagios' eyes.

Let me know if I ruined/missed anything.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
16:04:35 < nagios-phx1> | [118] hp-node68.phx1.mozilla.com:Zeus - Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1
16:12:05 < nagios-phx1> | [123] hp-node64.phx1.mozilla.com:Zeus - Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1

Also, the check_postgres_replicate_row check for socorro1.stage.db.phx1 is incorrect. What is the right slave db for socorro1.stage.db.phx1?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Ashish Vijayaram [:ashish] from comment #3)
> Also, the check_postgres_replicate_row check for socorro1.stage.db.phx1 is
> incorrect. What is the right slave db for socorro1.stage.db.phx1?

I've also removed the duplicate check_postgres_replicate_row check since it would generate false alerts from prod (tp-socorro01-master01).
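
For context, a row-replication check works roughly like this: change a row on the master, then poll the slave until the change appears or a timeout expires. The sketch below is not the check_postgres implementation. The function name, the dictionaries standing in for databases, and the timeout values are all hypothetical. It only illustrates why the check needs the right slave for socorro1.stage.db.phx1: paired with a slave that follows a different master, the write never shows up and the check fails.

```python
import time


def replicate_row_ok(write_master, read_slave, new_value, timeout=5.0, interval=0.1):
    """Write new_value via write_master, then poll read_slave until it shows up.

    Returns True if the slave reports new_value before the timeout, else False.
    """
    write_master(new_value)
    deadline = time.monotonic() + timeout
    while True:
        if read_slave() == new_value:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical stand-ins: one dict plays a master/slave pair that replicates
# instantly, the other plays a slave that belongs to a different master.
stage_pair = {"val": "old"}
unrelated_slave = {"val": "old"}


def write_stage(value):
    stage_pair["val"] = value


# Correct pairing: the slave sees the master's write.
paired_result = replicate_row_ok(
    write_stage, lambda: stage_pair["val"], "new-1", timeout=0.5
)

# Wrong pairing: the write never reaches this slave, so the check fails.
mispaired_result = replicate_row_ok(
    write_stage, lambda: unrelated_slave["val"], "new-2", timeout=0.3
)

print("paired:", paired_result)
print("mispaired:", mispaired_result)
```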
Assignee: mburns → server-ops-infra
Component: Server Operations → Server Operations: Infrastructure
QA Contact: phong → jdow
Assignee: server-ops-infra → ashish
socorro1.stage.db.phx1.mozilla.com is missing ganglia plugins that are present in tp-socorro01-master01.phx1.mozilla.com. This is causing all ganglia-based postgres checks to fail. Given that puppet is disabled, I'll have to manually copy over the plugins to fix the alerts.
Status: REOPENED → ASSIGNED
All the ganglia-based checks recovered and are green:

23:49:40 < nagios-phx1> ashish: socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Connections is OK - CHECKGANGLIA OK: pg_connections is 0.00 Last Checked: 2012-08-06 23:47:59 PDT
23:49:40 < nagios-phx1> ashish: socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Jobs in Queue is OK - CHECKGANGLIA OK: jobs_in_queue is 0.00 Last Checked: 2012-08-06 23:47:59 PDT
23:49:41 < nagios-phx1> ashish: socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Last Reports Update is OK - CHECKGANGLIA OK: last_record_reports is 11.00 Last Checked: 2012-08-06 23:47:59 PDT
23:49:41 < nagios-phx1> ashish: socorro1.stage.db.phx1.mozilla.com:Ganglia - PostgreSQL Performance Test Query is OK - CHECKGANGLIA OK: pg_timed_query is 0.00 Last Checked: 2012-08-06 23:47:58 PDT
23:49:43 < nagios-phx1> ashish: socorro1.stage.db.phx1.mozilla.com:Ganglia - pgBouncer connections is OK - CHECKGANGLIA OK: pgb_clients is 0.00 Last Checked: 2012-08-06 23:47:58 PDT
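
Since the goal of this bug is to correlate alerts with failures seen elsewhere, bot lines like the ones above can be turned into structured records. This is a rough sketch, assuming the line layout seen in this bug (host, colon, service name, " is ", state); the regular expression and the `parse_alert` helper are hypothetical and have not been checked against other Nagios bot output.

```python
import re

# Pattern inferred from the log lines in this bug; other formats are untested.
ALERT_RE = re.compile(
    r"(?P<host>[a-z0-9-]+(?:\.[a-z0-9-]+)+):(?P<service>.+?) is "
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN)\b"
    r"(?:.*?CHECKGANGLIA \w+: (?P<metric>\w+) is (?P<value>[\d.]+))?"
)


def parse_alert(line):
    """Extract host, service, state and (if present) the ganglia metric."""
    m = ALERT_RE.search(line)
    if m is None:
        return None
    value = m.group("value")
    return {
        "host": m.group("host"),
        "service": m.group("service"),
        "state": m.group("state"),
        "metric": m.group("metric"),
        "value": float(value) if value is not None else None,
    }


ganglia_line = (
    "23:49:40 < nagios-phx1> ashish: socorro1.stage.db.phx1.mozilla.com:"
    "Ganglia - PostgreSQL Connections is OK - CHECKGANGLIA OK: "
    "pg_connections is 0.00 Last Checked: 2012-08-06 23:47:59 PDT"
)
zeus_line = (
    "16:04:35 < nagios-phx1> | [118] hp-node68.phx1.mozilla.com:Zeus - Port 9090 "
    "is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1"
)

ganglia_alert = parse_alert(ganglia_line)
zeus_alert = parse_alert(zeus_line)
print(ganglia_alert)
print(zeus_alert)
```

Lines without a CHECKGANGLIA section, such as the Zeus port alerts, come back with `metric` and `value` set to `None`.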
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Not sure if the following was fixed (as it alerted again today);
< nagios-phx1> | Tue 19:08:38 PDT [170] hp-node68.phx1.mozilla.com:Zeus - Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1

and checking the nagios GUI, the same goes for hp-node64.phx1.mozilla.com
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Adrian Fernandez [:Aj] from comment #7)
> Not sure if the following was fixed (as it alerted again today);
> < nagios-phx1> | Tue 19:08:38 PDT [170] hp-node68.phx1.mozilla.com:Zeus -
> Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at
> error-index 1
> 
> and checking the nagios GUI, the same goes for hp-node64.phx1.mozilla.com

Yep, I checked, and these two hosts are disabled in the Zeus pool. This alert is expected in such an event. In fact, the corresponding virtual server is in a "Stopped" state.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations