Closed Bug 822415 Opened 12 years ago Closed 12 years ago

Change the monitor for Socorro replication to use hot_standby_delay instead of replicate_row

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: x86_64, Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: selenamarie, Assigned: dumitru)

We had a problem where the replicate_row check returned a false positive.

:dumitru is working on changing this.
Assignee: server-ops → dgherman
There is a bug in the Perl script: it doesn't handle the '--port2' and '--host2' parameters the way the documentation leads you to expect.

If you instead specify the hosts and ports as comma-separated lists, like so:

--port=5432,5432
--host=host1,host2

that should fix the error that we saw on Tuesday.
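As a rough sketch of that workaround, the argument list could be assembled like this. The helper name build_check_args is hypothetical, the host names are placeholders, and only flags that already appear in this bug are used; it builds the arguments but does not run the check.

```python
def build_check_args(hosts, ports, dbuser, dbname, warning, critical):
    """Build an argv list for check_postgres.pl (hypothetical helper).

    Hosts and ports are passed as comma-separated lists on --host/--port
    instead of relying on --host2/--port2.
    """
    if len(hosts) != len(ports):
        raise ValueError("need exactly one port per host")
    return [
        "./check_postgres.pl",
        "--action=hot_standby_delay",
        "--host=" + ",".join(hosts),
        "--port=" + ",".join(str(p) for p in ports),
        "--dbuser=" + dbuser,
        "--dbname=" + dbname,
        "--warning=" + str(warning),
        "--critical=" + str(critical),
    ]


# Placeholder hosts; master listed first, replica second.
args = build_check_args(
    ["host1", "host2"], [5432, 5432], "nagiosdaemon", "breakpad", 10, 20
)
```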
[root@nagios1.private.phx1 mozilla]# ./check_postgres.pl --host=tp-socorro01-master01.phx1.mozilla.com,tp-socorro01-master02.phx1.mozilla.com --dbuser=nagiosdaemon --dbname=breakpad --action=hot_standby_delay  --dbuser2=nagiosdaemon --dbname2=breakpad  --warning=10 --critical=20
Password for user nagiosdaemon:
Password for user nagiosdaemon:
Password for user nagiosdaemon:
Password for user nagiosdaemon:
POSTGRES_HOT_STANDBY_DELAY OK: DB "breakpad" (host:tp-socorro01-master01.phx1.mozilla.com) -25328 | time=0.42s replay_delay=-25328;10;20  receive-delay=-25328;10;20
[root@nagios1.private.phx1 mozilla]# ./check_postgres.pl -V
check_postgres.pl version 2.19.0


So, remember when I first tried the hot_standby_delay check with the older version of the script? It returned the same big negative values.
Yeah, it's because the standby receives WAL between the time the check runs on the master (first) and the check runs on the replica (second).

It's not beautiful, but it does accurately represent the state of the system. They're looking at changing the logic for the script to ask the replica for its location first.
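To illustrate why the sampling order matters, here is a toy model. It is not the script's logic: the linear WAL write rate and the function names are assumptions made up for the example.

```python
def wal_position(t, rate=1000, start=0):
    """Master WAL position in bytes at time t (assumed linear write rate)."""
    return start + rate * t


def measured_delay(master_sample_time, replica_sample_time, replica_lag_bytes=0):
    """Delay = master position minus replica position, each sampled at its own time."""
    master_pos = wal_position(master_sample_time)
    # A caught-up replica (lag 0) is wherever the master is at that moment.
    replica_pos = wal_position(replica_sample_time) - replica_lag_bytes
    return master_pos - replica_pos


# Master sampled first (t=0), replica second (t=2): the replica has already
# received newer WAL, so the difference comes out negative.
master_first = measured_delay(master_sample_time=0, replica_sample_time=2)

# Replica sampled first (t=0), master second (t=2): the difference cannot go
# negative, although it may overstate the real lag.
replica_first = measured_delay(master_sample_time=2, replica_sample_time=0)

print(master_first, replica_first)  # -2000 2000
```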
I see.
Does this mean we can switch to this check? If so, we need to fine-tune it to alert us when the deltas are too high.
Blocks: 823507
Let's try setting the delta to warn at 16777216 (that's in bytes, 16 MB). It should never get that high if things are working.
ping!
Completed:

[10:05] <nagios-phx1> | dumitru: tp-socorro01-master02.phx1.mozilla.com:PostgreSQL Hot Standby Delay is OK - POSTGRES_HOT_STANDBY_DELAY OK: (host:tp-socorro01-master01.phx1.mozilla.com =>
                      tp-socorro01-master02.phx1.mozilla.com) 0 Last Checked: 2013-01-04 10:02:41 PST


So I replaced the "replicate_row" check with "hot_standby_delay".
Thresholds are: warning at 16777216 bytes (16 MB) and critical at 33554432 bytes (32 MB).
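For reference, the threshold arithmetic can be sketched as below. The classify_delay function is hypothetical, and the strict "greater than" comparison is an assumption; the exact comparison the script uses is not shown in this bug.

```python
MIB = 1024 * 1024
WARNING_BYTES = 16 * MIB   # 16777216
CRITICAL_BYTES = 32 * MIB  # 33554432


def classify_delay(delta_bytes, warning=WARNING_BYTES, critical=CRITICAL_BYTES):
    """Map a replication delta in bytes to a Nagios-style state (assumed logic)."""
    if delta_bytes > critical:
        return "CRITICAL"
    if delta_bytes > warning:
        return "WARNING"
    # Zero and negative deltas (replica sampled after it received more WAL) are OK.
    return "OK"
```

Under these assumptions, the -25328 value seen earlier and the 0 reported by Nagios both classify as OK.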
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard