Closed Bug 762816 Opened 12 years ago Closed 12 years ago

Nagios change: take away 1 am downtime on intranet2

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

VERIFIED INVALID

People

(Reporter: scabral, Assigned: afernandez)

Details

Between 1-2 am Pacific (4 am Eastern) I get paged about intranet2 being behind in replication. This is likely due to jabba's nightly defrag. Let's downtime the replication-lag check during this maintenance. (We have to look up exactly when the maintenance is, then check the Nagios history to see how long the downtime should be; the downtime is just for replication lag.)
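For reference, a minimal Python sketch of what a one-hour service downtime for this window could look like as a Nagios external command. The service description, author, comment and the calendar date are made up for illustration, and the fixed UTC offsets assume US daylight time; the real check name and window length would come from the Nagios config and history.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC offsets for US daylight time (the bug was filed in summer).
PDT = timezone(timedelta(hours=-7))
EDT = timezone(timedelta(hours=-4))


def svc_downtime_command(host, service, start, minutes, author, comment, now=None):
    """Format a fixed SCHEDULE_SVC_DOWNTIME line for the Nagios command file."""
    now = now or datetime.now(timezone.utc)
    end = start + timedelta(minutes=minutes)
    start_ts, end_ts = int(start.timestamp()), int(end.timestamp())
    fields = [
        "SCHEDULE_SVC_DOWNTIME", host, service,
        str(start_ts), str(end_ts),
        "1",                     # fixed downtime (starts and ends at the given times)
        "0",                     # trigger_id 0: not triggered by another downtime
        str(end_ts - start_ts),  # duration in seconds
        author, comment,
    ]
    # The leading bracketed value is the submission time as a Unix timestamp.
    return "[{}] {}".format(int(now.timestamp()), ";".join(fields))


# Hypothetical window: 1 am Pacific on an arbitrary date, lasting 60 minutes.
start = datetime(2012, 6, 9, 1, 0, tzinfo=PDT)
print(start.astimezone(EDT).strftime("%H:%M"))  # the same instant on the Eastern clock
print(svc_downtime_command("intranet2", "MySQL Replication Lag", start, 60,
                           "dba", "nightly maintenance window", now=start))
```

The resulting line would be appended to the Nagios external command file, whose path depends on the installation. It only covers a single night, so a recurring window would need to be rescheduled daily or handled another way.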
Assignee: server-ops-database → afernandez
[root@puppetdashboard1.private.phx1 cron.d]# hostname
puppetdashboard1.private.phx1.mozilla.com
[root@puppetdashboard1.private.phx1 cron.d]# crontab -l
# HEADER: This file was autogenerated at Wed May 30 17:08:40 -0700 2012 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: homeclean
MAILTO=infra-notices@mozilla.com
0 3 * * * /usr/local/bin/homeclean.sh > /dev/null
# Puppet Name: prune-reports
0 1 * * * cd /usr/share/puppet-dashboard/; /usr/bin/rake RAILS_ENV=production reports:prune upto=4 unit=day
# Puppet Name: optimize-db
0 4 * * 0 cd /usr/share/puppet-dashboard/; /usr/bin/rake RAILS_ENV=production db:raw:optimize
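To narrow down which of these jobs starts in a given hour, here is a rough Python sketch that filters the crontab listing above by its hour field. The function name `jobs_in_hour` is made up for illustration, and the sketch only understands a literal hour or `*`, not ranges, lists or step values.

```python
CRONTAB = """\
# Puppet Name: homeclean
MAILTO=infra-notices@mozilla.com
0 3 * * * /usr/local/bin/homeclean.sh > /dev/null
# Puppet Name: prune-reports
0 1 * * * cd /usr/share/puppet-dashboard/; /usr/bin/rake RAILS_ENV=production reports:prune upto=4 unit=day
# Puppet Name: optimize-db
0 4 * * 0 cd /usr/share/puppet-dashboard/; /usr/bin/rake RAILS_ENV=production db:raw:optimize
"""


def jobs_in_hour(crontab_text, hour):
    """Return (puppet_name, schedule, command) for jobs whose hour field matches."""
    jobs = []
    name = None
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if line.startswith("# Puppet Name:"):
            # Remember the label; it applies to the next job line.
            name = line.split(":", 1)[1].strip()
            continue
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # not a job line, e.g. MAILTO=...
        hour_field = parts[1]
        if hour_field == "*" or hour_field == str(hour):
            jobs.append((name, " ".join(parts[:5]), parts[5]))
        name = None
    return jobs


for job_name, schedule, command in jobs_in_hour(CRONTAB, 1):
    print(job_name, "|", schedule, "|", command)
```

The hours are in the host's local time, so they need converting before comparing them with page times reported in another time zone.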
Is there an ETA on this? I get paged around 4:45 am Eastern every morning, and it's getting a bit tiresome.
:sheeri, for the time being I increased the replication lag threshold, so it shouldn't page you at ~4 am EST.

Basically did the same thing that was done in Bug 760789.

Will fix correctly on Monday.
Just a general update: as far as paging goes, it's working as intended.
No rush on fixing it correctly, but it's on the TODO list.
Summary: downtime intranet2 during the 1 am hour maintenance → Nagios change: downtime intranet2 during the 1 am hour maintenance
Is this regular downtime still required? Since the last update, the puppetdashboard DB servers are no longer under DBA monitoring. I'm not sure whether that is intentional; I'm just adding the observation here.
Good point. In fact, intranet doesn't even house puppetdashboard any more.

We should probably take the increase in replication lag off intranet2 actually.
Summary: Nagios change: downtime intranet2 during the 1 am hour maintenance → Nagios change: take away 1 am downtime on intranet2
The actual change I made was on puppetdashboard2.db.phx1.mozilla.com, which used "mysql-lazy-repl".

It seems these hosts were reinstalled some time ago, and the current checks only have:
"generic"
"hp-servers"

As for the purpose of this bug, nothing else to do.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
That's OK by me, verifying. Thanx!
Status: RESOLVED → VERIFIED
Product: mozilla.org → Data & BI Services Team