Closed Bug 762816 Opened 12 years ago Closed 12 years ago

Nagios change: take away 1 am downtime on intranet2

Tracking

(Not tracked)

Status:

VERIFIED INVALID

People

(Reporter: scabral, Assigned: afernandez)

Details

Sheeri Cabral [:sheeri]

Reporter

Description

•

12 years ago

Between 1 and 2 am Pacific (4-5 am Eastern) I get paged about intranet2 being behind in replication. This is likely due to jabba's defrag, which runs nightly. Let's downtime the replication-lag check during this maintenance. (We have to look up exactly when the maintenance runs, then check the Nagios history to see how long the downtime should be; the downtime is only for replication lag.)
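One way to express the requested downtime is Nagios's `SCHEDULE_SVC_DOWNTIME` external command, which takes a host, a service description, start and end timestamps, and a duration. The sketch below only builds the command line. The host name `intranet2`, the service description `MySQL Replication`, the author, the one-hour window, and the fixed UTC-7 offset are illustrative assumptions, not values confirmed in this bug.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the maintenance window is expressed at a fixed UTC-7 offset.
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7))


def downtime_command(host, service, start, duration, author, comment, now=None):
    """Build one SCHEDULE_SVC_DOWNTIME line for the Nagios command file."""
    now = now or datetime.now(timezone.utc)
    end = start + duration
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        str(int(start.timestamp())),        # start_time (epoch seconds)
        str(int(end.timestamp())),          # end_time (epoch seconds)
        "1",                                # fixed downtime
        "0",                                # trigger_id (none)
        str(int(duration.total_seconds())), # duration in seconds
        author,
        comment,
    ]
    return "[{}] {}".format(int(now.timestamp()), ";".join(fields))


# Hypothetical window: 01:00 to 02:00 at UTC-7 on one night.
start = datetime(2012, 6, 10, 1, 0, tzinfo=PACIFIC_DAYLIGHT)
line = downtime_command(
    "intranet2", "MySQL Replication", start, timedelta(hours=1),
    "dba", "nightly maintenance", now=start,
)
print(line)
```

Because the downtime names a single service, other checks on the same host would keep alerting, which matches the request that the downtime cover replication lag only.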

Adrian J Fernandez [:Aj]

Assignee

Updated

•

12 years ago

Assignee: server-ops-database → afernandez

Sheeri Cabral [:sheeri]

Reporter

Comment 1

•

12 years ago

[root@puppetdashboard1.private.phx1 cron.d]# hostname
puppetdashboard1.private.phx1.mozilla.com
[root@puppetdashboard1.private.phx1 cron.d]# crontab -l
# HEADER: This file was autogenerated at Wed May 30 17:08:40 -0700 2012 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: homeclean
MAILTO=infra-notices@mozilla.com
0 3 * * * /usr/local/bin/homeclean.sh > /dev/null
# Puppet Name: prune-reports
0 1 * * * cd /usr/share/puppet-dashboard/; /usr/bin/rake RAILS_ENV=production reports:prune upto=4 unit=day
# Puppet Name: optimize-db
0 4 * * 0 cd /usr/share/puppet-dashboard/; /usr/bin/rake RAILS_ENV=production db:raw:optimize
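As a rough cross-check of the crontab above against the reported page times, the sketch below converts each job's start hour to US Eastern time. It assumes the host clock uses the -0700 offset shown in the crontab header and that Eastern time is at UTC-4 (daylight time); both offsets are hard-coded assumptions, and the date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

HOST_TZ = timezone(timedelta(hours=-7))     # offset taken from the crontab header
EASTERN_TZ = timezone(timedelta(hours=-4))  # assumed: US Eastern daylight time


def cron_start_in_eastern(hour, minute=0):
    """Convert a cron start time on the host to an HH:MM string in Eastern time."""
    local = datetime(2012, 6, 10, hour, minute, tzinfo=HOST_TZ)
    return local.astimezone(EASTERN_TZ).strftime("%H:%M")


# Job names and hours are copied from the crontab listing.
jobs = [("prune-reports", 1), ("homeclean", 3), ("optimize-db", 4)]
for name, hour in jobs:
    print(name, cron_start_in_eastern(hour))
```

Under these assumptions the `prune-reports` job starts at 04:00 Eastern, which lines up with the pages described in this bug, although the thread does not confirm it as the cause.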

Sheeri Cabral [:sheeri]

Reporter

Comment 2

•

12 years ago

Is there an ETA on this? I get paged around 4:45 am Eastern every morning, and it's getting a bit tiresome.

Adrian J Fernandez [:Aj]

Assignee

Comment 3

•

12 years ago

:sheeri, for the time being I increased the replication lag threshold, so it shouldn't page you at ~4 am EST.

Basically did the same thing that was done in Bug 760789.

Will fix correctly on Monday.
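To illustrate why raising a lag threshold stops the pages, here is a minimal sketch of a threshold check that returns the standard Nagios plugin exit codes. The function name and all threshold and lag values are hypothetical; the actual check and the values used on these hosts are not given in this bug.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def check_replication_lag(seconds_behind, warn, crit):
    """Map a replication lag in seconds to a Nagios status code."""
    if seconds_behind >= crit:
        return CRITICAL
    if seconds_behind >= warn:
        return WARNING
    return OK


# Hypothetical lag observed during the nightly maintenance.
lag = 1200

strict = check_replication_lag(lag, warn=300, crit=600)     # pages
relaxed = check_replication_lag(lag, warn=1800, crit=3600)  # stays quiet
print(strict, relaxed)
```

The trade-off is that a relaxed threshold applies around the clock, so genuine lag outside the maintenance window is also reported later, whereas a scheduled downtime silences only the known window.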

Adrian J Fernandez [:Aj]

Assignee

Comment 4

•

12 years ago

Just as a general update: as far as paging goes, it's working as intended.
No rush on fixing it correctly, but it's on the TODO list.

Sheeri Cabral [:sheeri]

Reporter

Updated

•

12 years ago

Summary: downtime intranet2 during the 1 am hour maintenance → Nagios change: downtime intranet2 during the 1 am hour maintenance

Ashish Vijayaram [:ashish]

Comment 5

•

12 years ago

Is this regular downtime still required? Since the last update, the puppetdashboard DB servers are no longer under DBA monitoring. I'm not sure whether that is intentional; I'm just adding the observation here.

Sheeri Cabral [:sheeri]

Reporter

Comment 6

•

12 years ago

Good point. In fact, intranet doesn't even house puppetdashboard any more.

We should probably take the increased replication lag threshold off intranet2, actually.

Sheeri Cabral [:sheeri]

Reporter

Updated

•

12 years ago

Summary: Nagios change: downtime intranet2 during the 1 am hour maintenance → Nagios change: take away 1 am downtime on intranet2

Adrian J Fernandez [:Aj]

Assignee

Comment 7

•

12 years ago

The actual change that I made was on puppetdashboard2.db.phx1.mozilla.com, which used "mysql-lazy-repl".

It seems these hosts were reinstalled some time ago, and the current checks only have:
"generic"
"hp-servers"

As for the purpose of this bug, there is nothing else to do.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → INVALID

Sheeri Cabral [:sheeri]

Reporter

Comment 8

•

12 years ago

That's OK by me, verifying. Thanx!

Status: RESOLVED → VERIFIED

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → Data & BI Services Team


Categories

(Data & BI Services Team :: DB: MySQL, task)


Security

(public)
