Closed Bug 917856 Opened 11 years ago Closed 11 years ago

Request to reboot tree-closing database servers during next maintenance window

Categories

(Infrastructure & Operations :: Change Requests, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bjohnson, Unassigned)

References

Details

date, time, duration of maintenance
Next maintenance window

system(s) affected
generic1.db.scl3
generic3.db.phx1
buildbot1
builder-addons1
sentry1
tbpl1
db1.iddb
bugzilla1.db.scl3


end-user impact
Each database will be unavailable for roughly 5-10 minutes while its server reboots.

maintenance plan and timeline (link to a wiki or etherpad is fine)
This is only a server reboot, to apply changes already deployed by puppet that have been tested and proven to work (adding noatime/nodiratime to the data volume).

rollback plan / rollback point (at which point will you determine to roll back)
If the system fails to reboot, we can PXE boot it.


notification mechanisms
Normal maintenance window downtime.

who will be point, who else will be involved 
DB team will be point and handle all reboots. If any tree-closing apps can't re-establish their db connection safely, their team should be involved.
Flags: cab-review?
Per request during CAB: when we reboot generic3 in phx1, let's coordinate a shutdown of the etherpad app first, before the db goes down.
Blocks: 917928
Depends on: 917929
Tentatively approved for the next tree-closing window, Oct 12th. CC'ing some service owners so they know of the potential impact.
Group: infra
Flags: cab-review? → cab-review+
Blocks: 919081
We realized we did not actually need to perform a reboot. We were changing mountpoint options to be more efficient, and because the change is made through puppet, puppet remounts the directories right away. In tests, machines had no problems remounting /, so we just did it without rebooting.
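
One way to confirm that a live remount like this took effect is to look at the options the kernel reports for each mount point. The following is only a minimal sketch in Python, assuming the Linux /proc/mounts line format (device, mount point, filesystem type, options, ...). The helper names, the sample device names, and the /data path are hypothetical and are not taken from this bug. It checks only for noatime, since on Linux noatime also covers directory access times.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of options reported for mount_point, or None if absent.

    mounts_text is text in /proc/mounts format:
    device, mount point, filesystem type, comma-separated options, ...
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        if fields[1] == mount_point:
            # Keep the last match: a later entry shadows an earlier one.
            found = set(fields[3].split(","))
    return found


def has_noatime(mounts_text, mount_point):
    """True if mount_point is mounted and reports the noatime option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "noatime" in options


# Hypothetical sample: / was remounted with the new options, /data was not.
sample = (
    "/dev/sda1 / ext4 rw,noatime,nodiratime 0 0\n"
    "/dev/sdb1 /data ext4 rw,relatime 0 0\n"
)

for point in ("/", "/data"):
    print(point, has_noatime(sample, point))
```

On a real host the same functions could be given the contents of /proc/mounts instead of the sample string.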

All of the following were done today:
generic1.db.scl3
generic3.db.phx1
buildbot1
builder-addons1
sentry1
tbpl1
bugzilla1.db.scl3

This one was not done:
db1.iddb

It is the identity db and is not puppetized by us, and I was not about to live-remount a system without testing it first (especially when I would have been remounting /).

We have a spreadsheet with what's done and not done at: https://docs.google.com/a/mozilla.com/spreadsheet/ccc?key=0AvGP1OghOtJSdC1FTnlTQmtxZVRkbG1NM1FlYkUtQlE&usp=drive_web#gid=0
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Change Request: --- → approved
Flags: cab-review+