Request to reboot tree-closing database servers during next maintenance window

RESOLVED FIXED

Status

Infrastructure & Operations
Change Requests
5 years ago
3 years ago

People

(Reporter: cyborgshadow, Unassigned)

(Reporter)

Description

5 years ago
date, time, duration of maintenance
Next maintenance window

system(s) affected
generic1.db.scl3
generic3.db.phx1
buildbot1
builder-addons1
sentry1
tbpl1
db1.iddb
bugzilla1.db.scl3


end-user impact
Each database will be unavailable for roughly 5-10 minutes while its server reboots.

maintenance plan and timeline (link to a wiki or etherpad is fine)
This is only a server reboot, to apply changes that puppet has already deployed and that have been tested and proven to work (adding noatime/nodiratime to the data volume).

rollback plan / rollback point (at which point will you determine to roll back)
If the system fails to reboot, we can PXE boot it.


notification mechanisms
Normal maintenance window downtime.

who will be point, who else will be involved 
DB team will be point and handle all reboots. If any tree-closing apps can't re-establish their db connection safely, their team should be involved.
(Reporter)

Updated

5 years ago
Flags: cab-review?
(Reporter)

Comment 1

5 years ago
Per a request made during CAB: when we reboot generic3 in phx1, let's coordinate a shutdown of the etherpad app first, before the db goes down.

Updated

5 years ago
Blocks: 917928

Updated

5 years ago
Depends on: 917929
Tentatively approved for the next tree-closing window, Oct 12th. CC'ing some service owners so they know of the potential impact.
Group: infra
Flags: cab-review? → cab-review+
Blocks: 919081
We realized we did not actually need to perform a reboot. We were changing mountpoint options to be more efficient, and because the change goes through puppet, puppet remounts the directories right away. In tests, machines had no problems remounting /, so we just did it without rebooting.
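
For anyone wanting to confirm that a live remount like the one described above actually took effect, here is a minimal sketch in Python. It is not the check the DB team ran; the function names are hypothetical, and it assumes a Linux host where the text of /proc/mounts lists one mount per line as "device mountpoint fstype options ...". It only looks for the noatime option, which is an assumption about what is sufficient to check.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of live options for mount_point, or None if not mounted.

    mounts_text is the content of /proc/mounts (assumed format:
    device, mount point, fs type, comma-separated options, ...).
    Mount points containing spaces are escaped in that file and are
    not handled by this sketch.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later entry for the same mount point shadows earlier ones.
            options = set(fields[3].split(","))
    return options


def has_noatime(mounts_text, mount_point):
    """True if mount_point is mounted and its live options include noatime."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "noatime" in opts
```

On a real host this could be called as has_noatime(open("/proc/mounts").read(), "/"). A False result means the filesystem is either not mounted at that path or is still running with its old options.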

All of the following were done today:
generic1.db.scl3
generic3.db.phx1
buildbot1
builder-addons1
sentry1
tbpl1
bugzilla1.db.scl3

This one was not done:
db1.iddb

It is the identity db and is not puppetized by us, and I was not about to live-remount a system without having tested first (especially when I would have been remounting /).

We have a spreadsheet with what's done and not done at: https://docs.google.com/a/mozilla.com/spreadsheet/ccc?key=0AvGP1OghOtJSdC1FTnlTQmtxZVRkbG1NM1FlYkUtQlE&usp=drive_web#gid=0
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations

Updated

3 years ago
Cab Review: --- → approved
Flags: cab-review+