Closed Bug 528573 Opened 15 years ago Closed 14 years ago

Need to drop and reload the b01 database master and slave for disk space

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: All
OS: Other
Type: task
Priority: Not set
Severity: minor

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: justdave, Assigned: justdave)

Details

(Whiteboard: Sun 01/03 12:00pm PST)

The b01 database cluster recently had a database removed that was using about 120 GB of disk space.  Unfortunately it was InnoDB, and the shared InnoDB tablespace only grows; it never shrinks.  To reclaim the disk space, we need to completely drop the data storage and reload it from scratch via mysqldump and import (sketched below).  This requires taking down both the master and the slave and reloading each of them individually.  I expect the process to take between 30 and 60 minutes on each server.
This should not be scheduled the same night as the kernel upgrades, because I won't have time to deal with both at once.
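For reference, here's a rough sketch of the dump-and-reload cycle I have in mind (the dump path and init script location below are made up, not the exact commands we'll run):

  # Sketch only -- dump file path and init script location are assumptions.
  # 1. Dump everything while the server is still up.
  mysqldump --all-databases --single-transaction > /backup/b01-full.sql

  # 2. Stop mysqld and remove the shared InnoDB tablespace (which never
  #    shrinks) along with its log files.
  /etc/init.d/mysql stop
  rm /var/lib/mysql/ibdata1 /var/lib/mysql/ib_logfile*

  # 3. Restart with a fresh, small tablespace and reload the dump.
  /etc/init.d/mysql start
  mysql < /backup/b01-full.sql

This would be run on the master and then repeated on the slave, per the above.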
Flags: needs-downtime+
What if someone else is handling kernel upgrades?
Sure.  This will actually take a while to run on both ends, so I can start it and go work on kernels while I wait for it to finish, I guess.
I'd be glad to give Dave a hand with anything he needs.
this will affect graphs.mozilla.org for the downtime...
Assignee: server-ops → justdave
(In reply to comment #5)
> this will affect graphs.mozilla.org for the downtime...

Does this mean that http posts to graphs.m.o would fail out? If so, talos jobs will fail out - we'd need to coordinate closing the tree for the duration.

When are you thinking of doing this? And approx how long would it take?
(In reply to comment #6)
> (In reply to comment #5)
> > this will affect graphs.mozilla.org for the downtime...
> 
> Does this mean that http posts to graphs.m.o would fail out? If so, talos jobs
> will fail out - we'd need to coordinate closing the tree for the duration.

Yes.

> When are you thinking of doing this? And approx how long would it take?

Tuesday night, but I'm open to changing that around to accommodate.  I suspect it'll take somewhere between 15 and 60 minutes.
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > this will affect graphs.mozilla.org for the downtime...
> > 
> > Does this mean that http posts to graphs.m.o would fail out? If so, talos jobs
> > will fail out - we'd need to coordinate closing the tree for the duration.
> Yes.
ok.

> > When are you thinking of doing this? And approx how long would it take?
> Tuesday night, but I'm open to changing that around to accommodate.  I suspect
> it'll take somewhere between 15 and 60 minutes.
We already have a Talos downtime scheduled for 9am-11am Monday (see dev.planning "Talos downtime, Monday November 16th 9-11am PST"). It would be great if we could (safely) do all this in one downtime. Would your db work be ready to go ride-along Monday morning?
This will also affect the following sites:

bonsai
buildbot
despot
graphs_mozilla_org
litmus
viewvc_svn
MDC
Guessing this didn't happen on the 16th.  When can this happen this week?
Whiteboard: 12/03/2009 @ 8pm
There was a bunch of miscommunication about this earlier today: we were apparently going to try to do this at 9am, but no downtime notice got sent for it, and the decision to do it at 9am was only made yesterday afternoon.  Because MDC, SVN, and Litmus are affected, we either need to do this in a normal Tuesday/Thursday downtime window, or give 24 hours or more of advance notice when the downtime notice goes out if we do it outside one of those windows, because this will be a (rather long) user-facing outage.
Assignee: justdave → mrz
Whiteboard: 12/03/2009 @ 8pm → 12/17/2009 @ 8pm
Assignee: mrz → justdave
So when's a good time for build for us to do this?
per beltzner: It's blocked until after we do the 3.6 RC builds.
Whiteboard: 12/17/2009 @ 8pm → "Really Soon Now" according to joduinn
whenever we get to this, we probably want to try to do bug 535859 at the same time.
Whiteboard: "Really Soon Now" according to joduinn → Sun 01/03 12:00pm PST
reminder to myself: I want to reconfigure InnoDB with innodb_file_per_table while we do this, which will prevent this situation from coming up again in the future.
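The change itself would look roughly like the following (sketch only; the config path is an assumption):

  # Enable file-per-table so each InnoDB table gets its own .ibd file;
  # dropping a table or database then returns its space to the filesystem.
  # /etc/my.cnf path is an assumption.
  cat >> /etc/my.cnf <<'EOF'
  [mysqld]
  innodb_file_per_table = 1
  EOF

Doing this before the reload matters: the option only applies to tables created after it's enabled, so reloading the dump afterwards puts every table in its own file.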
reload began around 12:20pm (right after the firmware updates from bug 535859 completed).  As of now, it's still in progress.  tm-b01-slave01 will probably come back around 15 minutes later than master01, because I forgot to run screen first and got disconnected 10 minutes in... :|  Fortunately, master01 was running in screen (I did get disconnected from both at the same time).
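(For the record, the screen pattern that would have saved me on the slave, sketched with a made-up session name and dump path:)

  # Start the reload detached so an SSH drop can't kill it, then reattach at will.
  screen -dmS b01-reload bash -c 'mysql < /backup/b01-full.sql'
  screen -r b01-reload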
master's up, slave should be done any minute now.
and the slave is up.  All done.  Just under the wire. :)
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard