Bug 614786 - Rotate ftp staging site to new disk array
Status: RESOLVED FIXED
[downtime 3 of 3 on Thu 2/3 6am PDT]
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Hardware: x86 Linux
Importance: -- major
Target Milestone: ---
Assigned To: Dave Miller [:justdave] (justdave@bugzilla.org)
QA Contact: matthew zeier [:mrz]
Duplicates: 629064
Depends on: 617626
Blocks: 601025 625979 630538 630541
Reported: 2010-11-25 03:10 PST by Dave Miller [:justdave] (justdave@bugzilla.org)
Modified: 2015-03-12 08:17 PDT
CC: 15 users
justdave: needs-downtime+
bhearsum: needs-treeclosure+

Description Dave Miller [:justdave] (justdave@bugzilla.org) 2010-11-25 03:10:19 PST
Here's the current situation:

Available disk arrays:
    Filesystem            Size  Used Avail Use%
[A] Netapp #1             3.1T  2.7T  408G  87%
[B] EQL via NFS           2.0T  1.8T  247G  88%
[C] Netapp #2             4.8T  2.8T  2.0T  59%

Currently mounted as:
[A] /pub/mozilla.org
[B] /pub/mozilla.org/firefox
[C] not yet in use
With 5.1 TB of total space available

The plan:
[A] /pub/mozilla.org/firefox
[B] going away
[C] /pub/mozilla.org
With 7.9 TB of total space available

The original plan was to recombine everything onto [C], but since we're already using 4.5 TB (against the 4.8 TB available on the new drive), I think it makes more sense to keep the old netapp array in the mix and eliminate the iscsi-over-NFS hack.  Doing it this way eliminates the performance issues of the iscsi-over-NFS setup we're currently using, and adds 1.8 TB of disk capacity instead of removing 0.3 TB.

Doing this move is going to require TWO downtimes.
1) Move [A] to [C]
2) Move [B] to [A]

We obviously can't do #2 until #1 is done, and there will be additional prep required between the two steps.

An initial sync of [A] to [C] has already been completed (as evidenced by the disk usage in the table at the top).  Incremental syncs have been tested to take approximately 70 minutes per run.  To ensure no data loss, we'll need to make sure nobody can write to the disk during the final sync, before remounting the drives in their swapped positions.  I would recommend advertising a 2-hour outage for this; we'll probably have it up and running again well before that.  Our technology has improved: the last time we did this (with only 1.5 TB of data at the time) it took over 6 hours. :)

The amount of time required for the [B]-to-[A] move is unknown, and there's no way to test it until [A] is freed up by the first move.  I suspect it'll take longer, despite the smaller dataset, because of the NFS-via-Linux step in the middle getting the data off the old drive; but that's only a theory until it's actually tested.
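The final-sync flow described above can be sketched roughly as follows; the mount points and rsync flags here are illustrative assumptions, not the production values (the real command is echoed rather than executed so the sketch is side-effect free):

```shell
#!/bin/sh
# Hedged sketch of the step-1 final sync. SRC/DST are hypothetical
# mount points standing in for [A] and [C]; the real paths differ.
SRC=/mnt/netapp1/stage/   # [A] old netapp array (placeholder)
DST=/mnt/netapp2/stage/   # [C] new netapp array (placeholder)

final_sync() {
    # Writers are blocked first (comment 44 uses /etc/nologin for that),
    # then one last incremental pass; --delete keeps the copy exact.
    echo "rsync -aH --delete $SRC $DST"
}

final_sync
```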
Comment 1 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-11-25 03:13:36 PST
A graphical diagram of the current setup is available at http://people.mozilla.org/~justdave/MirrorNetwork.pdf
Comment 2 Ben Hearsum (:bhearsum) 2010-11-25 12:24:10 PST
Step 1 sounds like it can be done in whatever the next downtime is. I'm on buildduty most of next week, we can probably figure something out. Could Step 2 wait until the holidays, when it's easier to get longer downtime windows?
Comment 3 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-11-25 12:28:20 PST
Sure.  We'll probably want to wait until we get a trial run on the incremental syncs for step 2 before deciding how long to wait.  It may surprise us and go faster for all we know.  Then again, it may not.  :)
Comment 4 Nick Thomas [:nthomas] 2010-11-25 13:58:35 PST
Sounds like a good plan to me, with the added advantage that the netapp partitions can be resized (storage permitting).

(In reply to comment #0)
> The amount of time required for the [B]-to-[A] move is unknown, and there's no
> way to test it until [A] is freed up by the first move.  I suspect it'll take
> longer, despite the smaller dataset, because of the NFS-via-Linux step in the
> middle getting the data off the old drive; but that's only a theory until it's
> actually tested.

No doubt you already thought of doing the syncs on dm-ftp01.m.o to avoid the extra NFS hop.
Comment 5 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-11-26 09:48:12 PST
(In reply to comment #4)
> No doubt you already thought of doing the syncs on dm-ftp01.m.o to avoid the
> extra NFS hop.

Actually, the thought had slipped my mind, but that's a good idea.  We'll have to fix the ACLs to allow us to mount it read/write over there (it's read-only currently), but that's certainly doable.
Comment 6 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-12-08 12:39:35 PST
justdave: could this be done on 17th? zandr will be doing a tree-closing downtime in bug#616658 that day, so it would be great to do this at the same time.
Comment 7 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-12-08 15:05:40 PST
Depends on the time of day.  I'll be doing my RHEL6 recertification exam for my RHCE that day, 9am to 4:30pm Central time, and given that it's downtown Chicago, I'd allow at least 90 minutes travel time to get back to my sister's place and get online afterwards.  So I guess if we're talking after 5pm pacific it'd probably work.
Comment 8 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-12-14 15:47:53 PST
(In reply to comment #6)
> justdave: could this be done on 17th? zandr will be doing a tree-closing
> downtime in bug#616658 that day, so it would be great to do this at the same
> time.

(In reply to comment #7)
> Depends on the time of day.  I'll be doing my RHEL6 recertification exam for my
> RHCE that day, 9am to 4:30pm Central time, and given that it's downtown
> Chicago, I'd allow at least 90 minutes travel time to get back to my sister's
> place and get online afterwards.  So I guess if we're talking after 5pm pacific
> it'd probably work.

Per zandr, the downtime will be from 8am to 5pm (Pacific), but that includes time for spinning back up systems after the recabling work is finished. 

On the 17th, your window would be from 8am to 2pm (Pacific). Does that work for you? If not, is there someone else in IT who can do this on your behalf?
Comment 9 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-12-14 23:21:45 PST
10am pacific might work.  I've actually got two separate exams, one is 9:00 to 11:30 central, the other 2:00 to 4:30 central, so other than grabbing lunch, I'll basically be sitting around doing nothing for 2.5 hours between the two exams.  With the estimated runtime for the switch being 90 minutes that'll probably be enough time to do it.
Comment 10 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-12-15 12:33:09 PST
*** Bug 617626 has been marked as a duplicate of this bug. ***
Comment 11 Ben Hearsum (:bhearsum) 2010-12-17 09:36:28 PST
Happening today.
Comment 12 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-12-18 10:30:52 PST
Part one happened yesterday.

Timing for part 2 will depend on figuring out how long it'll take to sync the filesystems.  I expect the initial sync to take a day or two, and the followup syncs will determine how long of an outage we need.

Is RelEng happy with the state of stage right now? (data integrity I mean).  The next step is to wipe out the contents of the array we just vacated in prep for copying the firefox stuff into it, and I want to make sure we don't need it for a data reversion or something first.
Comment 13 Ben Hearsum (:bhearsum) 2010-12-22 13:34:42 PST
(In reply to comment #12)
> Part one happened yesterday.
> 
> Part 2 will depend on timing figuring out how long it'll take to sync the
> filesystems.  I expect the initial sync to take a day or two, and the followup
> syncs will determine how long of an outage we need.
> 
> Is RelEng happy with the state of stage right now? (data integrity I mean). 
> The next step is to wipe out the contents of the array we just vacated in prep
> for copying the firefox stuff into it, and I want to make sure we don't need it
> for a data reversion or something first.

Per IRC, we're happy with things and haven't seen any issues. Go ahead.
Comment 14 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-10 11:34:43 PST
ok, so to cleanly copy this stuff over to the new partition, I need to remove a couple of the bind mounts on dm-ftp01.  This *shouldn't* affect anything visible to production, but it depends on the order the mounts were initially set up, and there's a really slim chance that the tryserver and tinderbox directories might briefly disappear.

> * 10.253.0.139:/data/try-builds on /mnt/cm-ixstore01/try-builds type nfs (rw,noatime,rsize=32768,wsize=32768,nfsvers=3,proto=tcp,addr=10.253.0.139)

> * 10.253.0.139:/data/tinderbox-builds on /mnt/cm-ixstore01/tinderbox-builds type nfs (rw,noatime,rsize=32768,wsize=32768,nfsvers=3,proto=tcp,addr=10.253.0.139)

> * /mnt/eql/builds/firefox on /mnt/netapp/stage/archive.mozilla.org/pub/firefox type bind (ro,bind,_netdev)

> X /mnt/cm-ixstore01/try-builds/trybuilds on /mnt/eql/builds/firefox/tryserver-builds/old type none (rw,bind)

> * /mnt/cm-ixstore01/try-builds/trybuilds on /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tryserver-builds/old type bind (ro,bind,_netdev)

> X /mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds on /mnt/eql/builds/firefox/tinderbox-builds type bind (ro,bind,_netdev)

> * /mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds on /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tinderbox-builds type bind (ro,bind,_netdev)

The two with the X in front are the two I need to get rid of.  The ones under /mnt/netapp/stage are the ones that are visible on stage.m.o and ftp.m.o.  *IF* the ixstore mounts were mounted into eql before eql was mounted into netapp, *THEN* there's a chance that those directories will disappear from netapp when I unmount them from eql, which will require the netapp versions of those bind mounts to be unmounted and remounted.  If they were mounted afterwards, then they won't disappear and the production directories won't be affected.

Just to be safe, we're scheduling a downtime to do the unmounts.  This is tentatively Wed Jan 12 during EST AM.
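The mount-order question above can be checked ahead of time, since /proc/mounts lists mounts in the order they were established. A rough sketch, using a made-up two-line sample instead of the real dm-ftp01 output:

```shell
#!/bin/sh
# Predict whether unmounting the eql bind mounts will cascade into the
# netapp view, based on which mount was established first. The list
# below is an illustrative sample, not real /proc/mounts contents.
mounts='/mnt/eql/builds/firefox
/mnt/cm-ixstore01/try-builds/trybuilds'

first=$(printf '%s\n' "$mounts" | head -n 1)
case $first in
    /mnt/cm-ixstore01/*) echo "ixstore mounted before the eql bind: expect the netapp copies to vanish" ;;
    *)                   echo "ixstore mounted after the eql bind: production paths should be unaffected" ;;
esac
```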
Comment 15 Mike Taylor [:bear] 2011-01-10 13:24:57 PST
We may not be able to hit this downtime: even though we have all our ducks in a row, we still have to run this completely up the chain-of-command flagpole.

So, started that process just now and have tossed the ball to zandr since he can better coordinate with IT - you guys let me know when this gets scheduled.
Comment 16 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-12 06:35:32 PST
OK, the process in step 14 has been completed.  Turns out they were mounted in the correct order, so we did *not* wind up having any downtime on the production paths, and we could have gotten away with not shutting everything down after all.  Better safe than sorry though, since there wasn't any guarantee in advance.

Next step is the final cutover, timing on that will depend on how long an incremental rsync between the two partitions takes, which will probably take me a couple days to determine.
Comment 17 Ben Hearsum (:bhearsum) 2011-01-12 08:43:58 PST
There's some fallout from this morning, http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/ is empty.
Comment 18 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-12 09:28:25 PST
surf proxies it to dm-ftp01, which for some reason had the httpd docroot pointed at the mount points we removed instead of the supposed-to-be-public-facing ones.  Changed httpd to point at the correct ones; works now (as of 08:47).
Comment 19 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-14 12:22:34 PST
(In reply to comment #16)
> OK, the process in step 14 has been completed.  Turns out they were mounted in
> the correct order, so we did *not* wind up having any downtime on the
> production paths, and we could have gotten away with not shutting everything
> down after all.  Better safe than sorry though, since there wasn't any
> guarantee in advance.
> 
> Next step is the final cutover, timing on that will depend on how long an
> incremental rsync between the two partitions takes, which will probably take me
> a couple days to determine.

Any ETA?
Comment 20 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-26 12:27:56 PST
*** Bug 629129 has been marked as a duplicate of this bug. ***
Comment 21 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-26 12:29:23 PST
(In reply to comment #19)
> (In reply to comment #16)
> > OK, the process in step 14 has been completed.  Turns out they were mounted in
> > the correct order, so we did *not* wind up having any downtime on the
> > production paths, and we could have gotten away with not shutting everything
> > down after all.  Better safe than sorry though, since there wasn't any
> > guarantee in advance.
> > 
> > Next step is the final cutover, timing on that will depend on how long an
> > incremental rsync between the two partitions takes, which will probably take me
> > a couple days to determine.
> 
> Any ETA?

justdave/zandr: Any ETA? 

Bumping priority based on comment in bug#629129:
"We've got a few alerts about this partition the past couple of weeks. Right now
we're sitting at about 95G (~5%) free. We're not going to last much longer with
this though, we increase use by many GB per day, for nightlies.

I know some people, Joduinn and justdave in particular, chatted about stage
disk space in the past 6 months, but other than some new mounts for older
try/dep builds, I don't know what came out of it.

In any case, this will require action in the near future."
Comment 22 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-27 00:10:58 PST
*** Bug 629064 has been marked as a duplicate of this bug. ***
Comment 23 Ben Hearsum (:bhearsum) 2011-01-27 09:18:10 PST
Even after getting us back to > 100G yesterday, Nagios went off again:
11:56 <nagios> [47] surf:disk - /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING - free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 73232 MB (4%):

Elevated try load is part of this, and we'll probably gain some space on Monday when many of this week's builds are archived to a different partition, but we'll certainly spike again next Thursday/Friday.
Comment 24 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-27 10:05:37 PST
This is being hampered by the large number of tryserver builds getting submitted in the last week or so (around 100 per day!), since those are stored on the partition we're trying to move.  An rsync of 3 days' worth just finished and took over 22 hours.  I've got another rsync running now picking up that 22 hours' worth of changes.  Making this happen is going to require finding a time of day when the least amount of change is happening, and keeping a continuous rsync going to get the shortest possible time for an incremental sync.  If the continuous sync doesn't find a good time of day, we may have to ask people not to submit try builds for several hours in advance of the planned move time or somesuch.
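The continuous-sync approach might look something like this bounded, dry-run sketch (the real loop ran unbounded and actually invoked rsync; the paths and flags are assumptions):

```shell
#!/bin/sh
# Side-effect-free sketch of the continuous incremental sync: each pass
# is timed so that quiet windows show up in the log. The rsync command
# is echoed instead of executed, and SRC/DST are placeholder paths.
SRC=/mnt/eql/builds/firefox/
DST=/mnt/netapp/stage/firefox-new/
PASSES=3   # the real loop had no bound

i=0
while [ "$i" -lt "$PASSES" ]; do
    start=$(date +%s)
    echo "rsync -aH --delete $SRC $DST"   # the real pass runs this
    end=$(date +%s)
    echo "pass $((i + 1)) took $((end - start))s, finished $(date)"
    i=$((i + 1))
done
```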
Comment 25 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 01:34:34 PST
Most recent incremental sync took 7 hours to sync 22 hours' worth of data (coming straight off the one that took 22 hours to transfer 3 days' worth).
Comment 26 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 11:55:19 PST
timing on the "continuous run" passes over the last day or so:

time spent     completion time
------------   ----------------------------
 141m05.367s   Fri Jan 28 02:12:15 PST 2011
  70m44.413s   Fri Jan 28 03:23:00 PST 2011
  73m02.988s   Fri Jan 28 04:36:03 PST 2011
 168m53.443s   Fri Jan 28 07:24:56 PST 2011
 201m52.250s   Fri Jan 28 10:46:49 PST 2011
  52m55.436s   Fri Jan 28 11:39:44 PST 2011
Comment 27 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 22:06:25 PST
time spent     completion time
------------   ----------------------------
  51m19.049s   Fri Jan 28 12:31:04 PST 2011
  55m07.203s   Fri Jan 28 13:26:11 PST 2011
  50m41.688s   Fri Jan 28 14:16:53 PST 2011
  54m07.810s   Fri Jan 28 15:11:01 PST 2011
  49m55.261s   Fri Jan 28 16:00:56 PST 2011
  45m57.048s   Fri Jan 28 16:46:53 PST 2011
  44m50.918s   Fri Jan 28 17:31:44 PST 2011
  43m10.745s   Fri Jan 28 18:14:55 PST 2011
  48m11.283s   Fri Jan 28 19:03:07 PST 2011
  49m37.833s   Fri Jan 28 19:52:45 PST 2011
  46m47.324s   Fri Jan 28 20:39:32 PST 2011
  40m41.071s   Fri Jan 28 21:20:13 PST 2011
  43m31.812s   Fri Jan 28 22:03:45 PST 2011
Comment 28 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 22:08:29 PST
If today was a representative day, then it looks like the best time to do this is sometime between 4p and 9pm pacific, and the midnight to 11am block should be avoided at all costs.
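One quick way to pull the quiet windows out of tables like the ones above is to sort the passes by duration and look at where the short ones complete; the here-doc carries three sample rows, not the full dataset:

```shell
#!/bin/sh
# Rank passes by duration (shortest first); the completion timestamps
# of the short passes cluster in the quiet windows. sort -n keys on the
# leading minutes figure. Sample rows only, copied from the table above.
sort -n <<'EOF' | head -n 2
141m05.367s Fri Jan 28 02:12:15 PST 2011
52m55.436s Fri Jan 28 11:39:44 PST 2011
40m41.071s Fri Jan 28 21:20:13 PST 2011
EOF
```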
Comment 29 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 22:09:25 PST
And our downtime is going to be about an hour.
Comment 30 matthew zeier [:mrz] 2011-01-30 21:46:23 PST
zandr, when's good to get this scheduled?
Comment 31 Zandr Milewski [:zandr] 2011-01-31 07:40:20 PST
(In reply to comment #30)
> zandr, when's good to get this scheduled?

Based on comment 28, this looks like a good fit for the usual Tuesday 7pm PST window. Will socialize that today so we can announce by EOD.
Comment 32 Phil Ringnalda (:philor) 2011-01-31 08:03:27 PST
If you're looking for a time when you can close the tryserver tree to get this done, note bug 630065 - you can't currently actually close it (though I guess maybe you could shut off the try buildmaster, so builds wouldn't happen even though pushes would continue).
Comment 33 Zandr Milewski [:zandr] 2011-01-31 09:43:23 PST
I'm still learning my way around the RelEng infra, so apologies if this is a dumb question:

Should bug 630065 block this downtime? Or is announcing the downtime and saying "I told you so" sufficient?
Comment 34 Ben Hearsum (:bhearsum) 2011-01-31 09:44:39 PST
We've had tons of downtimes without being able to truly close Try, I don't think we should block on that.
Comment 35 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-31 11:48:08 PST
(In reply to comment #23)
> Even after getting us back to > 100G yesterday, Nagios went off again:
> 11:56 <nagios> [47] surf:disk -
> /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING -
> free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 73232 MB (4%):
> 
> Elevated try load is part of this, and we'll probably gain some space on Monday
> when many of this weeks builds are archived to a different partition, but we'll
> certainly spike again next Thursday/Friday.


(In reply to comment #24)
> This is being hampered by the large number of tryserver builds getting
> submitted in the last week or so (around 100 per day!) since those are stored
> on the partition we're trying to move.  An rsync of 3 days' worth just
> completed and took over 22 hours to complete. I've got another rsync running
> now picking up that 22 hours' worth of changes.  Making this happen is going to
> require finding a time of day when the least amount of change is happening and
> getting a continuous rsync going trying to get the shortest time possible for
> an incremental sync.  If the continuous sync doesn't manage to find a good time
> of day for it we may have to do something like asking people not to submit try
> builds for several hours in advance of the planned move time or somesuch.

justdave, I agree there is heavy load on tryserver, but there is also heavy load on tm and m-c... all of which are posting builds under /pub/firefox. Unless I'm missing something, you will actually need to close *all* trees, not just TryServer.

Am I missing something?
Comment 36 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-31 13:32:53 PST
(In reply to comment #35)
> justdave, I agree there is heavy load on tryserver, but there are also heavy
> load on tm and m-c... all of which are posting builds under /pub/firefox.
> Unless I'm missing something, you will actually need to close *all* trees, not
> just TryServer. 
> 
> Am I missing something?

I don't remember implying anywhere that we wouldn't have to close all trees, or that only try server would need to be.

I'll have another couple days' worth of rsync timings (it's been running continuously all weekend) in a few minutes.
Comment 37 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-31 14:08:12 PST
Here's the timings picking up from where I left off in comment 27.

time spent     completion time
------------   ----------------------------
  55m07.203s   Fri Jan 28 13:26:11 PST 2011
  50m41.688s   Fri Jan 28 14:16:53 PST 2011
  54m07.810s   Fri Jan 28 15:11:01 PST 2011
  49m55.261s   Fri Jan 28 16:00:56 PST 2011
  45m57.048s   Fri Jan 28 16:46:53 PST 2011
  44m50.918s   Fri Jan 28 17:31:44 PST 2011
  43m10.745s   Fri Jan 28 18:14:55 PST 2011
  48m11.283s   Fri Jan 28 19:03:07 PST 2011
  49m37.833s   Fri Jan 28 19:52:45 PST 2011
  46m47.324s   Fri Jan 28 20:39:32 PST 2011
  40m41.071s   Fri Jan 28 21:20:13 PST 2011
  43m31.812s   Fri Jan 28 22:03:45 PST 2011
  41m12.234s   Fri Jan 28 22:44:58 PST 2011
  42m10.788s   Fri Jan 28 23:29:58 PST 2011
  45m28.571s   Sat Jan 29 00:15:27 PST 2011
  47m43.769s   Sat Jan 29 01:03:11 PST 2011
  51m44.002s   Sat Jan 29 01:54:55 PST 2011
  48m30.270s   Sat Jan 29 02:43:25 PST 2011
  47m19.974s   Sat Jan 29 03:30:45 PST 2011
  78m19.010s   Sat Jan 29 04:49:04 PST 2011
 147m25.606s   Sat Jan 29 07:16:29 PST 2011
 174m11.984s   Sat Jan 29 10:10:41 PST 2011
  49m43.413s   Sat Jan 29 11:00:25 PST 2011
  44m13.218s   Sat Jan 29 11:44:38 PST 2011
  44m31.266s   Sat Jan 29 12:29:09 PST 2011
  46m37.220s   Sat Jan 29 13:15:47 PST 2011
  45m43.052s   Sat Jan 29 14:01:30 PST 2011
  44m22.308s   Sat Jan 29 14:45:52 PST 2011
  46m33.268s   Sat Jan 29 15:32:25 PST 2011
  43m45.723s   Sat Jan 29 16:16:11 PST 2011
  44m19.970s   Sat Jan 29 17:00:31 PST 2011
  43m49.773s   Sat Jan 29 17:44:21 PST 2011
  42m59.756s   Sat Jan 29 18:27:21 PST 2011
  43m22.887s   Sat Jan 29 19:10:44 PST 2011
  42m27.148s   Sat Jan 29 19:53:11 PST 2011
  40m55.606s   Sat Jan 29 20:34:06 PST 2011
  42m25.584s   Sat Jan 29 21:16:32 PST 2011
  39m36.874s   Sat Jan 29 21:56:09 PST 2011
  37m14.723s   Sat Jan 29 22:33:24 PST 2011
  37m37.328s   Sat Jan 29 23:11:01 PST 2011
  38m45.227s   Sat Jan 29 23:49:46 PST 2011
  43m01.181s   Sun Jan 30 00:32:47 PST 2011
  43m52.116s   Sun Jan 30 01:16:39 PST 2011
  48m28.173s   Sun Jan 30 02:05:08 PST 2011
  44m39.384s   Sun Jan 30 02:49:47 PST 2011
  48m05.095s   Sun Jan 30 03:37:52 PST 2011
  72m21.407s   Sun Jan 30 04:50:14 PST 2011
 144m43.013s   Sun Jan 30 07:14:57 PST 2011
 149m49.792s   Sun Jan 30 09:44:46 PST 2011
  45m24.427s   Sun Jan 30 10:30:11 PST 2011
  44m39.833s   Sun Jan 30 11:14:51 PST 2011
  42m42.739s   Sun Jan 30 11:57:34 PST 2011
  43m49.922s   Sun Jan 30 12:41:23 PST 2011
  43m28.116s   Sun Jan 30 13:24:52 PST 2011
  40m53.771s   Sun Jan 30 14:05:45 PST 2011
  39m38.953s   Sun Jan 30 14:45:24 PST 2011
  39m30.793s   Sun Jan 30 15:24:55 PST 2011
  39m24.620s   Sun Jan 30 16:04:20 PST 2011
  41m37.206s   Sun Jan 30 16:45:57 PST 2011
  44m25.262s   Sun Jan 30 17:30:22 PST 2011
  40m42.905s   Sun Jan 30 18:11:05 PST 2011
  40m49.902s   Sun Jan 30 18:51:55 PST 2011
  40m38.700s   Sun Jan 30 19:32:34 PST 2011
  40m24.559s   Sun Jan 30 20:12:58 PST 2011
  41m15.890s   Sun Jan 30 20:54:14 PST 2011
  38m23.992s   Sun Jan 30 21:32:38 PST 2011
  40m51.554s   Sun Jan 30 22:13:30 PST 2011
  39m13.943s   Sun Jan 30 22:52:44 PST 2011
  40m31.389s   Sun Jan 30 23:33:15 PST 2011
  42m41.015s   Mon Jan 31 00:15:56 PST 2011
  43m56.014s   Mon Jan 31 00:59:52 PST 2011
  44m22.505s   Mon Jan 31 01:44:15 PST 2011
  39m27.340s   Mon Jan 31 02:23:42 PST 2011
  40m51.176s   Mon Jan 31 03:04:33 PST 2011
  46m21.102s   Mon Jan 31 03:50:55 PST 2011
  114m4.194s   Mon Jan 31 05:44:59 PST 2011
 243m16.968s   Mon Jan 31 09:48:16 PST 2011
 194m47.532s   Mon Jan 31 13:03:03 PST 2011
Comment 38 Zandr Milewski [:zandr] 2011-01-31 14:25:56 PST
This is consistent with our current theory, which is that the nightlies create a huge amount of stuff to move, and it takes hours to push through.
Comment 39 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-31 14:37:01 PST
1) justdave, thanks - this is great data.

2) from releng+zandr meeting : this means we can do this downtime anytime *except* the few hours after the nightlies are created. 

3) We're proposing doing the downtime from 6-9am PST as this is lowest checkin load, so least disruptive to developers. Open question is:
3a) should we trigger the nightlies earlier (say 1am PST?), so that the rsync would be handled well in advance of the downtime? OR
3b) delay triggering the nightlies until after the downtime is over (say 9am PST)? This would mean handling nightly build+l10n load at the same time as developers start usual checkin load, so (3b) feels less optimal to me. Therefore I propose we do (3a). 

Any comments, thoughts before we cast this in stone?
Comment 40 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-31 21:43:26 PST
From the data we have so far it looks like we need to wait until at least 11am if you want the downtime to be less than an hour, that or trigger the nightlies that much earlier or after we're done.
Comment 41 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-31 21:46:06 PST
And that's on a weekend.  The weekday data we have so far (Monday) seems to imply that 4am to 1pm is off limits.  Note that the times listed on that chart are when the rsync completed.  It's running in a loop, so the end time of the previous pass is the start time of the one whose length of time is listed there (within a few seconds)
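Since each pass starts when the previous one ends, a pass's start time can be recovered by subtracting its duration from its completion time. For example, using the Monday rows from comment 37:

```shell
#!/bin/sh
# Recover a pass's start time from its duration and completion time.
# The 243m pass completed at 09:48 (Mon Jan 31), so it started around
# 09:48 - 243m = 05:45, matching the previous pass's 05:44:59
# completion to within a minute.
end_min=$(( 9 * 60 + 48 ))   # completion 09:48 as minutes past midnight
dur_min=243                  # pass duration, rounded to whole minutes
start_min=$(( end_min - dur_min ))
printf '%02d:%02d\n' $(( start_min / 60 )) $(( start_min % 60 ))
```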
Comment 42 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-02-01 11:02:35 PST
(In reply to comment #40)
> From the data we have so far it looks like we need to wait until at least 11am
> if you want the downtime to be less than an hour, that or trigger the nightlies
> that much earlier or after we're done.

Justdave: We're totally fine with triggering nightlies earlier/later for that one day (thursday), just to make this ftp-sync and final switchover happen. Given that, could we do this during the Thursday morning downtime?
Comment 44 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-02 16:26:52 PST
here's the procedure I'm planning on:

1) At 6:00am PDT, I put in an /etc/nologin file on surf to prevent ssh/scp/rsync-over-ssh connections from coming in.
2) The continuous rsync loop already in progress will be allowed to complete.
3) An additional loop will be allowed to complete with the entire loop happening while no one can upload.  This will ensure that every last bit of the data has been copied over.
4) httpd, vsftpd, and xinetd(rsync) will be shut down on all of surf, dm-ftp01, and dm-download02, to remove all readers and allow me to unmount the partitions.
5) The NFS mounts to dm-ftp01 will be dropped from surf and dm-download02 (FREAKING HURRAY!!!!)
6) Both partitions will be unmounted from dm-ftp01
7) the new partition will be mounted in place of the dm-ftp01 NFS mount on all three servers (in the case of dm-ftp01 this is in place of the iscsi mount)
8) the old partition will be mounted on surf in a separate out-of-tree mount point to allow any last minute cleanup or retrieval of missing items we didn't catch in step 3 (really unlikely, but better safe than sorry).
9) httpd, vsftpd, and xinetd(rsync) will be re-enabled on all servers where applicable
10) /etc/nologin will be removed from surf to allow uploads again.
11) profit!
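The steps above, expressed as a dry-run sketch. Every action is echoed rather than executed, and the service names, hosts, and paths are illustrative guesses, not the real production values:

```shell
#!/bin/sh
# Dry-run sketch of the cutover in comment 44. Nothing is executed:
# run() just prints what the real procedure would do.
run() { echo "WOULD RUN: $*"; }

run touch /etc/nologin                   # 1) block ssh/scp/rsync-over-ssh uploads
run wait-for-current-rsync-pass          # 2) let the in-flight pass complete
run run-one-final-rsync-pass             # 3) one full pass with writers blocked
for svc in httpd vsftpd xinetd; do       # 4) stop all readers on every host
    run service "$svc" stop
done
run umount /mnt/dm-ftp01                 # 5) drop NFS mounts from surf/dm-download02
run umount /pub/mozilla.org/firefox      # 6) unmount both partitions on dm-ftp01
run mount new-array /pub/mozilla.org/firefox   # 7) mount the new partition everywhere
run mount old-array /mnt/old-firefox     # 8) keep the old partition reachable out of tree
for svc in httpd vsftpd xinetd; do       # 9) bring services back up
    run service "$svc" start
done
run rm /etc/nologin                      # 10) allow uploads again
```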
Comment 45 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-02 16:29:14 PST
11) permanently disable nfsd on dm-ftp01 :)
Comment 46 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-02-02 16:34:20 PST
(In reply to comment #42)
> (In reply to comment #40)
> > From the data we have so far it looks like we need to wait until at least 11am
> > if you want the downtime to be less than an hour, that or trigger the nightlies
> > that much earlier or after we're done.
> 
> Justdave: We're totally fine with triggering nightlies earlier/later for that
> one day (thursday), just to make this ftp-sync and final switchover happen.
> Given that, could we do this during the Thursday morning downtime?

Nightly scheduler now tweaked to fire after tonight's downtime is over. This should help reduce the amount of rsync-ing needed.
Comment 47 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-03 07:55:41 PST
ok, completed through step 10.  Disabling nfsd on dm-ftp01 will need to wait until we decide we're done with the old mount (which is at /mnt/eql/builds on surf and dm-ftp01)
Comment 48 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-03 08:04:48 PST
the reverse-proxies to dm-ftp01 from dm-download02 and surf for the /pub/mozilla.org/firefox directory have been removed (serving locally off the netapp nfs mount now)
Comment 49 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-03 08:13:04 PST
ok, going to call this done.
Comment 50 Aki Sasaki [:aki] back dec19 2011-02-07 13:27:10 PST
This appears wrong:

[ffxbld@surf firefox]$ df -h . tinderbox-builds tryserver-builds
Filesystem            Size  Used Avail Use% Mounted on
10.253.0.11:/vol/stage
                      3.1T  1.8T  1.3T  58% /mnt/netapp/stage/archive.mozilla.org/pub/firefox
/mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds
                       17T  807G   16T   5% /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tinderbox-builds
10.253.0.11:/vol/stage
                      3.1T  1.8T  1.3T  58% /mnt/netapp/stage/archive.mozilla.org/pub/firefox

John O'Duinn wanted to have tinderbox-builds mounted on HA disk, and moved to tinderbox-builds/old (on 16T non-HA disk) after 14-20 days.  To me that means tinderbox-builds should be mounting /vol/stage, and cm-ixstore01 should be mounted on tinderbox-builds/old.

The reason for this is that cm-ixstore01 is, aiui, on a single head and if we lose that, we're looking at a day of downtime and a burning tree.
Comment 51 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-07 22:22:26 PST
That's the way it's been since those got set up, I didn't touch those mount points (other than unmounting them around the swap of the other two).  The setup of those was on a different bug (I don't know which one, I didn't do it).  I'd suggest reopening that one if you can find it, or filing a new bug.

Note You need to log in before you can comment on or make changes to this bug.