Bug 614786 - Rotate ftp staging site to new disk array
Status: RESOLVED FIXED
[downtime 3 of 3 on Thu 2/3 6am PDT]
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Hardware: x86 Linux
Importance: -- major
Target Milestone: ---
Assigned To: Dave Miller [:justdave] (justdave@bugzilla.org)
QA Contact: matthew zeier [:mrz]
Duplicates: 629064
Depends on: 617626
Blocks: 601025 625979 630538 630541
Reported: 2010-11-25 03:10 PST by Dave Miller [:justdave] (justdave@bugzilla.org)
Modified: 2015-03-12 08:17 PDT
CC: 15 users
justdave: needs-downtime+
bhearsum: needs-treeclosure+

Description Dave Miller [:justdave] (justdave@bugzilla.org) 2010-11-25 03:10:19 PST
Here's the current situation:

Available disk arrays:
    Filesystem            Size  Used Avail Use%
[A] Netapp #1             3.1T  2.7T  408G  87%
[B] EQL via NFS           2.0T  1.8T  247G  88%
[C] Netapp #2             4.8T  2.8T  2.0T  59%

Currently mounted as:
[A] /pub/mozilla.org
[B] /pub/mozilla.org/firefox
[C] not yet in use
With 5.1 TB of total space available

The plan:
[A] /pub/mozilla.org/firefox
[B] going away
[C] /pub/mozilla.org
With 7.9 TB of total space available

The original plan was to recombine everything onto [C], but since we're already using 4.5 TB (against the 4.8 TB available on the new drive), I think it makes more sense to keep the old netapp array in the mix and eliminate the iscsi-over-NFS hack.  Doing it this way eliminates the performance issues of the iscsi-over-NFS setup we're currently using, and adds 1.8 TB of disk capacity instead of removing 0.3 TB.

Doing this move is going to require TWO downtimes.
1) Move [A] to [C]
2) Move [B] to [A]

We obviously can't do #2 until #1 is done, and there will be additional prep required between the two steps.

An initial sync of [A] to [C] has already been completed (as evidenced by the disk usage in the table at the top).  Incremental syncs have been tested to take approximately 70 minutes per run.  To ensure no data loss, we'll need to make sure nobody can write to the disk during the final sync, before remounting the drives in their swapped positions.  I would recommend advertising a 2-hour outage for this; we'll probably have it up and running again well before that.  Our technology has improved: the last time we did this (with only 1.5 TB of data at the time) it took over 6 hours. :)

The amount of time required for the [B]-to-[A] move is unknown, and there's no way to test it until [A] is freed up by the first move.  I suspect it'll take longer, despite the smaller dataset, because of the NFS-via-Linux step in the middle getting the data off the old drive; but that's only a theory until it's actually tested.
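The final-sync flow described above can be sketched roughly as follows; the mount points and rsync flags here are illustrative assumptions, not the production values (the real command is echoed rather than executed so the sketch is side-effect free):

```shell
#!/bin/sh
# Hedged sketch of the step-1 final sync. SRC/DST are hypothetical
# mount points standing in for [A] and [C]; the real paths differ.
SRC=/mnt/netapp1/stage/   # [A] old netapp array (placeholder)
DST=/mnt/netapp2/stage/   # [C] new netapp array (placeholder)

final_sync() {
    # Writers are blocked first (comment 44 uses /etc/nologin for that),
    # then one last incremental pass; --delete keeps the copy exact.
    echo "rsync -aH --delete $SRC $DST"
}

final_sync
```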
Comment 1 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-11-25 03:13:36 PST
A graphical diagram of the current setup is available at http://people.mozilla.org/~justdave/MirrorNetwork.pdf
Comment 2 Ben Hearsum (:bhearsum) 2010-11-25 12:24:10 PST
Step 1 sounds like it can be done in whatever the next downtime is. I'm on buildduty most of next week, we can probably figure something out. Could Step 2 wait until the holidays, when it's easier to get longer downtime windows?
Comment 3 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-11-25 12:28:20 PST
Sure.  We'll probably want to wait until we get a trial run on the incremental syncs for step 2 before deciding how long to wait.  It may surprise us and go faster for all we know.  Then again, it may not.  :)
Comment 4 Nick Thomas [:nthomas] 2010-11-25 13:58:35 PST
Sounds like a good plan to me, with the added advantage that the netapp partitions can be resized (storage permitting).

(In reply to comment #0)
> The amount of time required for the [B]-to-[A] move is unknown, and there's no
> way to test it until [A] is freed up by the first move.  I suspect it'll take
> longer, despite the smaller dataset, because of the NFS-via-Linux step in the
> middle getting the data off the old drive; but that's only a theory until it's
> actually tested.

No doubt you already thought of doing the syncs on dm-ftp01.m.o to avoid the extra NFS hop.
Comment 5 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-11-26 09:48:12 PST
(In reply to comment #4)
> No doubt you already thought of doing the syncs on dm-ftp01.m.o to avoid the
> extra NFS hop.

Actually, the thought had slipped my mind, but that's a good idea.  We'll have to fix the ACLs to allow us to mount it read/write over there (it's read-only currently), but that's certainly doable.
Comment 6 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-12-08 12:39:35 PST
justdave: could this be done on 17th? zandr will be doing a tree-closing downtime in bug#616658 that day, so it would be great to do this at the same time.
Comment 7 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-12-08 15:05:40 PST
Depends on the time of day.  I'll be doing my RHEL6 recertification exam for my RHCE that day, 9am to 4:30pm Central time, and given that it's downtown Chicago, I'd allow at least 90 minutes travel time to get back to my sister's place and get online afterwards.  So I guess if we're talking after 5pm pacific it'd probably work.
Comment 8 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-12-14 15:47:53 PST
(In reply to comment #6)
> justdave: could this be done on 17th? zandr will be doing a tree-closing
> downtime in bug#616658 that day, so it would be great to do this at the same
> time.

(In reply to comment #7)
> Depends on the time of day.  I'll be doing my RHEL6 recertification exam for my
> RHCE that day, 9am to 4:30pm Central time, and given that it's downtown
> Chicago, I'd allow at least 90 minutes travel time to get back to my sister's
> place and get online afterwards.  So I guess if we're talking after 5pm pacific
> it'd probably work.

Per zandr, the downtime will be from 8am to 5pm (Pacific), but that includes time for spinning back up systems after the recabling work is finished. 

On the 17th, your window would be from 8am to 2pm (Pacific). Does that work for you? If not, is there someone else in IT who can do this on your behalf?
Comment 9 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-12-14 23:21:45 PST
10am pacific might work.  I've actually got two separate exams, one is 9:00 to 11:30 central, the other 2:00 to 4:30 central, so other than grabbing lunch, I'll basically be sitting around doing nothing for 2.5 hours between the two exams.  With the estimated runtime for the switch being 90 minutes that'll probably be enough time to do it.
Comment 10 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-12-15 12:33:09 PST
*** Bug 617626 has been marked as a duplicate of this bug. ***
Comment 11 Ben Hearsum (:bhearsum) 2010-12-17 09:36:28 PST
Happening today.
Comment 12 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-12-18 10:30:52 PST
Part one happened yesterday.

Timing for part 2 will depend on figuring out how long it'll take to sync the filesystems.  I expect the initial sync to take a day or two, and the followup syncs will determine how long of an outage we need.

Is RelEng happy with the state of stage right now? (data integrity I mean).  The next step is to wipe out the contents of the array we just vacated in prep for copying the firefox stuff into it, and I want to make sure we don't need it for a data reversion or something first.
Comment 13 Ben Hearsum (:bhearsum) 2010-12-22 13:34:42 PST
(In reply to comment #12)
> Part one happened yesterday.
> 
> Part 2 will depend on timing figuring out how long it'll take to sync the
> filesystems.  I expect the initial sync to take a day or two, and the followup
> syncs will determine how long of an outage we need.
> 
> Is RelEng happy with the state of stage right now? (data integrity I mean). 
> The next step is to wipe out the contents of the array we just vacated in prep
> for copying the firefox stuff into it, and I want to make sure we don't need it
> for a data reversion or something first.

Per IRC, we're happy with things and haven't seen any issues. Go ahead.
Comment 14 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-10 11:34:43 PST
ok, so to cleanly copy this stuff over to the new partition, I need to remove a couple of the bind mounts on dm-ftp01.  This *shouldn't* affect anything visible to production, but it depends on the order the mounts were initially set up, and there's a really slim chance that the tryserver and tinderbox directories might briefly disappear.

> * 10.253.0.139:/data/try-builds on /mnt/cm-ixstore01/try-builds type nfs (rw,noatime,rsize=32768,wsize=32768,nfsvers=3,proto=tcp,addr=10.253.0.139)

> * 10.253.0.139:/data/tinderbox-builds on /mnt/cm-ixstore01/tinderbox-builds type nfs (rw,noatime,rsize=32768,wsize=32768,nfsvers=3,proto=tcp,addr=10.253.0.139)

> * /mnt/eql/builds/firefox on /mnt/netapp/stage/archive.mozilla.org/pub/firefox type bind (ro,bind,_netdev)

> X /mnt/cm-ixstore01/try-builds/trybuilds on /mnt/eql/builds/firefox/tryserver-builds/old type none (rw,bind)

> * /mnt/cm-ixstore01/try-builds/trybuilds on /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tryserver-builds/old type bind (ro,bind,_netdev)

> X /mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds on /mnt/eql/builds/firefox/tinderbox-builds type bind (ro,bind,_netdev)

> * /mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds on /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tinderbox-builds type bind (ro,bind,_netdev)

The two with the X in front are the two I need to get rid of.  The ones under /mnt/netapp/stage are the ones that are visible on stage.m.o and ftp.m.o.  *IF* the ixstore mounts were mounted into eql before eql was mounted into netapp, *THEN* there's a chance that those directories will disappear from netapp when I unmount them from eql, which will require the netapp versions of those bind mounts to be unmounted and remounted.  If they were mounted afterwards, then they won't disappear and the production directories won't be affected.

Just to be safe, we're scheduling a downtime to do the unmounts.  This is tentatively Wed Jan 12 during EST AM.
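The mount-order question above can be checked ahead of time, since /proc/mounts lists mounts in the order they were established. A rough sketch, using a made-up two-line sample instead of the real dm-ftp01 output:

```shell
#!/bin/sh
# Predict whether unmounting the eql bind mounts will cascade into the
# netapp view, based on which mount was established first. The list
# below is an illustrative sample, not real /proc/mounts contents.
mounts='/mnt/eql/builds/firefox
/mnt/cm-ixstore01/try-builds/trybuilds'

first=$(printf '%s\n' "$mounts" | head -n 1)
case $first in
    /mnt/cm-ixstore01/*) echo "ixstore mounted before the eql bind: expect the netapp copies to vanish" ;;
    *)                   echo "ixstore mounted after the eql bind: production paths should be unaffected" ;;
esac
```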
Comment 15 Mike Taylor [:bear] 2011-01-10 13:24:57 PST
We may not be able to hit this downtime: even though we have all our ducks in a row, we still have to run this completely up the chain-of-command flagpole.

So, started that process just now and have tossed the ball to zandr since he can better coordinate with IT - you guys let me know when this gets scheduled.
Comment 16 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-12 06:35:32 PST
OK, the process in step 14 has been completed.  Turns out they were mounted in the correct order, so we did *not* wind up having any downtime on the production paths, and we could have gotten away with not shutting everything down after all.  Better safe than sorry though, since there wasn't any guarantee in advance.

Next step is the final cutover, timing on that will depend on how long an incremental rsync between the two partitions takes, which will probably take me a couple days to determine.
Comment 17 Ben Hearsum (:bhearsum) 2011-01-12 08:43:58 PST
There's some fallout from this morning, http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/ is empty.
Comment 18 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-12 09:28:25 PST
surf proxies it to dm-ftp01, which for some reason had the httpd docroot pointed at the mount points we removed instead of the supposed-to-be-public-facing ones.  Changed httpd to point at the correct ones; works now (as of 08:47).
Comment 19 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-14 12:22:34 PST
(In reply to comment #16)
> OK, the process in step 14 has been completed.  Turns out they were mounted in
> the correct order, so we did *not* wind up having any downtime on the
> production paths, and we could have gotten away with not shutting everything
> down after all.  Better safe than sorry though, since there wasn't any
> guarantee in advance.
> 
> Next step is the final cutover, timing on that will depend on how long an
> incremental rsync between the two partitions takes, which will probably take me
> a couple days to determine.

Any ETA?
Comment 20 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-26 12:27:56 PST
*** Bug 629129 has been marked as a duplicate of this bug. ***
Comment 21 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-26 12:29:23 PST
(In reply to comment #19)
> (In reply to comment #16)
> > OK, the process in step 14 has been completed.  Turns out they were mounted in
> > the correct order, so we did *not* wind up having any downtime on the
> > production paths, and we could have gotten away with not shutting everything
> > down after all.  Better safe than sorry though, since there wasn't any
> > guarantee in advance.
> > 
> > Next step is the final cutover, timing on that will depend on how long an
> > incremental rsync between the two partitions takes, which will probably take me
> > a couple days to determine.
> 
> Any ETA?

justdave/zandr: Any ETA? 

Bumping priority based on comment in bug#629129:
"We've got a few alerts about this partition the past couple of weeks. Right now
we're sitting at about 95G (~5%) free. We're not going to last much longer with
this though, we increase use by many GB per day, for nightlies.

I know some people, Joduinn and justdave in particular, chatted about stage
disk space in the past 6 months, but other than some new mounts for older
try/dep builds, I don't know what came out of it.

In any case, this will require action in the near future."
Comment 22 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-27 00:10:58 PST
*** Bug 629064 has been marked as a duplicate of this bug. ***
Comment 23 Ben Hearsum (:bhearsum) 2011-01-27 09:18:10 PST
Even after getting us back to > 100G yesterday, Nagios went off again:
11:56 <nagios> [47] surf:disk - /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING - free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 73232 MB (4%):

Elevated try load is part of this, and we'll probably gain some space on Monday when many of this week's builds are archived to a different partition, but we'll certainly spike again next Thursday/Friday.
Comment 24 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-27 10:05:37 PST
This is being hampered by the large number of tryserver builds getting submitted in the last week or so (around 100 per day!), since those are stored on the partition we're trying to move.  An rsync of 3 days' worth just finished and took over 22 hours.  I've got another rsync running now picking up that 22 hours' worth of changes.  Making this happen is going to require finding a time of day when the least amount of change is happening, and keeping a continuous rsync going to get the shortest possible time for an incremental sync.  If the continuous sync doesn't find a good time of day, we may have to ask people not to submit try builds for several hours in advance of the planned move time or somesuch.
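The continuous-sync approach might look something like this bounded, dry-run sketch (the real loop ran unbounded and actually invoked rsync; the paths and flags are assumptions):

```shell
#!/bin/sh
# Side-effect-free sketch of the continuous incremental sync: each pass
# is timed so that quiet windows show up in the log. The rsync command
# is echoed instead of executed, and SRC/DST are placeholder paths.
SRC=/mnt/eql/builds/firefox/
DST=/mnt/netapp/stage/firefox-new/
PASSES=3   # the real loop had no bound

i=0
while [ "$i" -lt "$PASSES" ]; do
    start=$(date +%s)
    echo "rsync -aH --delete $SRC $DST"   # the real pass runs this
    end=$(date +%s)
    echo "pass $((i + 1)) took $((end - start))s, finished $(date)"
    i=$((i + 1))
done
```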
Comment 25 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 01:34:34 PST
Most recent incremental sync took 7 hours to sync 22 hours' worth of data (coming straight off the one that took 22 hours to transfer 3 days' worth).
Comment 26 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 11:55:19 PST
timing on the "continuous run" passes over the last day or so:

time spent     completion time
------------   ----------------------------
 141m05.367s   Fri Jan 28 02:12:15 PST 2011
  70m44.413s   Fri Jan 28 03:23:00 PST 2011
  73m02.988s   Fri Jan 28 04:36:03 PST 2011
 168m53.443s   Fri Jan 28 07:24:56 PST 2011
 201m52.250s   Fri Jan 28 10:46:49 PST 2011
  52m55.436s   Fri Jan 28 11:39:44 PST 2011
Comment 27 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 22:06:25 PST
time spent     completion time
------------   ----------------------------
  51m19.049s   Fri Jan 28 12:31:04 PST 2011
  55m07.203s   Fri Jan 28 13:26:11 PST 2011
  50m41.688s   Fri Jan 28 14:16:53 PST 2011
  54m07.810s   Fri Jan 28 15:11:01 PST 2011
  49m55.261s   Fri Jan 28 16:00:56 PST 2011
  45m57.048s   Fri Jan 28 16:46:53 PST 2011
  44m50.918s   Fri Jan 28 17:31:44 PST 2011
  43m10.745s   Fri Jan 28 18:14:55 PST 2011
  48m11.283s   Fri Jan 28 19:03:07 PST 2011
  49m37.833s   Fri Jan 28 19:52:45 PST 2011
  46m47.324s   Fri Jan 28 20:39:32 PST 2011
  40m41.071s   Fri Jan 28 21:20:13 PST 2011
  43m31.812s   Fri Jan 28 22:03:45 PST 2011
Comment 28 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 22:08:29 PST
If today was a representative day, then it looks like the best time to do this is sometime between 4p and 9pm pacific, and the midnight to 11am block should be avoided at all costs.
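One quick way to pull the quiet windows out of tables like the ones above is to sort the passes by duration and look at where the short ones complete; the here-doc carries three sample rows, not the full dataset:

```shell
#!/bin/sh
# Rank passes by duration (shortest first); the completion timestamps
# of the short passes cluster in the quiet windows. sort -n keys on the
# leading minutes figure. Sample rows only, copied from the table above.
sort -n <<'EOF' | head -n 2
141m05.367s Fri Jan 28 02:12:15 PST 2011
52m55.436s Fri Jan 28 11:39:44 PST 2011
40m41.071s Fri Jan 28 21:20:13 PST 2011
EOF
```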
Comment 29 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-28 22:09:25 PST
And our downtime is going to be about an hour.
Comment 30 matthew zeier [:mrz] 2011-01-30 21:46:23 PST
zandr, when's good to get this scheduled?
Comment 31 Zandr Milewski [:zandr] 2011-01-31 07:40:20 PST
(In reply to comment #30)
> zandr, when's good to get this scheduled?

Based on comment 28, this looks like a good fit for the usual Tuesday 7pm PST window. Will socialize that today so we can announce by EOD.
Comment 32 Phil Ringnalda (:philor) 2011-01-31 08:03:27 PST
If you're looking for a time when you can close the tryserver tree to get this done, note bug 630065 - you can't currently actually close it (though I guess maybe you could shut off the try buildmaster, so builds wouldn't happen even though pushes would continue).
Comment 33 Zandr Milewski [:zandr] 2011-01-31 09:43:23 PST
I'm still learning my way around the RelEng infra, so apologies if this is a dumb question:

Should bug 630065 block this downtime? Or is announcing the downtime and saying "I told you so" sufficient?
Comment 34 Ben Hearsum (:bhearsum) 2011-01-31 09:44:39 PST
We've had tons of downtimes without being able to truly close Try, I don't think we should block on that.
Comment 35 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-31 11:48:08 PST
(In reply to comment #23)
> Even after getting us back to > 100G yesterday, Nagios went off again:
> 11:56 <nagios> [47] surf:disk -
> /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING -
> free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 73232 MB (4%):
> 
> Elevated try load is part of this, and we'll probably gain some space on Monday
> when many of this weeks builds are archived to a different partition, but we'll
> certainly spike again next Thursday/Friday.


(In reply to comment #24)
> This is being hampered by the large number of tryserver builds getting
> submitted in the last week or so (around 100 per day!) since those are stored
> on the partition we're trying to move.  An rsync of 3 days' worth just
> completed and took over 22 hours to complete. I've got another rsync running
> now picking up that 22 hours' worth of changes.  Making this happen is going to
> require finding a time of day when the least amount of change is happening and
> getting a continuous rsync going trying to get the shortest time possible for
> an incremental sync.  If the continuous sync doesn't manage to find a good time
> of day for it we may have to do something like asking people not to submit try
> builds for several hours in advance of the planned move time or somesuch.

justdave, I agree there is heavy load on tryserver, but there is also heavy load on tm and m-c... all of which are posting builds under /pub/firefox. Unless I'm missing something, you will actually need to close *all* trees, not just TryServer.

Am I missing something?
Comment 36 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-31 13:32:53 PST
(In reply to comment #35)
> justdave, I agree there is heavy load on tryserver, but there are also heavy
> load on tm and m-c... all of which are posting builds under /pub/firefox.
> Unless I'm missing something, you will actually need to close *all* trees, not
> just TryServer. 
> 
> Am I missing something?

I don't remember implying anywhere that we wouldn't have to close all trees, or that only try server would need to be.

I'll have another couple days' worth of rsync timings (it's been running continuously all weekend) in a few minutes.
Comment 37 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-31 14:08:12 PST
Here's the timings picking up from where I left off in comment 27.

time spent     completion time
------------   ----------------------------
  55m07.203s   Fri Jan 28 13:26:11 PST 2011
  50m41.688s   Fri Jan 28 14:16:53 PST 2011
  54m07.810s   Fri Jan 28 15:11:01 PST 2011
  49m55.261s   Fri Jan 28 16:00:56 PST 2011
  45m57.048s   Fri Jan 28 16:46:53 PST 2011
  44m50.918s   Fri Jan 28 17:31:44 PST 2011
  43m10.745s   Fri Jan 28 18:14:55 PST 2011
  48m11.283s   Fri Jan 28 19:03:07 PST 2011
  49m37.833s   Fri Jan 28 19:52:45 PST 2011
  46m47.324s   Fri Jan 28 20:39:32 PST 2011
  40m41.071s   Fri Jan 28 21:20:13 PST 2011
  43m31.812s   Fri Jan 28 22:03:45 PST 2011
  41m12.234s   Fri Jan 28 22:44:58 PST 2011
  42m10.788s   Fri Jan 28 23:29:58 PST 2011
  45m28.571s   Sat Jan 29 00:15:27 PST 2011
  47m43.769s   Sat Jan 29 01:03:11 PST 2011
  51m44.002s   Sat Jan 29 01:54:55 PST 2011
  48m30.270s   Sat Jan 29 02:43:25 PST 2011
  47m19.974s   Sat Jan 29 03:30:45 PST 2011
  78m19.010s   Sat Jan 29 04:49:04 PST 2011
 147m25.606s   Sat Jan 29 07:16:29 PST 2011
 174m11.984s   Sat Jan 29 10:10:41 PST 2011
  49m43.413s   Sat Jan 29 11:00:25 PST 2011
  44m13.218s   Sat Jan 29 11:44:38 PST 2011
  44m31.266s   Sat Jan 29 12:29:09 PST 2011
  46m37.220s   Sat Jan 29 13:15:47 PST 2011
  45m43.052s   Sat Jan 29 14:01:30 PST 2011
  44m22.308s   Sat Jan 29 14:45:52 PST 2011
  46m33.268s   Sat Jan 29 15:32:25 PST 2011
  43m45.723s   Sat Jan 29 16:16:11 PST 2011
  44m19.970s   Sat Jan 29 17:00:31 PST 2011
  43m49.773s   Sat Jan 29 17:44:21 PST 2011
  42m59.756s   Sat Jan 29 18:27:21 PST 2011
  43m22.887s   Sat Jan 29 19:10:44 PST 2011
  42m27.148s   Sat Jan 29 19:53:11 PST 2011
  40m55.606s   Sat Jan 29 20:34:06 PST 2011
  42m25.584s   Sat Jan 29 21:16:32 PST 2011
  39m36.874s   Sat Jan 29 21:56:09 PST 2011
  37m14.723s   Sat Jan 29 22:33:24 PST 2011
  37m37.328s   Sat Jan 29 23:11:01 PST 2011
  38m45.227s   Sat Jan 29 23:49:46 PST 2011
  43m01.181s   Sun Jan 30 00:32:47 PST 2011
  43m52.116s   Sun Jan 30 01:16:39 PST 2011
  48m28.173s   Sun Jan 30 02:05:08 PST 2011
  44m39.384s   Sun Jan 30 02:49:47 PST 2011
  48m05.095s   Sun Jan 30 03:37:52 PST 2011
  72m21.407s   Sun Jan 30 04:50:14 PST 2011
 144m43.013s   Sun Jan 30 07:14:57 PST 2011
 149m49.792s   Sun Jan 30 09:44:46 PST 2011
  45m24.427s   Sun Jan 30 10:30:11 PST 2011
  44m39.833s   Sun Jan 30 11:14:51 PST 2011
  42m42.739s   Sun Jan 30 11:57:34 PST 2011
  43m49.922s   Sun Jan 30 12:41:23 PST 2011
  43m28.116s   Sun Jan 30 13:24:52 PST 2011
  40m53.771s   Sun Jan 30 14:05:45 PST 2011
  39m38.953s   Sun Jan 30 14:45:24 PST 2011
  39m30.793s   Sun Jan 30 15:24:55 PST 2011
  39m24.620s   Sun Jan 30 16:04:20 PST 2011
  41m37.206s   Sun Jan 30 16:45:57 PST 2011
  44m25.262s   Sun Jan 30 17:30:22 PST 2011
  40m42.905s   Sun Jan 30 18:11:05 PST 2011
  40m49.902s   Sun Jan 30 18:51:55 PST 2011
  40m38.700s   Sun Jan 30 19:32:34 PST 2011
  40m24.559s   Sun Jan 30 20:12:58 PST 2011
  41m15.890s   Sun Jan 30 20:54:14 PST 2011
  38m23.992s   Sun Jan 30 21:32:38 PST 2011
  40m51.554s   Sun Jan 30 22:13:30 PST 2011
  39m13.943s   Sun Jan 30 22:52:44 PST 2011
  40m31.389s   Sun Jan 30 23:33:15 PST 2011
  42m41.015s   Mon Jan 31 00:15:56 PST 2011
  43m56.014s   Mon Jan 31 00:59:52 PST 2011
  44m22.505s   Mon Jan 31 01:44:15 PST 2011
  39m27.340s   Mon Jan 31 02:23:42 PST 2011
  40m51.176s   Mon Jan 31 03:04:33 PST 2011
  46m21.102s   Mon Jan 31 03:50:55 PST 2011
  114m4.194s   Mon Jan 31 05:44:59 PST 2011
 243m16.968s   Mon Jan 31 09:48:16 PST 2011
 194m47.532s   Mon Jan 31 13:03:03 PST 2011
Comment 38 Zandr Milewski [:zandr] 2011-01-31 14:25:56 PST
This is consistent with our current theory, which is that the nightlies create a huge amount of stuff to move, and it takes hours to push through.
Comment 39 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-01-31 14:37:01 PST
1) justdave, thanks - this is great data.

2) from releng+zandr meeting : this means we can do this downtime anytime *except* the few hours after the nightlies are created. 

3) We're proposing doing the downtime from 6-9am PST as this is lowest checkin load, so least disruptive to developers. Open question is:
3a) should we trigger the nightlies earlier (say 1am PST?), so that the rsync would be handled well in advance of the downtime? OR
3b) delay triggering the nightlies until after the downtime is over (say 9am PST)? This would mean handling nightly build+l10n load at the same time as developers start usual checkin load, so (3b) feels less optimal to me. Therefore I propose we do (3a). 

Any comments, thoughts before we cast this in stone?
Comment 40 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-31 21:43:26 PST
From the data we have so far it looks like we need to wait until at least 11am if you want the downtime to be less than an hour, that or trigger the nightlies that much earlier or after we're done.
Comment 41 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-01-31 21:46:06 PST
And that's on a weekend.  The weekday data we have so far (Monday) seems to imply that 4am to 1pm is off limits.  Note that the times listed on that chart are when the rsync completed.  It's running in a loop, so the end time of the previous pass is the start time of the one whose length of time is listed there (within a few seconds)
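Since each pass starts when the previous one ends, a pass's start time can be recovered by subtracting its duration from its completion time. For example, using the Monday rows from comment 37:

```shell
#!/bin/sh
# Recover a pass's start time from its duration and completion time.
# The 243m pass completed at 09:48 (Mon Jan 31), so it started around
# 09:48 - 243m = 05:45, matching the previous pass's 05:44:59
# completion to within a minute.
end_min=$(( 9 * 60 + 48 ))   # completion 09:48 as minutes past midnight
dur_min=243                  # pass duration, rounded to whole minutes
start_min=$(( end_min - dur_min ))
printf '%02d:%02d\n' $(( start_min / 60 )) $(( start_min % 60 ))
```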
Comment 42 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-02-01 11:02:35 PST
(In reply to comment #40)
> From the data we have so far it looks like we need to wait until at least 11am
> if you want the downtime to be less than an hour, that or trigger the nightlies
> that much earlier or after we're done.

Justdave: We're totally fine with triggering nightlies earlier/later for that one day (thursday), just to make this ftp-sync and final switchover happen. Given that, could we do this during the Thursday morning downtime?
Comment 44 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-02 16:26:52 PST
here's the procedure I'm planning on:

1) At 6:00am PDT, I put in an /etc/nologin file on surf to prevent ssh/scp/rsync-over-ssh connections from coming in.
2) The continuous rsync loop already in progress will be allowed to complete.
3) An additional loop will be allowed to complete with the entire loop happening while no one can upload.  This will ensure that every last bit of the data has been copied over.
4) httpd, vsftpd, and xinetd(rsync) will be shut down on all of surf, dm-ftp01, and dm-download02, to remove all readers and allow me to unmount the partitions.
5) The NFS mounts to dm-ftp01 will be dropped from surf and dm-download02 (FREAKING HURRAY!!!!)
6) Both partitions will be unmounted from dm-ftp01
7) the new partition will be mounted in place of the dm-ftp01 NFS mount on all three servers (in the case of dm-ftp01 this is in place of the iscsi mount)
8) the old partition will be mounted on surf in a separate out-of-tree mount point to allow any last minute cleanup or retrieval of missing items we didn't catch in step 3 (really unlikely, but better safe than sorry).
9) httpd, vsftpd, and xinetd(rsync) will be re-enabled on all servers where applicable
10) /etc/nologin will be removed from surf to allow uploads again.
11) profit!
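The steps above, expressed as a dry-run sketch. Every action is echoed rather than executed, and the service names, hosts, and paths are illustrative guesses, not the real production values:

```shell
#!/bin/sh
# Dry-run sketch of the cutover in comment 44. Nothing is executed:
# run() just prints what the real procedure would do.
run() { echo "WOULD RUN: $*"; }

run touch /etc/nologin                   # 1) block ssh/scp/rsync-over-ssh uploads
run wait-for-current-rsync-pass          # 2) let the in-flight pass complete
run run-one-final-rsync-pass             # 3) one full pass with writers blocked
for svc in httpd vsftpd xinetd; do       # 4) stop all readers on every host
    run service "$svc" stop
done
run umount /mnt/dm-ftp01                 # 5) drop NFS mounts from surf/dm-download02
run umount /pub/mozilla.org/firefox      # 6) unmount both partitions on dm-ftp01
run mount new-array /pub/mozilla.org/firefox   # 7) mount the new partition everywhere
run mount old-array /mnt/old-firefox     # 8) keep the old partition reachable out of tree
for svc in httpd vsftpd xinetd; do       # 9) bring services back up
    run service "$svc" start
done
run rm /etc/nologin                      # 10) allow uploads again
```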
Comment 45 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-02 16:29:14 PST
11) permanently disable nfsd on dm-ftp01 :)
Comment 46 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-02-02 16:34:20 PST
(In reply to comment #42)
> (In reply to comment #40)
> > From the data we have so far it looks like we need to wait until at least 11am
> > if you want the downtime to be less than an hour, that or trigger the nightlies
> > that much earlier or after we're done.
> 
> Justdave: We're totally fine with triggering nightlies earlier/later for that
> one day (thursday), just to make this ftp-sync and final switchover happen.
> Given that, could we do this during the Thursday morning downtime?

Nightly scheduler now tweaked to fire after tonight's downtime is over. This should help reduce the amount of rsync-ing needed.
Comment 47 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-03 07:55:41 PST
ok, completed through step 10.  Disabling nfsd on dm-ftp01 will need to wait until we decide we're done with the old mount (which is at /mnt/eql/builds on surf and dm-ftp01)
Comment 48 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-03 08:04:48 PST
the reverse-proxies to dm-ftp01 from dm-download02 and surf for the /pub/mozilla.org/firefox directory have been removed (serving locally off the netapp nfs mount now)
Comment 49 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-03 08:13:04 PST
ok, going to call this done.
Comment 50 Aki Sasaki [:aki] back dec19 2011-02-07 13:27:10 PST
This appears wrong:

[ffxbld@surf firefox]$ df -h . tinderbox-builds tryserver-builds
Filesystem            Size  Used Avail Use% Mounted on
10.253.0.11:/vol/stage
                      3.1T  1.8T  1.3T  58% /mnt/netapp/stage/archive.mozilla.org/pub/firefox
/mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds
                       17T  807G   16T   5% /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tinderbox-builds
10.253.0.11:/vol/stage
                      3.1T  1.8T  1.3T  58% /mnt/netapp/stage/archive.mozilla.org/pub/firefox

John O'Duinn wanted to have tinderbox-builds mounted on HA disk, and moved to tinderbox-builds/old (on 16T non-HA disk) after 14-20 days.  To me that means tinderbox-builds should be mounting /vol/stage, and cm-ixstore01 should be mounted on tinderbox-builds/old.

The reason for this is that cm-ixstore01 is, aiui, on a single head and if we lose that, we're looking at a day of downtime and a burning tree.
Comment 51 Dave Miller [:justdave] (justdave@bugzilla.org) 2011-02-07 22:22:26 PST
That's the way it's been since those got set up, I didn't touch those mount points (other than unmounting them around the swap of the other two).  The setup of those was on a different bug (I don't know which one, I didn't do it).  I'd suggest reopening that one if you can find it, or filing a new bug.

Note You need to log in before you can comment on or make changes to this bug.