Closed Bug 971684 Opened 10 years ago Closed 10 years ago

Rearrange product delivery mounts to remove nested client-side mounts

Categories

(Infrastructure & Operations :: Change Requests, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gcox, Assigned: gcox)

Having the clients (ftp/rsync/upload) mount one volume on top of another creates problems in managing the back-end storage. So, for the TCW, we want to start fixing that. This involves...

Rearranging
/mnt/netapp/stage/archive.mozilla.org/pub/seamonkey
/mnt/netapp/stage/archive.mozilla.org/pub/mobile
to be mounted from the filer side instead of the client side.

Creating/splitting off for size manageability
/mnt/netapp/stage/archive.mozilla.org/pub/b2g
/mnt/netapp/stage/archive.mozilla.org/pub/thunderbird
/mnt/netapp/stage/archive.mozilla.org/pub/xulrunner

Mounting the stage volume underneath the ftp_stage volume, on the filer side.

Again, this is pretty much an administrative move (all data will end up at the same places as it started), but it should then remove all client-side locking that would prevent us from doing filer maintenance.

* Date/time: next TCW, probably 2 hours needed just to be safe.
* Affects product delivery's main mounts
* Impact is confined to the TCW.
* Notification is part of the TCW
* gcox on point
Flags: cab-review?
Approved by the CAB on Feb 12 to be carried out during the TCW on Feb 22nd.
Blocks: 971818
Flags: cab-review? → cab-review+
Assignee: server-ops → gcox
Blocks: 963768
The simplest way to avoid writes during this period would be to disable the inbound services on upload[12] & upload-cron. I.e. all changes from the build network are made via these 3 hosts.

Ideally, we can find a switch to throw to make these hosts inaccessible from the build network during the critical window. 

Jake: is it possible to take these hosts offline during the critical window? Or down their network connections?

If we can't stop access to those hosts, we'll have to kill all running jobs -- almost every build & test step produces output stored on ftp. That will be time consuming and disruptive (lengthening TCW).
Flags: needinfo?(nmaul)
My vote is remount read-only on upload1/upload2/upload-cron, final rsync, swap to junction mount. Jobs will fail if they happen to finish in the few mins of the change, but it's a tree closure and they can be rerun. Minor fallout for RelEng uploading logs, easy to fix.
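
The sequence proposed above could be sketched as a dry-run plan, roughly as follows. This is only an illustration: the helper name `build_cutover_plan`, the old/new paths, and the rsync flags are assumptions, not what was actually run. The swap to the junction mount happens on the filer side, so it is left as a comment rather than modeled.

```python
import shlex


def build_cutover_plan(mountpoint, old_path, new_path):
    """Return the client-side commands for one cutover, in order.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    return [
        # 1. Stop new writes from this client by remounting read-only.
        ["mount", "-o", "remount,ro", mountpoint],
        # 2. Final sync of anything written since the earlier copy.
        #    Trailing slashes make rsync copy the directory contents.
        ["rsync", "-a", "--delete",
         old_path.rstrip("/") + "/", new_path.rstrip("/") + "/"],
        # 3. The swap to the junction mount is done on the filer side
        #    and is deliberately not modeled here.
    ]


if __name__ == "__main__":
    plan = build_cutover_plan(
        "/mnt/netapp/stage",   # mount prefix taken from the paths in this bug
        "/mnt/old/firefox",    # assumed path, for illustration only
        "/mnt/new/firefox",    # assumed path, for illustration only
    )
    for argv in plan:
        print(shlex.join(argv))
```

Printing the plan first, instead of executing it, makes it easy to review the order of steps per host (upload1, upload2, upload-cron) before the window opens.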
One possible issue I see is that files could "pile up" in /tmp/tmpXXX directories since the move commands (to the final destination) will fail and be retried.

Hal suggests we do an 'rm -rf' of those files.
Probably we'll be OK because there's more than 50G free for /tmp and the outage should be fairly short for each partition. Will keep an eye on it during the TCW though.
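
Keeping an eye on /tmp during the window could be as simple as the sketch below. The function names and the 50 GiB threshold are assumptions for illustration (the threshold is borrowed from the "more than 50G free" remark); nothing here is an existing monitoring check.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path):
    """Free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(path, min_free_gib):
    """True if at least `min_free_gib` GiB remain free on `path`'s filesystem."""
    return free_gib(path) >= min_free_gib


if __name__ == "__main__":
    # 50 is a hypothetical alert threshold, not a measured requirement.
    print("/tmp free GiB:", round(free_gib("/tmp"), 1))
    if has_headroom("/tmp", 50):
        print("OK")
    else:
        print("LOW: consider cleaning up leftover /tmp/tmpXXX directories")
```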
:nthomas's solution seems reasonable to me. (commenting just to +1 and remove the NEEDINFO)
Flags: needinfo?(nmaul)
Greg, could you record the work you're planning to do in breaking up firefox/ here ?
The vol known as 'stage', /mnt/netapp/stage/archive.mozilla.org/pub/firefox, has 6 subdirs:
bundles  candidates  nightly  releases  tinderbox-builds  try-builds

* candidates is already a junction mount, no change there.
* nightly, tinderbox-builds, try-builds will be broken to new junction mounts.
* releases and bundles will sync off to a new volume (internally called archivemo_firefox), where they will be at the root of the volume instead of in a qtree.
* the junction mounts beneath stage will be rehomed to archivemo_firefox.
* archivemo_firefox will become a junction mount beneath the main ftp_stage volume, which will itself be going through a de-qtree'ing process.
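
As a rough picture of the end state, the junction layout can be modeled as a map from junction path to volume, with a path resolving to the deepest junction above it. Only 'ftp_stage' and 'archivemo_firefox' are real names from this bug; the junction paths and every other volume name below are made-up placeholders, and `volume_for` is a hypothetical helper.

```python
# Junction path (relative to the ftp_stage root) -> volume.
# 'ftp_stage' and 'archivemo_firefox' are named in this bug; every path and
# every other volume name below is a placeholder for illustration.
JUNCTIONS = {
    "/": "ftp_stage",
    "/pub/firefox": "archivemo_firefox",
    "/pub/firefox/candidates": "vol_candidates",       # placeholder name
    "/pub/firefox/nightly": "vol_nightly",             # placeholder name
    "/pub/firefox/tinderbox-builds": "vol_tinderbox",  # placeholder name
    "/pub/firefox/try-builds": "vol_try",              # placeholder name
}


def volume_for(path, junctions=JUNCTIONS):
    """Return the volume backing `path`: the deepest junction that contains it."""
    parts = [p for p in path.split("/") if p]
    # Walk from the full path up towards the root until a junction matches.
    for depth in range(len(parts), -1, -1):
        candidate = "/" + "/".join(parts[:depth])
        if candidate in junctions:
            return junctions[candidate]
    raise KeyError(path)


if __name__ == "__main__":
    for p in ("/pub/firefox/releases", "/pub/firefox/nightly", "/pub/seamonkey"):
        print(p, "->", volume_for(p))
```

The point of the model is that clients see a single tree under one mount, while each subtree can still live on its own volume on the filer.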

This came about because of my own oversight: I forgot that you can't junction-mount a qtree'd point in a subvolume. If we didn't proceed on this, we wouldn't achieve the goal of untangling this mount; the old stage volume would've stayed in place. The breakup of stage and the subjunctioning of volumes are just bonus wins.
qtrees eliminated, all mounts folded up into junction mounts, and now just* 'ftp_stage' is mounted by the clients.  More shuffling will need to take place in a future window, but this streamlines the process.


* There's also pvtbuilds and ffxbld, but they aren't nested in the stage mount so they don't count.
Blocks: 974220
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Also: checkins 82952, 82953, 82954, 82956 were the checkins to puppet, removing old mount references.
See Also: → 858609
Product: mozilla.org → Infrastructure & Operations
Change Request: --- → approved
Flags: cab-review+