Closed Bug 951322 Opened 11 years ago Closed 10 years ago

Volume moves for proddist

Categories

(Infrastructure & Operations :: Change Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gcox, Assigned: gcox)

References

Details

When: Treeclosing window of 11 Jan 2014
Duration: ~15 minutes
Plan: migrate product delivery volumes* from old filer to new filer in scl3.  Rollback is to revert to the old volumes.
Affects: ftp cluster, rsync.
Notif: part of the treeclosure
Who: gcox+relops

Volumes:
/vol/archivemo_mobile
/vol/archivemo_thunderbird
/vol/archivemo_seamonkey
/vol/archivemo_xulrunner
/vol/archivemo_b2g
/vol/stage
/vol/ftp_stage
/vol/tinderbox_builds
/vol/ffxbld
/vol/pvtbuilds
Flags: cab-review?
Depends on: 951731
Approved by the CAB on 18th Dec to be done during the TCW on 11th Jan.
Assignee: server-ops → gcox
Flags: cab-review? → cab-review+
As well as the upload of new bits, we have some cron jobs that move data around on these volumes. Will we be setting the old filer r/o, or should we plan to disable the crons?
The basic plan for volume moves goes something like this (glossing over some administrative puppet changes and prep work):
* "service shutdown" (whatever that means to the people who use the servers that mount the volumes)
* I flip the volume to read-only and unmount it.
* I do a final sync from the now-r/o volume to the new volume
* I break the mirroring from old to new, thus turning the new volume r/w.
* I mount the r/w volume
* I hand it off to let the services start back up.
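
The ordering of the steps above can be sketched as a toy model. This is only an illustration of the sequence, assuming a made-up `Volume` class and `move_volume` helper; neither corresponds to a real filer command or API, and the actual work is done with the filer's own tooling plus puppet changes.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Toy stand-in for a filer volume (hypothetical, not a storage API)."""
    name: str
    read_only: bool
    mounted: bool
    files: list = field(default_factory=list)


def move_volume(old: Volume, new: Volume, log: list) -> Volume:
    """Run the cutover steps in order, recording each one in `log`.

    Assumes services using the volume were already shut down by the caller
    and are started again by the caller afterwards.
    """
    # Flip the old volume to read-only and unmount it.
    old.read_only = True
    old.mounted = False
    log.append("old volume read-only and unmounted")

    # Final sync from the now read-only old volume to the new volume.
    new.files = list(old.files)
    log.append("final sync done")

    # Break the mirror, which turns the new volume read-write.
    new.read_only = False
    log.append("mirror broken, new volume read-write")

    # Mount the read-write volume so services can be handed back.
    new.mounted = True
    log.append("new volume mounted")
    return new


if __name__ == "__main__":
    steps = []
    old_vol = Volume("/vol/stage", read_only=False, mounted=True, files=["a", "b"])
    new_vol = Volume("/vol/stage", read_only=True, mounted=False)
    move_volume(old_vol, new_vol, steps)
    for step in steps:
        print(step)
```

The point of the ordering is that nothing can write to the old volume after the final sync starts, so the new volume is a complete copy by the time it goes read-write.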

So, the crons should be disabled for the window.  The names of mountpoints as seen from the OS will be unchanged, but IPs and exports will be different.
OK. From the RelEng side the following should be enough:
* disable crond on upload-cron.private.scl3  (to prevent bits being moved around). Just non-root is fine if you need root for puppet etc.
* disable non-root ssh connections for upload1.dmz.scl3, upload2.dmz.scl3  (to prevent new uploads)
(In reply to Nick Thomas [:nthomas] from comment #4)
> OK. From the RelEng side the following should be enough:
> * disable crond on upload-cron.private.scl3  (to prevent bits being moved
> around). Just non-root is fine if you need root for puppet etc.
> * disable non-root ssh connections for upload1.dmz.scl3, upload2.dmz.scl3 
> (to prevent new uploads)

How many different accounts do we have on the servers? I.e., disabling logins is easiest by tweaking the authorized keys file, afaik.

Same for crontabs if there are only a few accounts (or are they in /etc/cron* files/dirs?)

Added jhopkins, who will be doing the releng work.
Moved during TCW.  We probably could've handled this one better, since it didn't hardhat.
Ran it in 4 updates.

Volume ffxbld:
modules/productdelivery/manifests/upload.pp modules/productdelivery/manifests/upload_cron.pp, changes 80574 and 80588, since I had an error in fixing upload-cron.

Volume pvtbuilds:
modules/fuzzing/manifests/mount.pp modules/productdelivery/manifests/mounts/pvtbuilds_ro.pp modules/productdelivery/manifests/mounts/pvtbuilds_rw.pp, change 80576.

Volume tinderbox_builds:
modules/productdelivery/manifests/ftp.pp modules/productdelivery/manifests/upload.pp modules/productdelivery/manifests/upload_cron.pp modules/productdelivery/manifests/rsync.pp, change 80580.

The stage/ftp_stage/archivemo subvolumes:
same files as tinderbox_builds, change 80584.

Because of the not-hardhat, there was a load spike/hung mounts/zombified apache condition on much of the ftp cluster (and to a lesser extent on the rsync/uploads).  We did some apache restarts and ended up doing rolling reboots across the cluster to make sure mounts cleared out.  Affected time ~1100-~1130 PST; nagios was clear at the end.

Updated the product delivery mana page with the new mounts.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Change Request: --- → approved
Flags: cab-review+