Bug 425291: Verify VMs have "independent" switch correctly unchecked
Opened 16 years ago, closed 16 years ago
Status: RESOLVED FIXED
Categories: Release Engineering :: General, defect, P1
Tracking: Not tracked
People: Reporter: joduinn, Unassigned
Attachments: 1 file (1.22 KB, text/plain)
If the "independent, non-persistent" switch is set on a VM, powering down the VM will wipe out the contents of the disk completely. This was incorrectly set in the refimage, but appears to have been manually corrected in at least a few VMs.

We discovered this yesterday when we powered down a VM so IT could copy it. The copy was identical to the original; both were missing the contents of the disk, and Mikeal had to manually reinstall everything. :-(

We need to review all of our VMs to confirm this "independent" switch is unchecked. For any that are running with this incorrect switch, we cannot power down the VM without doing a hot disk backup first. This bug is to track reviewing (and, if needed, fixing) all of our VMs so we can restart safely.

(ps: the win2k3sp2-vc8tools-ref-vm ref image is now confirmed to be correct.)
Reporter
Updated•16 years ago
Priority: -- → P1
Comment 1•16 years ago
Here's a list of all the VM disk images currently mis-configured. Some fancy shell foo breaks that down to seven boxes (five if you ignore the two ref images):

[root@bm-vmware11 volumes]# grep -l independent-nonpersistent netapp*/*/*vmx | sort -u | awk -F/ '{print $2}'
bm-wiin2k3-mobile01
sm-linux-mobile-tbox01
CentOS-5.0-ref-tools-vm
CentOS-5.0-ref-tools-vm-scrubbed
qm-centos5-moz2-01
qm-win2k3-moz2-01
qm-win2k3-stage-pgo01
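The one-liner above prints only the VM directory names. A minimal sketch of a variant that also reports which disk entry carries the setting follows. The function name `report_nonpersistent_disks` is hypothetical, and the sketch assumes the same layout as the command above (a volumes directory containing `netapp*/<vm-dir>/<name>.vmx`) and that each disk mode sits on its own `key = "value"` line in the .vmx file.

```shell
#!/bin/sh
# Sketch only: report_nonpersistent_disks is a hypothetical helper,
# not part of any VMware tooling.
# Assumed layout (same as the grep above): <volumes>/netapp*/<vm-dir>/<name>.vmx
report_nonpersistent_disks() {
    volumes=$1
    for vmx in "$volumes"/netapp*/*/*.vmx; do
        # Skip the unexpanded pattern when nothing matches.
        [ -f "$vmx" ] || continue
        vm=$(basename "$(dirname "$vmx")")
        # Print "<vm-dir> <key>" for every line that carries the setting.
        grep 'independent-nonpersistent' "$vmx" | while IFS= read -r line; do
            # Keep the text left of "=" and drop surrounding blanks.
            key=$(printf '%s\n' "${line%%=*}" | tr -d ' \t')
            printf '%s %s\n' "$vm" "$key"
        done
    done | sort -u
}
```

Run from the volumes directory it would be called as `report_nonpersistent_disks .`, which shows whether the OS disk, the data disk, or both need fixing on each VM.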
Comment 2•16 years ago
Sorry, call it four - qm-win2k3-stage-pgo01 isn't real.
Comment 3•16 years ago
When you say "wipe out the contents of the disk completely" do you mean the disk was actually blanked, i.e. no data at all, or reverted to the pre-install state?

I believe non-persistent was considered a "feature" when we set these up originally. We did it with the understanding that we're *not* meant to be changing the OS or tools on these VMs. We should be revving the ref platform if we need changes. Builds and other persistent data were always meant to be stored on a second, separate drive marked as persistent. At least that's the way preed drew it up at the outset.

Do we not do this any more? Did we ever/am I misremembering?

I know this bit me personally when I had to update the passwords recently. I had to take any non-persistent disks offline and switch them to persistent before changing any passwords and then switch them back again.
Comment 4•16 years ago
Both disks in the ref image are non-persistent. I can see the reasoning for the OS drive, but this totally prevents hot clones, which you guys have asked for on several occasions.
Comment 5•16 years ago
(In reply to comment #3)
> I believe non-persistent was considered a "feature" when we set these up
> originally. We did it with the understanding that we're *not* meant to be
> changing the OS or tools on these VMs. We should be revving the ref platform if
> we need changes. Builds and other persistent data were always meant to be
> stored on a second, separate drive marked as persistent. At least that's the
> way preed drew it up at the outset.

I always assumed the ref images themselves were set to non-persistent to prevent accidental changes to them. In our cloning guide it says to change hard drives to "persistent" mode after cloning (https://intranet.mozilla.org/Build:Farm:CloningRefPlatforms#Cloning_a_Win32_VM).

Paul, can you enlighten us?
Reporter
Comment 6•16 years ago
mrz: did I remember correctly from irc that to change the setting on these VMs, we have to power down the running VM... (which would be bad!). Therefore, to avoid lossage, you thought you'd have to add a persistent disk to the VM, copy off the bits from the non-persistent disk, and *then* power down to flip the switch, and then restore? Is that a fair summary or did I miss something?
Comment 7•16 years ago
(In reply to comment #5)
> I always assumed the ref images themselves were set to non-persistent to
> prevent accidental changes to them. In our cloning guide it says to change hard
> drives to "persistent" mode after cloning
> (https://intranet.mozilla.org/Build:Farm:CloningRefPlatforms#Cloning_a_Win32_VM).
>
> Paul, can you enlighten us?

Well, I can't see that document anymore, but Coop pretty much has it right: I distinctly remember making the decision to put the ref images in non-persistent mode as a way to assert to myself and others that no changes could be made to a reference image without blatantly intending to do so.

This was not a quote-"feature"-unquote in the "haha, it's actually a retarded bug" sense, it was a very deliberate design decision to protect the reference image from randomly picking up inadvertent changes over time, which had happened in quite painful ways in the past with the physical machines. It's a way to be able to assert this design element in a way one can prove.

I'm pretty sure there's a step in the doc that says this switch must be flipped when deploying VMs, but I don't know how you are currently deploying VMs, so I don't know if that doc is even being followed anymore. It looks like all the VMs that have this setting were deployed after I left.

I apologize that there's this "gotcha," but I don't know how else I would've made that step clearer, assuming it is in the document to begin with. It might be worth revisiting some of that documentation and reading through it in detail, in the case that it's not being used anymore, or it was forgotten upon my egress. I am, of course, happy to answer questions and such, but that's a bit of a chicken and egg problem, since it's sometimes difficult to know what to ask until it's bitten you. ;-) The old VMware documents that I wrote up (and there are like 4 or 5; check the "old documents" section of the Intranet Build page) may be of some use here.

I don't know what mrz is referring to when he says "hot clones," but VMware may have some tools these days for doing these kinds of deployments (where you create "templates," they're called in Workstation, and then the act of cloning them sets the disks up correctly). I'm thinking of Lab Manager, but looking at www.vmware.com right now, they appear to have a whole series of "IT Service Delivery" products that may assist.
Comment 8•16 years ago
> I'm pretty sure there's a step in the doc that says this switch must be flipped

It does, but it's also a document I had no knowledge of until two days ago. I wouldn't really fault you on that since you did document it.

> I don't know what mrz is referring to when he says "hot clones," but VMware may

I'm using vReplicator from Vizioncore. It works by grabbing a snapshot and using that to clone and restore to another VM (its primary purpose is to DR running VMs).

However, since it requires snapshots to do its work, having independent mode on kills hot clones.

Guess I don't care what the ref image looks like - production images should have independent checked off if you guys ever want hot clones.
Comment 9•16 years ago
(In reply to comment #8)
> > I'm pretty sure there's a step in the doc that says this switch must be flipped
>
> It does but it's also a document I had no knowledge of until two days ago. I
> wouldn't really fault you on that since you did document it.

D'oh. Yah, sorry the handoff when I left on that wasn't... better.

> I'm using vReplicator from Vizioncore. It works by grabbing a snapshot and
> using that to clone and restore to another VM (its primary purpose is to DR
> running VMs).
>
> However, since it requires snapshots to do its work, having independent mode on
> kills hot clones.
>
> Guess I don't care what the ref image looks like - production images should
> have independent checked off if you guys ever want hot clones.

Ooh, interesting. I think I just realized something...

I believe I set independent mode because I assumed that we'd never do snapshots in build VMs. I also seem to remember reading a VMware technote that said that independent mode being off was slower (but this may have only been on the hosted products, not on VMFS-hosted VMs).

I think there's some confusion here, though; I'm looking at the options pane for creating a Workstation 5.x VM (which is roughly the same as ESX 3.x), and independent mode is a different setting from persistent/non-persistent mode (although one is required to set the other), so I think we're talking about different things.

Coop's symptoms, and the question bhearsum asked, are about non-/persistent mode, i.e. what happens when the VM gets rebooted (do the changes get dropped or committed immediately). Independent mode is about whether or not you're allowed to take snapshots.

So, it's possible to enable independent mode, so vReplicator would work, but still observe the non-persistent behavior for the ref images, which is what I'd do if I were managing a virtualized build farm with reference images I wanted to be able to assert were clean.

But since joduinn is saying such settings are "incorrect," maybe he's got a strategy for being able to assert configurations in VMs that's better than what I had. :-)
Comment 10•16 years ago
Persistent only becomes an available option if Independent is checked on.
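To keep the two settings apart, here is a rough sketch that maps a disk mode string to the two behaviors discussed in this thread: whether the disk takes part in snapshots (what vReplicator needs), and whether writes survive a power-off. Only `independent-nonpersistent` appears verbatim in this bug; the other two mode strings and the function name `describe_disk_mode` are assumptions that should be checked against the actual .vmx files and the VMware documentation for the version in use.

```shell
#!/bin/sh
# Sketch only: describe_disk_mode is a hypothetical helper.
# Mode strings other than independent-nonpersistent are assumptions.
describe_disk_mode() {
    case $1 in
        persistent)
            # "Independent" unchecked: snapshots work, writes are kept.
            echo "snapshots=yes changes_kept=yes" ;;
        independent-persistent)
            # "Independent" checked, persistent: no snapshots, writes are kept.
            echo "snapshots=no changes_kept=yes" ;;
        independent-nonpersistent)
            # "Independent" checked, non-persistent: no snapshots,
            # writes are discarded when the VM is powered off.
            echo "snapshots=no changes_kept=no" ;;
        *)
            echo "unknown mode: $1" >&2
            return 1 ;;
    esac
}
```

Under these assumptions, `describe_disk_mode independent-nonpersistent` reports the combination that caused the data loss in comment #0, and plain `persistent` is the only mode that allows hot clones.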
Comment 11•16 years ago
the two machines qm-win2k3-moz2-01 and qm-centos5-moz2-01 are safe to be rebooted with the settings fixed. I haven't made any changes to them yet and won't lose any data with a reboot.
Comment 12•16 years ago
Those two are done.
Reporter
Comment 13•16 years ago
(In reply to comment #0)
> (ps: the win2k3sp2-vc8tools-ref-vm ref image is now confirmed to be correct.)

I've just unchecked "Independent" on:
CentOS-5.0-ref-tools-vm
CentOS-5.0-ref-tools-vm-scrubbed
...so now future linux & win2k3 VMs will be automatically safe to reboot.
Reporter
Updated•16 years ago
Component: Release Engineering: Talos → Release Engineering
Reporter
Comment 14•16 years ago
All that's left to fix are:
bm-wiin2k3-mobile01
sm-linux-mobile-tbox01

Who is the best contact person for these? Best case, we have to reboot them. Worst case, mrz will have to do the add-new-disk-and-backup dance and *then* we can reboot them.
Reporter
Updated•16 years ago
Assignee: nobody → joduinn
Comment 15•16 years ago
(In reply to comment #14)
> All thats left to fix are:
>
> bm-wiin2k3-mobile01
> sm-linux-mobile-tbox01
>
> Who is best contact person for these?

dougt
Comment 16•16 years ago
they can be rebooted whenever.
Reporter
Comment 17•16 years ago
Doug: good to know. Before we can reboot, is there anything installed on these since they were imaged? Rebooting will cause any "independent" disk to be wiped clean, so if there are files there, they will be lost. (See details in comment#0.)
Comment 18•16 years ago
no, you can clobber whatever is on the boxes.
Comment 19•16 years ago
sm-linux-mobile-tbox01 was rebooted. John asked about QA VMs on QA ESX servers - none of them have this issue.
Reporter
Comment 20•16 years ago
I fixed and rebooted bm-wiin2k3-mobile01. That's the last one, so now closing. (Thanks for the confirm on QA VMs, mrz!)
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 21•16 years ago
we missed one. qm-moz2-unittest01 is set to non-persistent.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 22•16 years ago
ps, we suspect it's been down since Thursday, May 8, 2008.
Reporter
Comment 23•16 years ago
As far as we can tell, qm-moz2-unittest01 was on bm-vmware08, and just migrated to bm-vmware07 on Thurs night.
Assignee: joduinn → nobody
Status: REOPENED → NEW
Reporter
Comment 24•16 years ago
mrz: We are not aware of changing the qm-moz2-unittest01 VM. Has it always been this way, and we somehow missed it in the last go-around? Did the fancy shell foo in comment#1 cover all ESX hosts in the Build and QA networks? Can you re-run it, and confirm status across all 14 ESX hosts and 103 VMs?
Comment 25•16 years ago
I suspect this one wasn't on the build ESX hosts but on qm-vmware02 or qm-vmware01 and moved. If that's the case, it was left out of my original list.
Comment 26•16 years ago
[root@bm-vmware01 volumes]# grep -l independent-nonpersistent netapp*/*/*vmx | sort -u | awk -F/ '{print $2}'
qm-moz2-unittest01
Reporter
Comment 27•16 years ago
From discussion with mrz on irc, the above script magic checks all 85 VMs on all 12 Build ESX hosts. I've just manually looked at each of the 18 VMs on both QA ESX hosts and found the following:
1) CentOS-5.0-ref-tools-vm had the "independent" checkbox set. I've unchecked the checkbox.
2) qm-jstest01 (on drive#1) has the "independent" checkbox set. This needs to be fixed.
Comment 28•16 years ago
More accurately, that shell command only grep'd through the vmx files on the netapp iSCSI datastores from the build ESX perspective.
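Because the command only sees .vmx files under the `netapp*` datastores visible from one host, it can help to print how many files were actually scanned, so the total can be compared with the expected VM count. A minimal sketch under the same layout assumption as the grep above; `scan_summary` is a hypothetical name, and it counts .vmx files rather than VMs.

```shell
#!/bin/sh
# Sketch only: scan_summary is a hypothetical helper.
# Counts the .vmx files the glob can see and how many carry the setting.
scan_summary() {
    volumes=$1
    scanned=0
    flagged=0
    for vmx in "$volumes"/netapp*/*/*.vmx; do
        # Skip the unexpanded pattern when nothing matches.
        [ -f "$vmx" ] || continue
        scanned=$((scanned + 1))
        if grep -q 'independent-nonpersistent' "$vmx"; then
            flagged=$((flagged + 1))
        fi
    done
    echo "scanned=$scanned flagged=$flagged"
}
```

A `scanned` figure lower than the number of VMs you expect points to VMs living on datastores the glob does not match, which then need to be checked some other way.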
Reporter
Comment 29•16 years ago
From a quick chat with mrz, this means all 12 Build ESX hosts in the Build "data center" were verified. Excluded from his shell-foo-magic are:
- the 2 QA ESX hosts in the QA "data center", as I've already gone through them manually in comment#27.
- the dev ESX host, as we don't care about VMs on the dev ESX host.
Reporter
Comment 30•16 years ago
(In reply to comment #27)
> 2) qm-jstest01 (on drive#1) has "independent" checkbox set. This needs to be
> fixed.

bclary and I fixed this. Restarted the VM with the independent checkbox unchecked.

That's it for this bug, so closing as FIXED.
Status: NEW → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering