Closed Bug 501251 Opened 15 years ago Closed 15 years ago

Test ESX upgrade on single host

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: phong)


We'd like to verify that our existing VMs will work fine during the period when the ESX hosts have been updated but before we have had a chance to update VMware tools. So please take one of the ESX hosts and put it in a separate cluster (so that DRS doesn't move machines around). After the ESX upgrade we can migrate moz2-win32-slave28 & 29, try-w32-slave19, moz2-linux-slave22 & 23, and try-linux-slave19 onto that host. RelEng will monitor those machines for a couple of days and, if all is well, we can proceed with the other host upgrades. We could also test that the new VMware tools don't introduce any problems.

Incidentally, perhaps we can use this as a deployment strategy to keep machines with new VMware tools on updated ESX hosts.
I forgot to mention that we've already upgraded the IT - San Jose Production cluster.  I don't think any of the VMs on those 4 hosts have the updated VMware tools.  They have all been running fine since last week.
I just realized these are all on the new Intel cluster and can't be moved to bm-vmare08 without shutting them down first.
(In reply to comment #1)
> I forgot to mention that we've already upgraded the IT - San Jose Production
> cluster.  I don't think any of the VMs on those 4 hosts have the updated
> VMware tools.  They have all been running fine since last week.

How many CentOS 5 VMs do you have there? How about Windows?

(In reply to comment #2)
> I just realized these are all on the new Intel cluster and can't be moved to
> bm-vmare08 without shutting them down first.

I chose them at random, really. Feel free to pick any two moz2-win32-slaveN machines and one try-w32-slaveN, and the same for Linux.
(In reply to comment #3)
> (In reply to comment #1)
> > I forgot to mention that we've already upgraded the IT - San Jose Production
> > cluster.  I don't think any of the VMs on those 4 hosts have the updated
> > VMware tools.  They have all been running fine since last week.
> 
> How many CentOS 5 VMs do you have there? How about Windows?
One is Win2k3 and the rest are RHEL.

> 
> (In reply to comment #2)
> > I just realized these are all on the new Intel cluster and can't be moved to
> > bm-vmare08 without shutting them down first.
> 
> I chose them at random really. Feel free to pick any two moz2-win32-slaveN
> machines, one try-w32-slaveN, and the same for linux.

moz2-linux-slave14/17, moz2-win32-slave20/22, and try-win32-slave09 are moved over. With just those 5 VMs, the host is at almost 100% of its memory. Is that good enough for now? If not, I can move another ESX host over and move a few more VMs.
That'll be close enough, we'll watch it to see if anything comes up.
Can we call this successful and upgrade the rest of them now?
Can't see any regressions for the build slaves running with old tools on the updated host, just known orange on unit tests. Now going ahead with updating the tools and verifying that.
Tried moz2-linux-slave14 first and hit some snags. I'm putting this here as a doc for RelEng mostly.

The automatic tools upgrade completed successfully according to VI, but it did not seem to restart the tools (status "not running"). Starting them as root gave no errors, but since a network restart was needed too I decided to test the reboot case. There were problems unmounting /builds on shutdown; something was making the kernel think it was busy (buildbot was already shut down). There was a storage problem on the console when I first came to it, so perhaps that's related.

On reboot I discovered that /etc/fstab had been changed. The main problem is that /builds is not present, but we also lost the noatime option on / and the Puppet header. This is a known issue:
  http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1006718&sliceId=1&docTypeID=DT_KB_1_1&dialogID=24732296&stateId=1%200%2024734213
summary: the VMware tools upgrade has a bug that wipes out any modifications you made to /etc/fstab
workaround: back up /etc/fstab before the upgrade and then restore it
comment: OMGWTFBBQ!!11!!!

Restoring Puppet's backup from June 30 (/etc/fstab.old.0) would do, although it would be good to remove this section
 # Beginning of the block added by the VMware software
 .host:/ /mnt/hgfs       vmhgfs  defaults,ttl=5  0       0
 # End of the block added by the VMware software
which causes a spurious error about mounting local file systems on boot (and vmware tools also removed it). So I did that. 
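
Removing that block by hand is easy to get wrong when repeated across many slaves. Below is a minimal sketch of doing it programmatically, assuming the marker comments appear exactly as quoted above. The function name and the sample device names in the usage are hypothetical, not something taken from the slaves.

```python
# Marker comments as quoted in this bug; assumed to match the file exactly.
BEGIN = "# Beginning of the block added by the VMware software"
END = "# End of the block added by the VMware software"


def strip_vmware_block(fstab_text):
    """Return fstab_text without the VMware-added block (markers included).

    If a BEGIN marker has no matching END, the text is returned unchanged
    rather than risk dropping real mount entries.
    """
    kept = []
    inside = False
    for line in fstab_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped == BEGIN:
            inside = True
            continue
        if stripped == END:
            inside = False
            continue
        if not inside:
            kept.append(line)
    if inside:
        # Unterminated block: leave the file alone.
        return fstab_text
    return "".join(kept)


if __name__ == "__main__":
    # Hypothetical fstab contents, for illustration only.
    sample = (
        "/dev/sda1 / ext3 defaults,noatime 1 1\n"
        + BEGIN + "\n"
        + ".host:/ /mnt/hgfs vmhgfs defaults,ttl=5 0 0\n"
        + END + "\n"
        + "/dev/sdb1 /builds ext3 defaults,noatime 1 2\n"
    )
    print(strip_vmware_block(sample), end="")
```

The function only works on text, so the caller decides where the input comes from (for example the restored backup) and where the result is written.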

moz2-linux-slave14 is back in the prod pool.
moz2-linux-slave17 done at 16:35 PDT. Steps:
1. Log in as root, cd /etc, cp fstab fstab.bak
2. Using VI, do an automatic upgrade of VMware tools
3. Back as root, cp fstab.bak fstab, then edit fstab to remove the hgfs lines in comment #8
4. Reboot
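
Since the failure mode here was a silently rewritten fstab, a check after step 3 (or between steps 2 and 3, to see what the upgrade did) may be worth having. The sketch below compares two fstab texts and reports mount points that disappeared and options that were dropped. The function names and the sample entries are assumptions for illustration; entries are keyed by mount point, so multiple swap lines would collapse into one.

```python
def parse_fstab(text):
    """Map mount point -> set of mount options, skipping comments and blanks."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        entries[fields[1]] = set(fields[3].split(","))
    return entries


def fstab_regressions(before, after):
    """List mounts missing from `after` and options lost relative to `before`."""
    old, new = parse_fstab(before), parse_fstab(after)
    problems = []
    for mount_point, old_opts in sorted(old.items()):
        if mount_point not in new:
            problems.append(f"missing mount: {mount_point}")
            continue
        lost = old_opts - new[mount_point]
        if lost:
            problems.append(
                f"{mount_point} lost options: {','.join(sorted(lost))}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical contents mirroring the symptoms described in this bug.
    before = (
        "/dev/sda1 / ext3 defaults,noatime 1 1\n"
        "/dev/sdb1 /builds ext3 defaults,noatime 1 2\n"
    )
    after = "/dev/sda1 / ext3 defaults 1 1\n"
    for problem in fstab_regressions(before, after):
        print(problem)
```

An empty result means every mount point and option from the backup is still present; it does not check the Puppet header, which is a comment line.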
moz2-win32-slave20 done at 17:14 PDT
moz2-win32-slave22 done at 17:18 PDT
 try-win32-slave09 done at 17:22 PDT

Notes
1. Both moz2 Windows slaves think they have DNS names like win32-slave20.uib.local rather than win32-slave20.build.mozilla.org.
2. When VI says "Completed" for a win32 tools upgrade, it means the tools are installed and a reboot has been initiated, not that the machine is back up after the reboot.
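
Given note 2, anything scripted around the Windows upgrade should not treat "Completed" as "ready". One way to handle that is to wait until the machine has been seen down and then up again. This is only a sketch: `is_up` is a hypothetical caller-supplied probe (for example a TCP connect to a port the slave normally listens on), and the timeout and interval values are arbitrary.

```python
import time


def wait_for_reboot(is_up, timeout=600.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Return True once is_up() has been False and then True again.

    Returns False if that sequence is not observed before `timeout`
    seconds have elapsed. A reboot that completes entirely between two
    polls will be missed, so keep `interval` short relative to boot time.
    """
    deadline = clock() + timeout
    seen_down = False
    while clock() < deadline:
        if not is_up():
            seen_down = True
        elif seen_down:
            return True
        sleep(interval)
    return False
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays; in normal use they can be left at their defaults.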
Can't see any regression from the upgraded host/upgraded tools scenario, after excluding the known causes of orange and red builds. 

bhearsum/catlee, do you have any comments before Phong goes ahead with upgrading the world ?
Blocks: 500761
Sounds OK to me to go ahead. The Linux stuff sucks, so we'll have to be careful when we do the tools installs.
I am starting the upgrade now.
Can we close this bug or move it over to release for your internal tracking of the tools upgrade?
Thanks for indulging us Phong.

Updates done so far:
* moz2-linux-slave14, 17
* moz2-win32-slave03, 20, 22
* try-win32-slave09
* production-master, possibly staging-master

Linux to be done per comment #9; Windows can be done automatically. Buildbot should be shut down first in each case.
Assignee: phong → nobody
Component: Server Operations: Tinderbox Maintenance → Release Engineering
QA Contact: mrz → release
Summary: Test ESX upgrade on single host → Upgrade VMWare tools on RelEng VMs
Actually, the deps on this mean we should make a new bug.
Assignee: nobody → phong
Status: NEW → RESOLVED
Closed: 15 years ago
Component: Release Engineering → Server Operations: Tinderbox Maintenance
QA Contact: release → mrz
Resolution: --- → FIXED
Summary: Upgrade VMWare tools on RelEng VMs → Test ESX upgrade on single host
Bug 503392 for the tools updates on the slaves.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations