Closed Bug 726384 Opened 12 years ago Closed 12 years ago

bm-vmware01.build.sjc1 Down

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86_64
Windows 7
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rbryce, Assigned: rbryce)

Details

The server started alerting on iLO communication errors. I restarted the iLO card hoping to regain access to Blade 7, but the blade appears to have failed.
The blade seemed to crash as soon as I logged in after the iLO reset.
That host was running the following VMs:

dev-stage1
moz2-linux-slave19
moz2-linux-slave27
moz2-linux-slave34
moz2-linux-slave37
moz2-linux-slave39
moz2-linux64-template
preproduction-master
preproduction-stage
try-linux-slave02
try-linux-slave03
try-linux-slave22

Of those, we can survive until working hours without the slaves.  That leaves

preproduction-master
preproduction-stage
dev-stage1

all of which are part of the releng dev (aka preprod aka staging) environment, and not mission-critical, but they should be up sooner rather than later, as releng can't stage changes without those components.  If possible, we should restart these three VMs on another ESX host.  I don't know how to do that in vSphere, but I think it's possible.
preproduction-master and preproduction-stage have been moved to other VM hosts.  dev-stage1 is still copying files and should finish in an hour or so.
Group: infra
The VMs listed above are running again.  bm-vmware01 suddenly came back hours later.  I don't trust the hardware and am reluctant to bring the host back into the ESX cluster until a hardware diagnostic can be run.
Of the VMs in comment #2, only preproduction-master and preproduction-stage are responding to ping. Could we have dev-stage1 back ASAP, and the others at your convenience?
Severity: normal → critical
Severity: critical → major
Assignee: server-ops → rbryce
:nthomas I forgot to change the MAC address in DHCP when I migrated the VM.  The server should be up now.
Thanks for fixing up dev-stage1. What's the plan for investigating the host problem and bringing up the remaining VMs?
I have tried, without success, to find any problems with this server.
rbryce: can we bring it back up along with the Linux builder VMs?
bm-vmware01 is back online.  I started the slaves below. I still have no clue what caused bm-vmware01 to misbehave for 12 hours and then suddenly come back to life. For now I think preproduction-master, preproduction-stage, and dev-stage1 should stay on other ESX hosts.

-- booted VMs --
moz2-linux-slave19
moz2-linux-slave27
moz2-linux-slave34
moz2-linux-slave37
moz2-linux-slave39
moz2-linux64-template
try-linux-slave02
try-linux-slave03
try-linux-slave22
--
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard