Closed Bug 726384 Opened 13 years ago Closed 13 years ago

bm-vmware01.build.sjc1 Down

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86_64
Windows 7
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rbryce, Assigned: rbryce)

Details

The server started alerting on iLO communication errors. I restarted the iLO card in the hope of regaining access to Blade 7, but the blade appears to have failed.
The blade seemed to crash as soon as I logged in after the iLO reset.
That host was running the following VMs: dev-stage1, moz2-linux-slave19, moz2-linux-slave27, moz2-linux-slave34, moz2-linux-slave37, moz2-linux-slave39, moz2-linux64-template, preproduction-master, preproduction-stage, try-linux-slave02, try-linux-slave03, try-linux-slave22. Of those, we can survive until working hours without the slaves. That leaves preproduction-master, preproduction-stage, and dev-stage1, all of which are part of the releng dev (aka preprod, aka staging) environment and not mission-critical, but they should be up sooner rather than later, as releng can't stage changes without those components. If possible, we should restart the latter three VMs on another ESX host. I don't know how to do that in vSphere, but I think it's possible.
preproduction-master and preproduction-stage have been moved to other VM hosts. dev-stage1 is still copying files and should finish in an hour or so.
Group: infra
The VMs listed above are running again. bm-vmware01 suddenly came back hours later. I don't trust the hardware and am reluctant to bring the host back into the ESX cluster until a hardware diagnostic can be run.
Of the VMs in comment #2, only preproduction-master and preproduction-stage are responding to ping. Could we have dev-stage1 back ASAP, and the others at your convenience?
Severity: normal → critical
Severity: critical → major
Assignee: server-ops → rbryce
:nthomas I forgot to change the MAC address in DHCP when I migrated the VM. The server should be up now.
Thanks for fixing up dev-stage1. What's the plan for investigating the host problem and bringing up the remaining VMs?
I have tried, without success, to find any problems with this server.
rbryce: can we bring it back up, along with the Linux builder VMs?
bm-vmware01 is back online, and I have started the slaves listed below. I still have no clue what caused bm-vmware01 to act so badly for 12 hours and then suddenly resurrect. For now I think preproduction-master, preproduction-stage, and dev-stage1 should stay on other ESX hosts. Booted VMs: moz2-linux-slave19, moz2-linux-slave27, moz2-linux-slave34, moz2-linux-slave37, moz2-linux-slave39, moz2-linux64-template, try-linux-slave02, try-linux-slave03, try-linux-slave22.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard