Closed Bug 448908 Opened 16 years ago Closed 16 years ago

bm-vmware05 crashed, taking down a bunch of VMs

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: mrz)

Details

I tried to ssh or VNC into it, but no luck.

Any suggestions?
There's a lot of this on the console:
  sd 0:0:0:0 timing out command, waited 360s
  end_request: I/O error, dev sda, sector 20291311
for assorted sectors. And some 
  Read-error on swap-device (0:0:0:20291319)
but not always paired with the first set. It's not using much CPU or RAM, or doing much disk or network access, according to the Performance info in the VI Client.

Can't get a prompt on the console, so trying a reboot.
Assignee: nobody → nthomas
The gentle method (using "Reboot Guest VM") got no response. The more determined method (using "Reset") is hung at 95% completion, so we need help from Server Ops. There isn't a good place to reclone this machine from, so let's try to recover it first.
Assignee: nthomas → server-ops
Severity: normal → major
Component: Release Engineering → Server Operations
QA Contact: release → mrz
FWIW, looks like it failed sometime after 09:10 PDT on Wed 30 Jul.
The management process on bm-vmware05 died; trying to figure out how to restart it.
Everything is pointing to an iSCSI issue - can't tell which iSCSI LUN was at fault.

While debugging with VMware on the phone, the box crashed taking down the following VMs:

fx-linux-1.9-slave1
moz2-win32-slave06
prometheus-vm
tb-linux-tbox
try-master

Punting over to RE to bring things back up.  Still working with VMware on root cause.
Assignee: server-ops → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Talked to mrz:

1) Filed bug#449059 to track reviving the downed VMs.
2) Pushing this bug back to IT to track fixing the root cause of kernel panic on bm-vmware05.
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Summary: fx-linux-1.9-slave1 is unreachable → bm-vmware05 crashed, taking down a bunch of VMs
Assignee: server-ops → mrz
VMware blames HP's management tools. HP doesn't think so but does recommend upgrading from 8.0 to 8.1.

I doubt it's related to that, since it only crashed after the VMware tech was trying to fix some other SAN issues, but I'll upgrade.
Status: NEW → ASSIGNED
updated.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard