Closed Bug 700502 Opened 13 years ago Closed 13 years ago

VMware outage in sjc1

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Unassigned)

Details

This outage happened yesterday (sunday). This bug is to track the outage along with our other usual outage bugs. Leaving as "severity:normal" because the outage is now over. Leaving open until IT has finished investigation and posted RFO about what broke.


-------- Original Message --------
Subject: VMWare outage today
Date: Sun, 06 Nov 2011 16:58:10 -0600

We experienced a partial outage of the VMWare cluster in sjc1 today.

bm-vmware02 disappeared at 12:15:28, and returned to service at 12:30:52.

The ASR (automatic server reboot) functionality of the host hardware
caused the reboot after the host became unresponsive.  The VMs on this
host are not set to autostart, so they remained down until they were
brought back manually at about 2:30, after some investigations into the
cause of the failure.  The affected VMs are:

bm-vpn01
production-master02
moz2-linux-slave18
moz2-linux-slave27
moz2-linux-slave31
moz2-linux-slave37
try-linux-slave02
try-linux-slave20
try-linux-slave28

Note that, from everything I can tell, production-master02 is not in use
right now.  All hosts except that one are now back up and, to the best
of my ability to tell, functioning properly.  I filed bug 700180 to
delete production-master02.

I don't have any additional information on the cause of the failure.

Dustin
As I said in the RFO you quoted, there's no additional information.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
I thought we had a description of the outage as it happened, but didnt see anything about what *caused* the outage.

Per discussion with mrz today, I have now learned that the RFO from IT explaining what caused the outage is this line in comment#0: "The ASR (automatic server reboot) functionality of the host hardware caused the reboot after the host became unresponsive." 

Given that this intermittent-misfire-of-ASR is a known issue with VMWare, happens very infrequently (approx once a year or so), and leaves nothing in the logs to help debug why it happened, IT feels this is not worthwhile investigating further.   

Adding to bug for historical tracking, and leaving bug closed.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.