This outage happened yesterday (Sunday). This bug is to track the outage along with our other usual outage bugs. Leaving as "severity:normal" because the outage is now over. Leaving open until IT has finished its investigation and posted an RFO about what broke.

-------- Original Message --------
Subject: VMWare outage today
Date: Sun, 06 Nov 2011 16:58:10 -0600

We experienced a partial outage of the VMWare cluster in sjc1 today. bm-vmware02 disappeared at 12:15:28, and returned to service at 12:30:52. The ASR (automatic server reboot) functionality of the host hardware caused the reboot after the host became unresponsive. The VMs on this host are not set to autostart, so they remained down until they were brought back manually at about 2:30, after some investigation into the cause of the failure.

The affected VMs are:
bm-vpn01
production-master02
moz2-linux-slave18
moz2-linux-slave27
moz2-linux-slave31
moz2-linux-slave37
try-linux-slave02
try-linux-slave20
try-linux-slave28

Note that, from everything I can tell, production-master02 is not in use right now. All hosts except that one are now back up and, to the best of my ability to tell, functioning properly. I filed bug 700180 to delete production-master02.

I don't have any additional information on the cause of the failure.

Dustin
As I said in the RFO you quoted, there's no additional information.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
I thought we had a description of the outage as it happened, but I didn't see anything about what *caused* the outage. Per discussion with mrz today, I have now learned that the RFO from IT explaining what caused the outage is this line in comment 0:

"The ASR (automatic server reboot) functionality of the host hardware caused the reboot after the host became unresponsive."

Given that this intermittent misfire of ASR is a known issue with VMWare, happens very infrequently (approximately once a year), and leaves nothing in the logs to help debug why it happened, IT feels this is not worth investigating further. Adding to the bug for historical tracking, and leaving the bug closed.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations