Closed Bug 603452 Opened 9 years ago Closed 9 years ago

Santa Clara datacenter offline

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: dmoore)

Details

...losing all RelEng systems there and closing the tree.
Assignee: server-ops → dmoore
Still diagnosing remotely, but our VPN servers lost power just before 13:11 and recovered at 13:26. No other infrastructure appears to be significantly impacted.

We will investigate further on site, with the assistance of the facility.
Looks like they're down again, according to Nagios?
Summary from IRC and in-person meetings:

*) A faulty PDU in rack#5 caused power problems at Internap today at ~13:15 and blew out one of our 2 power circuits there. The Internap crew unplugged that one rack and was able to restore power to the other circuits.

*) The outage cut power to the VPN machine and some test slaves at Internap, but not to the RelEng build masters. Because the VPN was down, all machines at Internap appeared offline to Nagios.

*) To get to a known state, we stopped all masters and minis, then started the masters, then started the slaves, which reconnected to the masters.

*) The tree remained closed until we got the last of the minis online a few minutes ago.

*) Currently all ix build machines in rack#5 are off, and will remain off until the replacement PDU arrives (or the electrical problems with rack#5 are debugged). Also, aki will file a standard reboot bug for the 5 minis that did not reconnect as expected. Apart from these, all other RelEng systems are back online as normal and the tree is now reopened.
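
For anyone repeating the "known state" recovery described above, the ordering can be sketched roughly as below. This is only an illustration of the sequence (stop everything, start masters, then start slaves so they have something to reconnect to), not the tooling that was actually used. The `Service` class, the `staged_restart` function, and the host names are hypothetical, and minis are simply modeled as slaves.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for a build master or a slave (minis count as slaves here)."""
    name: str
    role: str  # "master" or "slave"
    running: bool = True


def staged_restart(services):
    """Stop everything that is running, then start masters before slaves.

    Returns the ordered list of (action, name) pairs so the sequence
    can be reviewed or logged.
    """
    log = []
    # Phase 1: stop every running service so nothing is left half-connected.
    for svc in services:
        if svc.running:
            svc.running = False
            log.append(("stop", svc.name))
    # Phase 2: start masters first, then slaves, so slaves find a master to connect to.
    for role in ("master", "slave"):
        for svc in services:
            if svc.role == role:
                svc.running = True
                log.append(("start", svc.name))
    return log


if __name__ == "__main__":
    # Example names are made up for illustration.
    fleet = [Service("mini-01", "slave"), Service("master-01", "master")]
    for action, name in staged_restart(fleet):
        print(action, name)
```

In a real recovery the start step would also need to confirm that each slave actually reconnected, which is what surfaced the 5 minis mentioned above.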
Severity: blocker → normal
back online
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard