Closed Bug 524047 Opened 15 years ago Closed 15 years ago

Tracking bug for planned power outage

Categories

(Release Engineering :: General, defect, P3)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: mozilla)

Details

On 14th November, there will be no power at the 650 Castro building for most of the day, for maintenance work. This bug is to track:
- what systems to take down/bring back up
- when to send announcements to warn developers (if it takes down the mobile and geriatric machines, should we close the tree?)
Clarification - building is turning down all 120V power.  All the systems in the server room will remain up (excluding the networking hardware - if that's a requirement we can rig something up).
(In reply to comment #1)
> Clarification - building is turning down all 120V power.  All the systems in
> the server room will remain up (excluding the networking hardware - if that's a
> requirement we can rig something up).

So, this means that all of our VMs and minis will stay up but we won't be able to access them?
More or less, yes.
So does this mean even the networking equipment that is in the server room is going to be shut down?  If that is the case, then I think we should shut down the Geriatric master and slaves.  I will try to look in their bioses for a turn on after power failure option.
John: feel free to re-delegate to whoever draws the short straw from the MV crew.
Assignee: nobody → joduinn
Priority: -- → P3
(In reply to comment #4)
> So does this mean even the networking equipment that is in the server room is
> going to be shut down?  If that is the case, then I think we should shut down
> the Geriatric master and slaves.  
Per mrz, yes, so yes, let's power them all down.


> I will try to look in their bioses for a turn
> on after power failure option.
ok.
What needs to be powered off for the weekend:

- mobile master VM and all n810 devices
- geriatric master VM and all old slaves
- talos-staging-master02 and connected 11 minis
- extra minis running in QA lab.

Anything else?
dholbert:

mrz / myself will update this bug on Saturday, after all power work is completed, all building systems are back online and we think it's safe to power back up the n810s.

Again, thanks for your help; if there are any questions/problems, just call my cellphone.
Sounds good -- I'll watch this bug on Saturday.
Will the ESX servers stay on? Otherwise, how do we access them via VI?
N/m, saw comment 3.
The plan right now is:

1) Fri, 5pm?6pm? we'll power down all n810s, geriatric machines, minis here in
650Castro. This needs physical touch, so we're doing this Friday end of
business. From discussions earlier this week, we will only be closing the FF3.6 / mozilla-1.9.2 tree. All other trees will remain open, despite lack of mobile coverage.

2) Sat 9am-4pm: electrical work at 650 Castro.

3) IT to notify us when network, wifi, etc are back online and it's safe
to start repowering up devices.

4) joduinn to power up mobile-master remotely (if needed - it may stay online throughout, and auto-reconnect). dholbert to wait for ok in this bug
before powering up n810s physically. 

5) Once n810s are online, and running some green builds, we'll reopen the tree. Exact time for this depends on (2), (3), (4) and also how long it takes to cycle some green builds. If I had to guess, I'd say sometime between 5pm and 8pm PST, but that's a SWAG.

Note: We're leaving the minis, and geriatric machines, at 650Castro offline
until Monday morning. There are enough machines in colo to handle the
reduced load over the weekend, so this is ok.

Let me know if I missed anything. Obviously, feel free to call my cellphone, the number is in phonebook.
* production-mobile-master is the VM =)  mobile-master is the 1.83GHz mini that can be used for talos.  I might unplug that one tomorrow (Friday).
Any reason why this is mozilla-confidential? I'd like to be able to link to it when I tell the community that the tree is going down.
Couldn't get anyone in #build to tell me why this was closed, so opening it up. Please, uh, don't put your phone numbers in here ;)
Group: mozilla-corporation-confidential
Assignee: joduinn → aki
On Friday afternoon, we discovered this power outage had been rescheduled to Sat
21st. Handing over to Aki, as I'll be away that weekend.
Adding Raymond to the bug since he will be handling the shutdown of the QA lab on Friday.
I called Cheryl and verified that the power outage is on for this Saturday.
(In reply to comment #12) (before the outage was postponed)
> 4) joduinn to power up mobile-master remotely (if needed - it may stay online
> throughout, and auto-reconnect). dholbert to wait for ok in this bug
> before powering up n810s physically. 

For the record, I'm out of town this coming weekend, so I won't be available to power up the n810s. (I told John about this when we first learned about the postponement, a week ago -- I assume he or someone else is looking into other solutions for getting these devices back up when the power comes on.)
Ah. joduinn is in Ireland, and my parents are coming up on Saturday.

I can try leaving the n810s on battery and remotely bring them back up Saturday afternoon/early evening; if that doesn't work I can drive into the office later in the evening and power 'em up.
I'm only going to be shutting down the Machines that belong to QA at the QA LAB on Friday. Who from IT is going to handle the Talos machines in the lab?
I can make sure they're off by EOD Friday, and bring them up on Monday.
mrz and joduinn are both out on Friday and Saturday, so I'll need to call Cheryl again to update her contacts.
I'm still getting emails, though, and IT's involved in this process.
I am going to get a car for today.  I can turn the devices back on.  Can someone give me a call on my cell (in phonebook) when the power is back on?
jhford is sitting in MV waiting on word; does anyone know anything?
I'm going to start calling people in 20.
I haven't heard any updates but as I had mentioned, as soon as things are okay we'll update.

Quite frankly I don't believe we lost power in the fashion building management said.  The network gear in suite 380 is still up:

mv-core1 uptime is 24 weeks, 2 days, 59 minutes
mv-core2 uptime is 24 weeks, 1 day, 3 hours, 59 minutes

The switch in suite 280 came up 5 hours ago:
mvsx2-01 uptime is 5 hours, 57 minutes

Derek had planned on being onsite around 4pm at the end of their work window.
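The reasoning from the uptime figures above can be sketched in a few lines of Python. This is only an illustration, not a tool anyone used in this bug: the helper names `parse_uptime` and `booted_after` are hypothetical, and the observation time and year in the usage note are assumed placeholders (the comment does not say when the uptimes were read). The idea is to subtract each reported uptime from the time it was read and check whether the resulting boot time falls after the start of the work window.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit for the units that appear in the quoted uptime lines.
_UNIT_SECONDS = {"week": 7 * 86400, "day": 86400, "hour": 3600, "minute": 60}


def parse_uptime(line: str) -> timedelta:
    """Parse text such as 'mv-core1 uptime is 24 weeks, 2 days, 59 minutes'."""
    # Split off the hostname so digits in it (e.g. 'mvsx2-01') are ignored.
    _, sep, tail = line.partition("uptime is")
    if not sep:
        raise ValueError(f"no 'uptime is' found in: {line!r}")
    total = 0
    for count, unit in re.findall(r"(\d+)\s+(week|day|hour|minute)s?", tail):
        total += int(count) * _UNIT_SECONDS[unit]
    return timedelta(seconds=total)


def booted_after(line: str, observed_at: datetime, window_start: datetime) -> bool:
    """True if the device's last boot happened at or after window_start."""
    return observed_at - parse_uptime(line) >= window_start
```

For example, with an assumed reading time of 3pm on the Saturday and the 9am start of the work window (year assumed), `booted_after("mvsx2-01 uptime is 5 hours, 57 minutes", datetime(2009, 11, 21, 15, 0), datetime(2009, 11, 21, 9, 0))` is true, while the same call for either mv-core line is false, which matches the observation that only the suite 280 switch restarted.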
There are also a number of devices already connected to wifi.
I've gotten green cycles on the n810s.  Opening the tree.
mrz, aki and myself have not heard "all clear" from Cheryl. I've called her direct number just now, but it went to an answering service operator who had no info about the power outage work. Left msg, but not sure when I'll get a callback.

At this time (6:40pm Sat evening), I'm going to guess that electrical work is completed for the day and it's safe to leave machines up. If there are any issues, please comment here.

The minis and most of the n810s are back up and online; a few n810s are still off, but there's certainly plenty online for the weekend load.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering