Bug 383742 (Closed; opened 17 years ago, closed 17 years ago)

Netapp mount netapp-c-01 is having indigestion

Component: Infrastructure & Operations :: RelOps: General
Type: task
Platform: x86 Linux
Priority: Not set
Severity: blocker
Status: RESOLVED FIXED
Reporter: preed
Assignee: preed

The following VMs went missing early this morning:

-- fx-win32-tbox
-- argo-vm
-- balsa18-branch
-- tb-win32-tbox

These are all VMs on bm-vmware09 using the netapp-c-01 mount. Other VMs on -09 using the netapp-d-02 mount seem to be unaffected (although it doesn't look like they were booted in the first place).

Some other debugging info:

From the bm-vmware09 console:

[root@bm-vmware09 netapp-d-02]# dir /vmfs/volumes/netapp-d-02/
fx-linux-tbox  tb-linux-tbox
[root@bm-vmware09 netapp-d-02]# dir /vmfs/volumes/netapp-c-01
dir: /vmfs/volumes/netapp-c-01: No such file or directory
[root@bm-vmware09 netapp-d-02]#
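For reference, re-checking which NFS datastores the host actually has configured looks roughly like this from the ESX 3 service console (a sketch only; the filer hostname and export path in the last command are placeholders, not the real values):

# List what ESX currently sees under /vmfs/volumes
ls /vmfs/volumes/

# List the NFS datastores configured on this host
esxcfg-nas -l

# Re-add the datastore if the mount just needs to be reattached (placeholder filer/export)
esxcfg-nas -a -o <filer-hostname> -s <export-path> netapp-c-01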

When trying to view the console of the VMs in question, the Virtual Infrastructure manager claims "Virtual Machine config does not exist."

The log window on some of the VMs says:

at 5:22:54 am, "No bootable CD, floppy, or hard disk was detected."

oremj jumped on this, but I've closed the tree due to the lack of coverage (and missing nightlies, too), so this is a blocker.
Assignee: server-ops → justin
Netapp panicked - working with NetApp now.
This has been attributed to a firmware bug - we need to do a file system check, which may take an hour or two, but new disks and firmware will be installed.  I'll comment once it's all back up.  ETA is 2pm (hopefully sooner).
Status: NEW → ASSIGNED
(In reply to comment #2)
> ETA is 2pm (hopefully sooner).

I believe these killed nightly builds, so we'll need to get those out before reopening the tree.

Expect the tree to reopen about 2.5-3 hours after the tinderboxen come back.  

Assignee: justin → server-ops
Status: ASSIGNED → NEW
sorry, misclicked
Assignee: server-ops → justin
Netapps are both ready to go with new shelf firmware.  Please bring up the VMs when you are ready (preed asked me to wait for him)
Severity: blocker → major
Booting VMs now.
Alright, tinderbox running; will reopen tree after the nightlies come out.
Assignee: justin → preed
Netapp crashed again. Back to Justin. He's on the phone with them now.
Assignee: preed → justin
Update:

Today, we had a problem with the NetApp-backed VMware machines; four VMs were affected by this outage:

-- fx-win32-tbox (Firefox trunk win32)
-- tb-win32-tbox (Thunderbird trunk win32)
-- balsa-18branch (1.8-branch memory leak tester)
-- argo-vm (Firefox trunk linux)

Justin spent most of the afternoon on the phone with NetApp, and it seems we hit a pretty bad firmware bug; we could keep trying to work the problem, but it's not looking good.

As such, we're planning to reformat and rebuild these VMs.

fx-win32-tbox and tb-win32-tbox have ref images, so we can pull those from bm-vmware01, convert them to ESX 3, and re-deploy the tinderboxen.
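(For the record, the per-VM redeploy would look roughly like the sketch below; the staging path is a placeholder, and this assumes the ref image disk can simply be imported into the ESX 3 datastore with vmkfstools rather than needing a full Converter pass.)

# Import/convert the ref image disk into the ESX 3 datastore
vmkfstools -i /vmfs/volumes/<staging>/fx-win32-tbox/fx-win32-tbox.vmdk \
    /vmfs/volumes/netapp-c-01/fx-win32-tbox/fx-win32-tbox.vmdk

# Register the rebuilt VM on the host so VirtualCenter can manage it
vmware-cmd -s register /vmfs/volumes/netapp-c-01/fx-win32-tbox/fx-win32-tbox.vmx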

We're going to pull the backup of the balsa image and redeploy it.

We have no copy of argo-vm; however, physical argo has not been reimaged/redeployed, so the current plan is to move the trunk Linux build back to argo for now.

The priority ordering for getting VMs back up is:

1. fx-win32-tbox
2. argo
3. balsa-18branch
4. tb-win32-tbox

We'll be working on this tonight, and will post another update around 1:00 am PDT.
Update:

After some heroics (that took all of his Friday night), Justin worked with the NetApp guys to get the array into a hopefully stable state. The array is currently rebuilding itself, so operations on it are very slow.

Based upon the current rebuild rate, it should complete in ~6 hours.

While the array is rebuilding, we are backing up the actual instances of each affected VM to another location to minimize data loss.
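(The backup is essentially just a copy of each VM's directory off the NetApp datastore onto bm-vmware09's local VMFS; roughly the following, with the local datastore name being a placeholder:)

# Copy each affected VM's directory off the NetApp onto local storage
for vm in fx-win32-tbox tb-win32-tbox balsa-18branch argo-vm; do
    cp -a /vmfs/volumes/netapp-c-01/$vm /vmfs/volumes/<local-datastore>/$vm-backup
done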

After we get the first three VMs on the list restored and in working order, we'll need to:

1. Get a set of nightlies for the trunk
2. Keep the VMs running for 6-8 hours of burn-in time to make sure we really don't have any more problems.

I wouldn't expect the tree to open before noon PDT, June 9th; it may take a bit longer to get everything back in working order.

I'll post another update at 10 am PDT.
Assignee: justin → preed
Severity: major → blocker
Update:

All four VMs are backed up on local storage on bm-vmware09, and are also booted and running from the NetApp.

I don't know if the RAID reconstruction is still in progress, but I'm guessing it is.

I'd like to run these tinderboxen on the NetApp for at least 8 hours of burn-in before we reopen the tree. I'll post another update on how we're doing at 10 am PDT.
RAID reconstruction is at 84% and things look OK.  If all the machine fscks came back OK, then we are good.  I'll need another bit of downtime to revert a setting, but a) I need to wait for the RAID reconstruction to be done, and b) there's no rush.
Alright... we've got a set of nightlies, and all four affected tinderboxen still look to be building.

We're good there.

Waiting for a green cycle on bl-bldxp01.

Hopefully we'll get that within the next couple of hours; then we can reopen the tree.
RAID rebuild is done.  I need 10-15 min more of downtime, but we can do that whenever is good.  Let's get a good set of builds first.
Bug 383875 filed for the test failure on bl-bldxp01.
Bug 383785 is fixed now - bl-bldxp01 is green and has completed 3 cycles with test results similar to before the outage. 

Bug 383919 occurred today but is fixed, so all tinderboxes are reporting correctly.

Do you want to take the downtime before reopening the tree?
Tree reopened.

(Meant bug 383875 in the previous comment)
Think we can resolve this and track the downtime separately.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations