Bug 383742 (Closed; opened 17 years ago, closed 17 years ago)

Netapp mount netapp-c-01 is having indigestion

Component: Infrastructure & Operations :: RelOps: General
Type: task
Platform: x86 Linux
Priority: Not set
Severity: blocker
Status: RESOLVED FIXED
Reporter: preed
Assignee: preed

The following VMs went missing early this morning:

-- fx-win32-tbox
-- argo-vm
-- balsa18-branch
-- tb-win32-tbox

These are all VMs on bm-vmware09 using the netapp-c-01 mount. Other VMs on -09 using the netapp-d-02 mount seem to be unaffected (although it doesn't look like they were booted in the first place).

Some other debugging info:

From the bm-vmware09 console:

[root@bm-vmware09 netapp-d-02]# dir /vmfs/volumes/netapp-d-02/
fx-linux-tbox  tb-linux-tbox
[root@bm-vmware09 netapp-d-02]# dir /vmfs/volumes/netapp-c-01
dir: /vmfs/volumes/netapp-c-01: No such file or directory
[root@bm-vmware09 netapp-d-02]#
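For reference, re-checking which NFS datastores the host actually has configured looks roughly like this from the ESX 3 service console (a sketch only; the filer hostname and export path in the last command are placeholders, not the real values):

# List what ESX currently sees under /vmfs/volumes
ls /vmfs/volumes/

# List the NFS datastores configured on this host
esxcfg-nas -l

# Re-add the datastore if the mount just needs to be reattached (placeholder filer/export)
esxcfg-nas -a -o <filer-hostname> -s <export-path> netapp-c-01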

When trying to view the console of the VMs in question, the Virtual Infrastructure manager claims "Virtual Machine config does not exist."

The log window on some of the VMs says:

at 5:22:54 am, "No bootable CD, floppy, or hard disk was detected."

oremj jumped on this, but I've closed the tree due to the lack of coverage (and missing nightlies, too), so this is a blocker.
Assignee: server-ops → justin
Netapp panicked - working with NetApp now.
This has been attributed to a firmware bug - we need to do a file system check, which may take an hour or two, but new disks and firmware will be installed.  I'll comment once it's all back up.  ETA is 2pm (hopefully sooner).
Status: NEW → ASSIGNED
(In reply to comment #2)
> ETA is 2pm (hopefully sooner).

I believe these killed nightly builds, so we'll need to get those out before reopening the tree.

Expect the tree to reopen about 2.5-3 hours after the tinderboxen come back.  

Assignee: justin → server-ops
Status: ASSIGNED → NEW
sorry, misclicked
Assignee: server-ops → justin
Netapps are both ready to go with new shelf firmware.  Please bring up the VMs when you are ready (preed asked me to wait for him)
Severity: blocker → major
Booting VMs now.
Alright, tinderbox running; will reopen tree after the nightlies come out.
Assignee: justin → preed
Netapp crashed again. Back to Justin. He's on the phone with them now.
Assignee: preed → justin
Update:

Today, we had a problem with the NetApp-backed VMware machines; four VMs were affected by this outage:

-- fx-win32-tbox (Firefox trunk win32)
-- tb-win32-tbox (Thunderbird trunk win32)
-- balsa-18branch (1.8-branch memory leak tester)
-- argo-vm (Firefox trunk linux)

Justin spent most of the afternoon on the phone with NetApp, and it seems we hit a pretty bad firmware bug; we could keep trying to work the problem, but it's not looking good.

As such, we're planning to reformat and rebuild these VMs.

fx-win32-tbox and tb-win32-tbox have ref images, so we can pull those from bm-vmware01, convert them to ESX 3, and re-deploy the tinderboxen.
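(For the record, the per-VM redeploy would look roughly like the sketch below; the staging path is a placeholder, and this assumes the ref image disk can simply be imported into the ESX 3 datastore with vmkfstools rather than needing a full Converter pass.)

# Import/convert the ref image disk into the ESX 3 datastore
vmkfstools -i /vmfs/volumes/<staging>/fx-win32-tbox/fx-win32-tbox.vmdk \
    /vmfs/volumes/netapp-c-01/fx-win32-tbox/fx-win32-tbox.vmdk

# Register the rebuilt VM on the host so VirtualCenter can manage it
vmware-cmd -s register /vmfs/volumes/netapp-c-01/fx-win32-tbox/fx-win32-tbox.vmx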

We're going to pull the backup of the balsa image and redeploy it.

We have no copy of argo-vm; however, physical argo has not been reimaged/redeployed, so the current plan is to move the trunk Linux build back to argo for now.

The priority ordering for getting VMs back up is:

1. fx-win32-tbox
2. argo
3. balsa-18branch
4. tb-win32-tbox

We'll be working on this tonight, and will post another update around 1:00 am PDT.
Update:

After some heroics (that took all of his Friday night), Justin worked with the NetApp guys to get the array into a hopefully stable state. The array is currently rebuilding itself, so operations on it are very slow.

Based upon the current rebuild rate, it should complete in ~6 hours.

While the array is rebuilding, we are backing up the actual instances of each affected VM to another location to minimize data loss.
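(The backup is essentially just a copy of each VM's directory off the NetApp datastore onto bm-vmware09's local VMFS; roughly the following, with the local datastore name being a placeholder:)

# Copy each affected VM's directory off the NetApp onto local storage
for vm in fx-win32-tbox tb-win32-tbox balsa-18branch argo-vm; do
    cp -a /vmfs/volumes/netapp-c-01/$vm /vmfs/volumes/<local-datastore>/$vm-backup
done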

After we get the first three VMs on the list restored and in working order, we'll need to:

1. Get a set of nightlies for the trunk
2. Keep the VMs running for 6-8 hours of burn-in time to make sure we really don't have any more problems.

I wouldn't expect the tree to open before noon PDT, June 9th; it may take a bit longer to get everything back in working order.

I'll post another update at 10 am PDT.
Assignee: justin → preed
Severity: major → blocker
Update:

All four VMs are backed up on local storage on bm-vmware09, and are also booted and running from the NetApp.

I don't know if the RAID reconstruction is still in progress, but I'm guessing it is.

I'd like to run these tinderboxen on the NetApp for at least 8 hours of burn-in before we reopen the tree. I'll post another update on how we're doing at 10 am PDT.
RAID reconstruction is at 84% and things look OK.  If all the machine fscks came back OK, then we are good.  I'll need another bit of downtime to revert a setting, but a) I need to wait for the RAID reconstruction to be done, and b) there's no rush.
Alright... we've got a set of nightlies, and all four affected tinderboxen still look to be building.

We're good there.

Waiting for a green cycle on bl-bldxp01.

Hopefully we'll get that within the next couple of hours; then we can reopen the tree.
RAID rebuild is done.  I need 10-15 min more of downtime, but we can do that whenever is good.  Let's get a good set of builds first.
Bug 383875 filed for the test failure on bl-bldxp01.
Bug 383785 is fixed now - bl-bldxp01 is green and has completed 3 cycles with test results similar to before the outage. 

Bug 383919 occurred today but is fixed, so all tinderboxes are reporting correctly.

Do you want to take the downtime before reopening the tree?
Tree reopened.

(Meant bug 383875 in the previous comment)
Think we can resolve this and track the downtime separately.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations