Closed Bug 435134 Opened 16 years ago Closed 16 years ago

Connection problems to netapps causing corruption of tinderbox file systems

Categories

(mozilla.org Graveyard :: Server Operations, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: justin)

References

Details

Attachments

(2 files)

On http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla1.8, the following machines are burning. Interestingly, they all started burning around 05:57am-06:10am this morning. Even though they all give different errors, the timing makes me wonder if these machines are failing for a related reason?

 balsa-18branch
  http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211416560.1211416689.25547.gz&fulltext=1 ends with:
  cvs checkout: cannot open directory /cvsroot/:ext:ffxbld@cvs.mozilla.org:/cvs: No such file or directory
  cvs checkout: skipping directory mozilla/extensions/webdav/tests


 crazyhorse 
  http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211415900.1211416021.23668.gz&fulltext=1 ends with:
   cc1plus: internal compiler error: Segmentation fault


 prod-pacifica-vm
  http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211415600.1211416259.24280.gz&fulltext=1 ends with:
  ../../dist/lib/gkxtfbase_s.lib : fatal error LNK1136: invalid or corrupt file
Severity: normal → blocker
Priority: -- → P1
To add to the fun of "random" failures, crazyhorse actually had five failures earlier, from 02:17 to 03:03, claiming rather massive CVS troubles with its tree, though the nightly clobber cleared it up, and prod-pacifica-vm failed on its first try at a nightly, dying with "make[5]: *** read jobs pipe: No such file or directory.  Stop.", either or both of which may or may not be connected.
production-prometheus-vm just started burning with the following error:

lib/libxulapp_s.a  -L../../dist/lib -lmozpng -L../../dist/lib -lmozjpeg -L../../dist/lib -lmozz  -L-L../../dist/bin -L../../dist/lib -lcrmf -lsmime3 -lssl3 -lnss3 -lsoftokn3   -lmozcairo -lmozlibpixman   -L/usr/X11R6/lib -lXrender -lX11  -lfontconfig -lfreetype -L/usr/X11R6/lib -lXt -L/usr/X11R6/lib   -lXft -lX11 -lfreetype -lXrender -lfontconfig   -L../../dist/lib -lxpcom_compat    
../../dist/lib/components/libgklayout.a: could not read symbols: Memory exhausted
collect2: ld returned 1 exit status


Summary: 3 machines start burning at same time on 1.8 waterfall page → 4 machines start burning at same time on 1.8 waterfall page
I note there have been no checkins for days on that branch, so it's not clear what is causing all these machines to start burning now.
Argh... and a bunch of talos machines off also? Is this bug a dup of bug#435052?
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → justin
Assignee: server-ops → mrz
Not entirely sure what the fix action is here.  I don't see anything network-wise to indicate a problem.

What else would cause this error?

  cvs checkout: cannot open directory
/cvsroot/:ext:ffxbld@cvs.mozilla.org:/cvs: No such file or directory
  cvs checkout: skipping directory mozilla/extensions/webdav/test

Does that mean cvs.mozilla.org couldn't find the file to serve?
I see nothing in these errors to indicate it has anything to do with the network.  Bug 435052, on the other hand, clearly points to a possible network hiccup, but at totally different times and unrelated (I think).  Can this go back to releng due to the lack of anything pointing at the network layer?
Note: I've tried restarting tinderbox on crazyhorse around 8pm. It's done a few builds at this point, and is still burning.
kicking back to releng for troubleshooting.  feel free to pull us in if you have any evidence this might be infra related.
Assignee: mrz → joduinn
Component: Server Operations → Release Engineering
so I may be wrong here - per John, all of these are vm's with network storage and all had issues at the same time.  given that, this may very well be a network issue if the storage fell out from under the host.  still grasping for a reason or direction to diagnose though.
Assignee: joduinn → mrz
Component: Release Engineering → Server Operations
The cvs error is the only one that stands out as not local to the host - the others could all be SAN related (corrupt file, memory exhaustion (if it's the page file?)).  But I don't see anything on any switches or any switch ports going to any of the filers that shows anything alarming.

Still looking...
fyi, all of these VMs are on netapp-c.
I've just rebooted crazyhorse VM. It's booting and rechecking disks (very slow!). Maybe this reboot and a clobber build will do the trick?
mrz just notified me that fx-linux-tbox is also hung, trying to do a build since 16:04 this afternoon.
Dunno if it's related or not, but qm-centos5-01, qm-centos5-03, qm-xserve01, and (a bit less likely) qm-win2k3-pgo01 are failing in what doesn't look to me like a familiar way - the Linux and Mac ones since late morning, the Win-pgo... hard to tell, since it takes forever, and seems to have done one burning build and one "green" with no results build today.
(In reply to comment #12)
> I've just rebooted crazyhorse VM. It's booting and rechecking disks (very
> slow!). Maybe this reboot and a clobber build will do the trick?
> 
Justin pushed crazyhorse through two unhappy fscks while I was driving home.
Crazyhorse is now cleanly rebooted. I've restarted tinderbox, but don't have
privs to trigger a clobber build. Let's see if the depend build goes ok on crazyhorse.
(In reply to comment #13)
> mrz just notified me that fx-linux-tbox is also hung, trying to do a build
> since 16:04 this afternoon.
> 
After mrz killed two "Bug Buddy" instances and rebooted the VM, fx-linux-tbox is green again.
(In reply to comment #15)
> (In reply to comment #12)
> > I've just rebooted crazyhorse VM. It's booting and rechecking disks (very
> > slow!). Maybe this reboot and a clobber build will do the trick?
> > 
> Justin pushed crazyhorse through two unhappy fscks while I was driving home.
> Crazyhorse is now cleanly rebooted. I've restarted tinderbox, but don't have
> privs to trigger a clobber build. Let's see if the depend build goes ok on crazyhorse.
> 
Success - crazyhorse just turned green!
After the initial errors balsa-18branch hit in comment#0, balsa-18branch continues to fail, but now fails with:

C mozilla/intl/unicharutil/util/.cvsignore
C mozilla/intl/unicharutil/util/Makefile.in
C mozilla/intl/unicharutil/util/nsCompressedCharMap.cpp
C mozilla/intl/unicharutil/util/nsCompressedCharMap.h
C mozilla/intl/unicharutil/util/nsUnicharUtils.cpp
C mozilla/intl/unicharutil/util/nsUnicharUtils.h
cvs [checkout aborted]: /case.dat/1.1/Fri Jan  8 00:19:24 199/CVSROOT: No such file or directory
gmake: *** Conflicts during checkout.
C mozilla/intl/unicharutil/util/.cvsignore
C mozilla/intl/unicharutil/util/Makefile.in
C mozilla/intl/unicharutil/util/nsCompressedCharMap.cpp
C mozilla/intl/unicharutil/util/nsCompressedCharMap.h
C mozilla/intl/unicharutil/util/nsUnicharUtils.cpp
C mozilla/intl/unicharutil/util/nsUnicharUtils.h
gmake[1]: *** [real_checkout] Error 1
gmake[1]: Leaving directory `/builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend'
gmake: *** [checkout] Error 2
Error: CVS checkout failed.


I did the following:
- stop multi tinderbox
- renamed /builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend/mozilla to /builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend/mozilla.bad, 
- start multi tinderbox
...and it's taking longer to fail out, which is I guess progress. Stay tuned.
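(For reference, a minimal sketch of that cleanup as shell commands; the path comes from the log above, but the exact way multi tinderbox is stopped and started isn't shown in this bug, so those steps are left as comments rather than real invocations:)

$ # stop the multi tinderbox client first (exact command depends on how it was started)
$ cd /builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend
$ mv mozilla mozilla.bad    # keep the corrupt checkout around for later inspection
$ # start multi tinderbox again; the next cycle does a fresh CVS checkout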
Success - balsa-18branch is now green.
Here are the scsi errors from the system log of production-prometheus-vm, with the dhclient lines removed. Running fsck on it now, which is finding lots of things to fix.
I think these machines are all fixed up now, even if we don't know the cause. Some details:

production-prometheus-vm (netapp-c-fcal1/bm-vmware11, vmware tools out of date)
* required a bunch of fsck runs and reboots to get the partitions fixed up, and vmware tools updated

production-pacifica-vm (netapp-c-fcal1/bm-vmware11, vmware tools out of date)
* used |chkdsk C:\ /F| to get a check on reboot, no errors found

balsa-18branch (netapp-c-001/bm-vmware07, vmware tools out of date)
* updated vmware tools (switched from rpm to tar.gz install); the reboot from that started running fsck for days-since-last-check
* when restarting this box, you have to open a console using the VI client and run ~/start-X-repeatedly.sh (otherwise it fails the tinderbox tests)

crazyhorse (netapp-c-001/bm-vmware09, vmware tools out of date)
* updated vmware tools automatically, rebooted

We also have a problem with:
moz2-win32-slave (netapp-c-fcal1/bm-vmware06, vmware tools out of date)
* checked the C, D, and E disks
* updated vmware tools


 
(In reply to comment #21)
> We also have a problem with:

This should say "had".
The log isn't super verbose and this is mainly here for timing. The disk error is 
  "The driver detected a controller error on \Device\Harddisk0" 
The symmpi one is
  "The device, \Device\Scsi\symmpi1, is not ready for access yet"
Random guessing time -

Could be a transient network fault, as we guessed yesterday. Or it could be something in the netapp itself. The machines here all seem to be on netapp-c-001 and netapp-c-fcal1 - can I read that as different shelves on the netapp-c backplane/device/host ? Is there anything in the logs for that netapp which indicates a problem ? (eg did some drives die and it's coping as best it can ?)

I had a quick look at /var/log/messages on linux boxes using netapp-d-001 (l10n-linux-tbox, prometheus-vm, moz2-linux-slave1) and don't find any scsi errors like attachment 322081 [details]. I do see them for karma, which is on netapp-c-001 and had a couple of red builds yesterday.

There are lots of scsi messages in bm-vmware09:/var/log/{vmkernel,vmkwarning}, which seem to start on May 9 - possible fallout from the ESX3.5 upgrade on May8 or just a change in logging messages ?

The VI client reports some inaccessible VMs on bm-vmware08:
* bm-symbolfetch01 (netapp-c-fcal1) - should have been on
* try2-linux-slave, try2-win32-slave (netapp-c-fcal1) - should be on, can see other try VM's so probably not a network sandbox thing
* backup-pacifica-vm, backup-prometheus-vm (netapp-c-001) - would have been off
* tb-win32-tbox (netapp-c-001) - would have been off

Let us know if there's anything we can do to help diagnose.
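(The quick look at /var/log/messages above amounts to something like this on each Linux VM; a hedged sketch, not necessarily the exact command used:)

$ grep -iE 'scsi|i/o error' /var/log/messages | tail -n 50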
Really helpful Nick - thanks.  All signs point to a network issue, as we saw this + the disconnects in the morning.  The netapp is just exposing a block device so any corruption would most likely happen as a result of the host losing network access to the netapp, and having issues.  I'll talk with mrz more tomorrow morning and dig into the netapp logs to see if I can find anything else...
from the netapp logs - starting on May 20th, we start seeing a lot of reconnects and abort tasks from the vmware hosts:

Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware09) sent LUN Reset request, aborting all SCSI commands on lun 2
Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware05) sent LUN Reset request, aborting all SCSI commands on lun 1
Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware07) sent LUN Reset request, aborting all SCSI commands on lun 1

Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:pm-vmware01 at IP addr 10.253.0.92
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:bm-vmware10 at IP addr 10.253.0.229
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:bm-vmware03 at IP addr 10.253.0.223
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:bm-vmware06 at IP addr 10.253.0.237
Netapp P1 Case #3101612 opened...
We had some more issues with machines on netapp-c today, slightly different symptoms but most likely related to this (I'm just making a note of it really, while we wait for Netapp to respond).

fx-linux-tbox - tinderbox crashed out at some point after 20:15 PDT yesterday. The symlink from /bin/sh to bash was broken, but was fixed after a reboot (no disk check was done).

balsa-18branch -  stopped cycling at 20:25 PDT yesterday, multi-tinderbox was running but wasn't doing anything. Nothing in /var/log/messages since May 18 (odd given the reboots it had thursday). Rebooted it, it didn't do a disk check, started tinderbox, cycled OK.

karma - CVS conflicts from 20:09 PDT. Rebooted, fsck triggered, still in progress ...
sm-try2-win32-slave has had some trouble too. Lots of errors like this in the system log:
The driver detected a controller error on \Device\Harddisk0.

I forced a check on all of the disks and rebooted. The problem appears to be gone now.
yea - I saw this issue while doing some troubleshooting and I think it's definitely related to netapp-c or its network connection.  I took traces and they have a lot of diag data now.  It's a P1 case with them - hoping to have something back today.
Assignee: mrz → justin
fyi, also trying to find out if I can increase the iscsi timeout on ESX to help cope in the interim.
Updating summary
Summary: 4 machines start burning at same time on 1.8 waterfall page → Connection problems to netapp-c causing corruption of tinderbox file systems
netapp pointed us at a bug in 7.2.2 - upgraded to 7.2.4.  fixed.  Bug details:


Bug ID: 226424
Title: WAFL mismanagement of network buffers results in poor filer performance.
Bug Severity: 2 - System barely usable
Bug Status: Fixed
Product: Data ONTAP
Bug Type: WAFL
Description:

 A filer may exhibit poor performance due to WAFL holding on to too many network
 buffers and not releasing them in a timely fashion. This situation can result
 in the unavailability of these buffers at the network interface, causing the filer
 to drop packets. Clients must then retransmit the packets until they are accepted,
 causing slower access to the filer.

Fixed-In Version:

    * Data ONTAP 7.2.4 (GD) - Fixed
    * Data ONTAP 7.2.4L1 (GD) - Fixed
    * Data ONTAP 7.3RC1 (RC) - Fixed

Related Bugs: 145410, 225471
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Sorry to say that we're still having problems with the tinderbox. 

For example tbnewref-win32-tbox on netapp-c-fcal1/bm-vmware06, which started burning at 2008/05/23 21:28 and today at 13:46. Both times were CVS conflicts when there were no checkins to cause them. After the first one I ran chkdsk on all the disks, updated vmware tools, and the next build was a clobber for the nightly; so either chkdsk sucks (plausible :-)) or there is repeated corruption occurring.

Also had a kernel panic on fx-linux-tbox (netapp-c-fcal1) sometime after 7:45 today. It got a fsck on each disk and a reboot, and still managed to burn around 11:30 after no checkins. Also had panics on l10n-linux-tbox (netapp-d-002) - will try to get a screen grab of the trace if it happens again (or we could add a serial port to the VM and do console logging).

On the Mozilla1.8, the balsa-18branch (netapp-c-001) and production-prometheus-vm machines (netapp-c-fcal1) are troubled.

Could you please take a look at the netapp logs to confirm the connections from the VM hosts are still being lost. Hopefully there will be something obvious there. If not, would we have to wait until Tuesday to get hold of Netapp and/or VMware ? We might be able to live with that given that most of the trees are effectively closed at the moment but a potential Firefox 3.0 RC2 looms early next week.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
seeing the reconnects only on netapp-c now, not on -d.  Talking with netapp...
Another angle of attack - we started having these problems on Tuesday

2008/05/20 17:52 PDT - balsa-18branch - false compile fail - http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211331120.1211331943.21904.gz

2008/05/21 02:17 PDT - crazyhorse - http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211361420.1211362134.7330.gz

Were there any infra changes around then ?
(In reply to comment #34)
> On the Mozilla1.8, the balsa-18branch (netapp-c-001) and
> production-prometheus-vm machines (netapp-c-fcal1) are troubled.
From looking at the logs, both of these machines are failing out just like they were earlier this week. For now, I'm going to try the same cleanup/restart as I did Wednesday night. But it looks like the upgrade didn't fix the problem.


> Could you please take a look at the netapp logs to confirm the connections from
> the VM hosts are still being lost. Hopefully there will be something obvious
> there. If not, would we have to wait until Tuesday to get hold of Netapp and/or
> VMware ? We might be able to live with that given that most of the trees are
> effectively closed at the moment but a potential Firefox 3.0 RC2 looms early
> next week.

Note: People are trying to land the handful of approved patches, in advance of possible FF3.0rc2 builds on Tuesday morning. What else can we do to raise the priority of this problem with netapp/vmware? 
On each of balsa-18branch, production-prometheus-vm, and production-pacifica-vm, I've done the same steps as in comment#18:

- stop multi tinderbox
- renamed existing local cvs mozilla tree /builds/tinderbox/.../mozilla
to /builds/tinderbox/.../mozilla.bad, 
- start multi tinderbox

After restart, balsa-18branch, production-pacifica-vm are taking longer to fail out, which is I guess progress. 

After restart, production-prometheus-vm failed out, same as before. Found that the file /usr/include/bits/sigaction.h was completely corrupted, causing syntax errors whenever it was included in mozilla code. Replacing it with an undamaged copy from staging-prometheus-vm, and looking for other corruption.
balsa-18branch, production-pacifica-vm both now green. :-)

On production-prometheus-vm, in /usr/include/bits, I renamed sigaction.h to sigaction.h.bad, and then copied over sigaction.h from staging-prometheus-vm. I then diff'd the files and saw they were both *fine*, and identical. Original file was no longer corrupted?!?! huh?! Let's see how the next build goes.
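(A hedged sketch of that repair; the cltbld user and the use of scp are assumptions, not details from this bug:)

$ cd /usr/include/bits
$ mv sigaction.h sigaction.h.bad
$ scp cltbld@staging-prometheus-vm:/usr/include/bits/sigaction.h .
$ diff sigaction.h sigaction.h.bad && echo "identical - original no longer corrupt"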
production-prometheus-vm now green.
mrz just pinged me on irc to say that fx-win32-tbox was offline. 

Found no tinderbox running, started it, but watched the build fail out with corrupt cvs files. Same problem again. Cleaned up local files, and restarted tinderbox. 
(In reply to comment #37)
> What else can we do to raise the priority of this problem with netapp/vmware? 
 
Nothing - we have P1 cases with both vendors, working through the weekend.  We'll update here when we have something.

fx-win32-tbox still building, but so far so good. 

balsa-18branch and production-prometheus-vm are burning again. After a couple
of cycles of running green, they hit the same vmware/netapp problem, and are
failing out just like before. :-(
Added console logging to a file on fx-linux-tbox after this box went read-only. For some reason the messages from ext3-FS don't get into /var/log/messages and only appear in the console available through the VI client.

Steps: 
1, in /boot/grub/grub.conf, append "console=ttyS0,115200n8 console=tty0" to the kernel line for 2.6.18-53.1.13.el5
2, Shutdown the VM
3, Edit Settings > Add > Serial Port > To File > [netapp-c-fcal1] fx-linux-tbox/console.log  (would vastly prefer /tmp on the VM host but couldn't figure out how to refer to that)
4, Restart VM

The log is at /vmfs/netapp-c-fcal1/fx-linux-tbox/console.log. There are no timestamps; looks like it's a kernel compile option :-(.
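(A hedged sketch of step 1 as a one-liner; it assumes the stock RHEL/CentOS grub.conf layout, backs the file up first, and should only be run once:)

$ cp /boot/grub/grub.conf /boot/grub/grub.conf.bak
$ sed -i '/kernel .*2\.6\.18-53\.1\.13\.el5/ s/$/ console=ttyS0,115200n8 console=tty0/' /boot/grub/grub.conf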
Summary: Connection problems to netapp-c causing corruption of tinderbox file systems → Connection problems to netapps causing corruption of tinderbox file systems
Both fx-linux-tbox and fx-win32-tbox have been up and down all day, so I think the resets are getting worse. The windows box has taken to rebooting itself (or perhaps that's someone else). This is the tip of the iceberg, but these are two machines that we have to keep running.

I've given them one last go with a reboot, fsck/chkdsk all the partitions, remove objdir, & start tinderbox, but I suspect they'll die/burn/kill-a-kitten after as few as 3 builds. Am I missing something profoundly obvious in resuscitating these machines ? Perhaps disk checks are dangerous when connectivity is intermittent, but I've not seen any of the scsi errors while actually using a linux console.

Can we use the netapp/vmware logs to figure out if a particular box is causing this ? Or if one (or more) is broken and making it worse ? If not, it's time to close the Firefox tree.
(In reply to comment #36)
> Another angle of attack - we started having these problems on Tuesday
[snip]
> Were there any infra changes around then ?
> 
Really good question, Nick. I don't know of any changes.

Justin? mrz?
Status: REOPENED → ASSIGNED
nothing except the ontap upgrade.  still working with netapp/vmware - support is less than stellar given the holiday.  if there are a few critical boxes, they should be storage-vmotioned to netapp-d or -a.  note this will only work for a few as there is not a lot of spare storage.
(In reply to comment #45)
> Can we use the netapp/vmware logs to figure out if a particular box is causing
> this ? Or if one (or more) is broken and making it worse ? If not, it's time to
> close the Firefox tree.

On irc, Nick and I decided we might as well close the tree now. Machines start burning again as fast as we can repair them. Over the last few days, it's been kinda impossible to find a non-burning time to land anyway, so closing the tree seems academically honest. If things stabilize, we'll happily reopen.

Note: this problem is impacting machines on Mozilla1.8, Mozilla1.9 and Mozilla2. I just closed Mozilla1.9 now. I would have had to close Mozilla1.8 and Mozilla2 also, but they were already closed for other reasons. 
so, think I have stabilized the situation for the moment.  we failed over to the secondary head after a lot of diagnosis, so all the arrays are being served off the one head, which is fine as load is not an issue.  we haven't seen the resets/reconnects for over an hour (used to happen every 10 min), so we'll monitor through the night.

action plan for tomorrow is to take a core of the head that was having issues, and diagnose once we are sure the issue is isolated to the single head.  in the meantime, we need normal production load thrown at the array, and I don't have any idea what state the machines are in.

build - can you please bring up all the tinderboxen and monitor for crashes?  please list out which machines crash, if any do. 
still not a single error - need build to verify vm's are up and running with no issues.  please verify asap.
Assignee: justin → joduinn
Status: ASSIGNED → NEW
I'm now home, and will start looking through them all. If I can get them all up and running ok, I'll reopen the tree to see how it goes. Stay tuned...
balsa-18branch
production-prometheus-vm
production-pacifica-vm
fx-linux-tbox 

...were all burning because of the same disk corruption errors from above. I've now cleaned up the filesystems for these 4 VMs, and restarted the slaves. Let's see how these builds go.
These VMs are all now green, so I've reopened the tree. Let's see if the work in comment#49 holds up to the load. (John crosses his fingers)
Starting to bring up other VMs now.
Back in action:
l10n-win32-tbox (also updated vmware tools)
l10n-linux-tbox
tbnewref-win32-tbox
xr-win32-tbox

Checking disks now:
crazyhorse
karma
cerberus-vm
netapp ran clean all night - can you guys confirm no crashes?

now, next step in the investigation, I need to fail back to netapp-c (no downtime), wait for the issue, and get a core so they can figure out what is going on - then we'll go back to a steady state on -d while they analyze.  I know this is quite disruptive, so we can do this either today when one of you has some time to make sure things come back or on Tuesday - let me know what works best for your schedule and the build schedule.

sorry for the hassle here, but I think we are getting close to a resolution.
VMs stayed up and stable overnight, so let's leave well enough alone for now. 
Thanks for all the help over the holiday weekend!

Tomorrow (Tuesday) morning, we'll have a go/nogo decision for rc2. If we do not need rc2, we can move back to netapp-c immediately. If we do need rc2, we'll need to wait until after builds&updates are handed to QA before we can move back to netapp-c, so likely Thursday. 
Per Firefox3 meeting just now, this is on hold until after we produce FF3.0rc2 builds.
Priority: P1 → P3
I stumbled on this while looking for an update on FF3... My org went through a string of very similar issues last September on ESX 3.0.1/3.0.2 connecting to a NetApp.  It affected us using both iSCSI and NFS, from the ESX kernel and from within the VMs.  The problem with 3.0.2 was MUCH worse than with 3.0.1... we've rolled back to 3.0.1.  What version were you at pre-3.5?  I'm 99% sure this is going to turn out to be a problem with VMware and/or other hardware, not the NetApp (we had the same issue with a CoRAID device and with a NetApp at the same time).  The May 8th ESX upgrade & May 9th error messages are most likely not a coincidence.

Be careful putting your investigations on hold for a long period of time with the VM's running.  Disk errors at the VM level _can_ end up corrupting the VM's OS--sometimes to the point where the OS can't boot anymore (particularly if you're dropping iSCSI connections altogether).

A few things to look for:

- Timing: clock drift, NTP problems, kerberos problems (can impact iSCSI connections), incorrect time/date etc. on VM's and on ESX servers.

- Network: 
  . any recent changes? upgrades? re-configurations? firewall policies?
  . switch/router health/load
  . MTU settings/capabilities
  . duplex settings
  . cables (don't overlook this)
  . trace routes from VMs to NetApp, from ESX to NetApp, etc.  Anything unusual?

- Web servers: incomplete/truncated downloads (particularly on larger files or when the network is under heavy load).

- DB Servers: check for anything unusual in the DB logs.

- ESX Servers:
  . try copying a large file from one of the iSCSI mount points to another location a few times (e.g., a large .vmdk file). Look for messages like "cp: reading `vm_name-flat.vmdk': Input/output error"
  . how and where are you mounting your iSCSI connections?  With VMware's software iSCSI Initiator?  hardware initiator?  Within VM's?

- Linux VMs:
  . filesystems unmounting and re-mounting as read-only?
  . files disappearing/re-appearing/changing sizes sporadically, particularly on network-based file systems (NFS in our case). 

- Windows VMs (Event Log Viewer):
  . symmpi:15:"The device, \Device\Scsi\symmpi1, is not ready for access yet." 
  . Disk:11:"The driver detected a controller error on \Device\Harddisk0."
  . Application Popup:333:"An I/O operation initiated by the Registry failed unrecoverably. The Registry could not read in, or write out, or flush, one of the files that contain the system's image of the Registry."
  . iScsiPrt:7:"The initiator could not send an iSCSI PDU. Error status is given in the dump data." (if you are making iSCSI connections from within a VM)
  . iScsiPrt:20:Connection to the target was lost.  The initiator will attempt to retry the connection.

Our attempted solutions:
- Rollback to 3.0.1
- Test/replace network cables
- Upgrade network equipment (jumbo frame support, dedicated gigabit ports)
- Completely isolate storage-related networks on their own dedicated switches (we're still in the process of doing this)
- Switch to fiber (ideally) or NFS from iSCSI where possible.

We're not really sure why the problem went away.  We fixed as many things as we could identify as possible causes.  The biggest impact we could observe was when we rolled back to ESX 3.0.1.

Hope this helps & sorry to read this.  I'm curious to see how things work out for you... Hopefully it won't be as painful an ordeal for you as it was for us.
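(A hedged sketch of the ESX copy test suggested above, run from the service console; the datastore and vmdk names here are placeholders, not machines from this bug:)

$ SRC=/vmfs/volumes/netapp-c-fcal1/example-vm/example-vm-flat.vmdk
$ for i in 1 2 3; do
>   cp "$SRC" /vmfs/volumes/other-datastore/copytest.$i.vmdk || echo "copy $i failed"
> done
$ # watch /var/log/vmkernel and dmesg for I/O errors while the copies run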
(In reply to comment #59)
> Per Firefox3 meeting just now, this is on hold until after we produce FF3.0rc2
> builds.
> 
ok, builds and updates are now generated. Justin, sorry today's release took longer than expected; it's now ok to try those NetApp diagnostic tests.
Assignee: joduinn → justin
(In reply to comment #60)
hi Scott;

> I stumbled on this while looking for an update on FF3... My org went through a
> string of very similar issues...
[snip]
> Be careful putting your investigations on hold for a long period of time with
> the VM's running.  Disk errors at the VM level _can_ end up corrupting the VM's
> OS--sometimes to the point where the OS can't boot anymore (particularly if
> you're dropping iSCSI connections altogether).
Once we got things stable over the weekend, we decided to hold off on investigations until this evening, because it was being so disruptive to the VMs on tinderbox... hence disruptive to developers trying to land the last few patches for FF3.0rc2.

Now that FF3.0rc2 is fully built and handed off to QA, we're back looking at this again. It's quite a worry.


> A few things to look for:
[snip]
> - Linux VMs:
>   . filesystems unmounting and re-mounting as read-only?
>   . files disappearing/re-appearing/changing sizes sporadically, particularly
> on network-based file systems (NFS in our case). 
We have been hit by this for a while now. The VMware response was to have us do kernel updates. See bug#407796 for details. That might or might not have solved the intermittent problem; hard to tell with the recent mischief from NetApp.

[snip]
> Hope this helps & sorry to read this.  I'm curious to see how things work out
> for you... Hopefully it won't be as painful an ordeal for you as it was for us.
Your comments were very helpful, thanks for all the info. Sounds like an ugly experience, I also hope its not as painful for us!! 

tc
John.
Priority: P3 → P1
Justin, what's the current status on this ? Is netapp-c in normal mode or failed over to -d at the moment ? We're seeing VMs with problems over the last day or so, eg bug 407796 comment #86 and #87 but also some compile errors which look like corruption.
Escalated this more and got more information.  Netapp believes it's an issue with the LUN.  Current plan of action is to storage vmotion all VMs off a LUN (without downtime) to another array we have, rebuild the LUN, then move the VMs back.  Storage vmotion takes some time so this is a slow process.  If there are any VMs that could stand an hour of downtime, please let us know so we can move them faster.

Given the issue is with the LUN itself, netapp seems confident failing over to netapp-d won't solve the issues (as we still saw them in the failed-over state).  I'll ping you guys today on IRC to go over the plan more.
Approx 10.05 this morning, staging-pacifica-vm02 hit:

remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process 

...which looks like it lost connection to staging-1.8-master. The next few attempts to build failed out with different errors.

By approx 10.30 this morning, staging-pacifica-vm02 was back running normally again. 

Is either of these VMs on the suspect LUN?
Yes.  Any VM with a datastore name containing "netapp" in it is.
Discovered qm-centos5-moz2-01 down just now, with a linux scsi kernel panic. Unclear how long it's been down.

Justin restarted VM.
qm-win2k3-moz2-01 was broken, first with:

remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process

...and then with file permission errors trying to delete files, causing all builds to fail. 

I've restarted the VM.
qm-win2k3-moz2-01 = netapp-c-fcal1, expected.  Thanks John.
Looks like prod-pacifica-vm is hitting this again?
prod-pacifica-vm, balsa-18branch and crazyhorse all hit by continued NetApp woes. 

I'll start to repair and restart these VMs right now, but unclear how long they will remain up before being hit by this problem again.
(In reply to comment #74)
> prod-pacifica-vm, balsa-18branch and crazyhorse all hit by continued NetApp
> woes. 
> 
> I'll start to repair and restart these VMs right now, but unclear how long they
> will remain up before being hit by this problem again.
> 

We are blocked on making progress on the core netapp issue as we can't move the last two machines on the shelf per build.  On hold till Monday.
prod-pacifica-vm and balsa-18branch now repaired and restarted. They're past where they were failing before, and I hope are now ok. I'll keep watching over the next few hours.

crazyhorse is still going through fsck problems with corrupted superblocks.
(In reply to comment #75)
> (In reply to comment #74)
> > prod-pacifica-vm, balsa-18branch and crazyhorse all hit by continued NetApp
> > woes. 
> > 
> > I'll start to repair and restart these VMs right now, but unclear how long they
> > will remain up before being hit by this problem again.
> > 
> 
> We are blocked on making progress on the core netapp issue as we can't move the
> last two machines on the shelf per build.  On hold till Monday.

Errr... I thought the only VMs not being moved were the buildbot masters controlling all the talos and unittest machines. 

What happened to the other VMs that were in transition Friday - are they now moved? And what shelf are these failing slaves (crazyhorse, prod-pacifica-vm and balsa-18branch) on - and can these be moved?
balsa-18branch failed out again. It made it further than comment#74, but failed out with similar errors. While repairing again, I hit this weirdness:


$ pwd
/builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend
$ mv mozilla mozilla.bad.002 
$ mkdir mozilla       
mkdir: cannot create directory `mozilla': File exists
$
$ ls -la 
...
drwxrwxrwx   11 cltbld   users        4096 Jun  7 19:38 mozilla
drwxrwxrwx   53 cltbld   users        4096 Jun  7 18:46 mozilla.bad.001
drwxr-xr-x   58 cltbld   users        4096 Jun  7 19:38 mozilla.bad.002
...

After 30+ seconds the "mozilla" directory disappeared?!?! Since when are "mv" operations not atomic?


I've now re-rebooted balsa-18branch and re-restarted tinderbox on it. Fingers crossed.
crazyhorse is now up and building. It took a few fsck-with-repairs and reboot cycles by Trevor, and also two fsck-with-repairs and reboot cycles by me. 

Build started, let's see how it goes. Fingers crossed.
> Errr... I thought the only VMs not being moved were the buildbot masters
> controlling all the talos and unittest machines. 
> 
> What happened to the other VMs that were in transition Friday - are they now
> moved? And what shelf are these failing slaves (crazyhorse, prod-pacifica-vm
> and balsa-18branch) on - and can these be moved?

They were all done Saturday morning - just waiting on those last two to re-build the LUN. 

production-prometheus-vm.build.m.o started failing out at 00:07 with file access problems and then cvs refresh problems, one of the common symptoms we've seen recently with this NetApp issue.

Without any action from me, production-prometheus-vm is now passing the point where it would usually fail. It looks like it might have self-recovered somehow, so I'm not going to kill/restart it.
balsa-18branch, crazyhorse, production-prometheus-vm, and prod-pacifica-vm are all red right now. Three of those build nightlies.
(In reply to comment #80)
> > Errr... I thought the only VMs not being moved were the buildbot masters
> > controlling all the talos and unittest machines. 
> > 
> > What happened to the other VMs that were in transition Friday - are they now
> > moved? 
> They were all done Saturday morning - just waiting on those last two to
> re-build the LUN. 
Great. 

> > And what shelf are these failing slaves (crazyhorse, prod-pacifica-vm
> > and balsa-18branch) on - and can these be moved?
These failed again. Question remains... can these be moved also or do we have to leave them where they are until after LUN repair?
(In reply to comment #83)
> These failed again. Question remains... can these be moved also or do we have
> to leave them where they are until after LUN repair?

If that question is for me, my question back would be: How long of a down time are we talking about to move them? If it's less than six or so hours, move them now so we can get nightlies tomorrow and hopefully give a go to release.
(In reply to comment #85)
> *** Bug 437893 has been marked as a duplicate of this bug. ***
Loss of qm-rhel02 in bug#437893 caused Talos machines to go offline, and close moz2 and mozilla1.9 branches. 


(In reply to comment #84)
> (In reply to comment #83)
> > These failed again. Question remains... can these be moved also or do we have
> > to leave them where they are until after LUN repair?
> 
> If that question is for me, my question back would be: How long of a down time
> are we talking about to move them? If it's less than six or so hours, move them
> now so we can get nightlies tomorrow and hopefully give a go to release.
Question was for Justin. 

I've since talked to Justin on phone. Once he's finished moving the masters (above), he'll start moving these slaves also. Because of the sick LUN, there are limits on how many VMs he can migrate at a time. Justin will update this bug with info as the VMs are migrated.
Specifically, qm-rhel02 and production-master are both now down, so they can be moved to a healthy NetApp.
both of these have been moved - rebuilding the shelf now.  After this, one more to go.
Fixed up balsa-18branch this morning, crazyhorse has a fsck in progress. 

Attempted to move production-pacifica-vm to d-sata-build-003 but it failed with "The virtual disk is either corrupted or not a supported format." Ran chkdsk and got the same error again after getting to a few tens of percent. Is this really a network error during the copy ?

Also attempted to "Complete migration" on production-prometheus-vm; it failed out with similar messages to Administrator's attempts (=justin?).
Blocks: 437257
We just powered off all VMs on netapp-c-fcal, see list below. The trees are closed anyway because of failing VMs, so this seems worth trying. In theory this should make it faster to move them off this LUN. 

 bm-symbolfetch01
 bm-win2k3-pgo01
 build-console
 fx-linux-tbox
 fx-win32-tbox
 l10n-win32-tbox
 mobile-linux-slave1
 moz2-master
 moz2-win32-slave1
 production-pacifica-vm
 production-prometheus-vm
 production-trunk-automation
 qm-centos5-moz2-01
 qm-moz2-unittest01
 qm-win2k3-moz2-01
 qm-win2k3-stage-pgo01
 staging-build-console
 staging-pacifica-vm
 staging-trunk-automation
 tbnewref-win32-tbox
 try2-linux-slave
 try2-win32-slave 
Justin and mrz think they've nailed it (as of 11pm-ish). A few remaining VMs still need to be moved from netapp-c-fcal1:
l10n-win32-tbox
moz2-win32-slave1
qm-win2k3-stage-pgo01
staging-pacifica-vm
staging-1.9-master
tbnewref-win32-tbox
try2-linux-slave
try2-win32-slave

There are known issues with karma and prometheus that need to be fixed (from irc with justin).

The production-1.9-master and fx-linux-1.9-slave2 VMs are both confirmed ok, and now in use.
We'll now start bringing back up all the various build/unittest/talos machines, repairing corrupted file-systems as needed. Depending on how many VMs need how much repair, we expect to start reopening the 1.8/1.9/moz2 trees sometime Tuesday. 

Watch this space.
(In reply to comment #91)
> Justin and mrz think they've nailed it (as of 11pm-ish). A few remaining VMs
> still need to be moved from netapp-c-fcal1:
> l10n-win32-tbox
> moz2-win32-slave1
> qm-win2k3-stage-pgo01
> staging-pacifica-vm
> staging-1.9-master
> tbnewref-win32-tbox
> try2-linux-slave
> try2-win32-slave
> 

Looks like these have all been moved. l10n-win32-tbox is still shutdown. Can we start it again?
tb-linux-tbox is in a kernel panic right now. It's still at least partly on netapp-c-001 so I'll wait for an OK before restarting it.
The only remaining misconfigured LUN is netapp-c-fcal1.  netapp-c-001 should be fine.

Oh, and it's only on netapp-c-001 in the sense that its cdrom drive is some ISO image off that datastore which I can't seem to change while the host is running (or perhaps because vmware tools isn't yet running).
At this point, all machines/VMs are back up, and staying up, *except*:

1.9 and moz2
============
qm-rhel02 (runs talos, unittest)

1.8
===
crazyhorse
karma
production-prometheus-vm

We're making progress, but will continue to keep all these trees closed for now. As we get VMs repaired, we'll update this bug. 
We seem to have sufficient box coverage on the Mozilla2 tinderbox now. Do we need to keep the tree closed?
It is of a lesser priority since it is a staging machine, but I'm still missing qm-buildbot01.
(In reply to comment #96)
> At this point, all machines/VMs are back up, and staying up, *except*:
> 
> 1.9 and moz2
> ============
> qm-rhel02 (runs talos, unittest)

...has been working without going read-only since mid-morning. Looks like the
recurring read-only-ness might also be resolved by last night's fix? We've let a few unittest and talos runs go through to verify that all are connected and working again. It all looks good, so I've now reopened the tree for mozilla-central.


Work continues on repairing the VMs needed for 1.8.
(In reply to comment #98)
> It is of a lesser priority since it is a staging machine, but I'm still missing
> qm-buildbot01.

Totally left off the RADAR - it's on a classic QA ESX server.  Might as well take the opportunity to move the VM over to Build land.
I've disabled qm-centos5-01 from the tinderbox waterfall (Firefox) because it's misbehaving. I was unable to resurrect it after the event this morning.
(In reply to comment #100)
> (In reply to comment #98)
> > It is of a lesser priority since it is a staging machine, but I'm still missing
> > qm-buildbot01.
> 
> Totally left off the RADAR - it's on a classic QA ESX server.  Might as well
> take the opportunity to move the VM over to Build land.

It lives on bm-vmware02 now and it's up.
(In reply to comment #96)
> 1.8
> ===
> crazyhorse
tracked in bug#437798

> karma
> production-prometheus-vm
tracked in bug#438386


(In reply to comment #101)
> I've disabled qm-centos5-01 from the tinderbox waterfall (Firefox) because it's
> misbehaving. I was unable to resurrect it after the event this morning.
Is qm-centos5-01 still a problem? Can it be repaired or do we need to create new clone?
(In reply to comment #103)
> (In reply to comment #101)
> > I've disabled qm-centos5-01 from the tinderbox waterfall (Firefox) because it's
> > misbehaving. I was unable to resurrect it after the event this morning.
> Is qm-centos5-01 still a problem? Can it be repaired or do we need to create
> new clone?

Based on my efforts with it yesterday, I think it will need to be rebuilt. Maybe mrz can resurrect it where I have failed though.
qm-vista02 on qm-vmware01 is dead in the water. Unable to power it on or access it via the VMWare console or RDP.
Does it say (inaccessible) after it in Virtual Infrastructure client?  Try to remove it from inventory, rename the directory the VM files are in, and manually re-add it to the inventory.  If it won't re-add, see if you can recover or move the VM's folder to another location and re-add it from a different ESX server/cluster.  You're probably in better shape if it won't even start up than you are if the VM starts with BIOS or OS errors about the disk.
https://bugzilla.mozilla.org/show_bug.cgi?id=438664

looking to re-image qm-centos5-02 as it is being erratic and not building well at all.
(In reply to comment #105)
> qm-vista02 on qm-vmware01 is dead in the water. Unable to power it on or access
> it via the VMWare console or RDP.

thought that was in another bug - should be up.
No longer blocks: 438664
Depends on: 438664
Depends on: 437798
karma is up and running again
From bugs and irc, here's the list of VMs reported as problems and being repaired today. Most are now fixed, and confirmed working ok. 


[ ] crazyhorse
[+] karma
[+] production-prometheus-vm
[+] qm-buildbot01
[+] qm-vista02
[+] qm-centos5-01
[ ] qm-centos5-02

We just have qm-centos5-02 and crazyhorse being worked on.
crazyhorse is working.

[+] crazyhorse
[+] karma
[+] production-prometheus-vm
[+] qm-buildbot01
[+] qm-vista02
[+] qm-centos5-01
[ ] qm-centos5-02

qm-centos5-02 is still being worked on, it looks like an application problem, but still investigating.
qm-centos5-02 is working.

[+] crazyhorse
[+] karma
[+] production-prometheus-vm
[+] qm-buildbot01
[+] qm-vista02
[+] qm-centos5-01
[+] qm-centos5-02

Closing!
Status: NEW → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
For the past ~24 hours, there seem to have been issues with qm-win2k3-pgo01 (bug 440531) and qm-xserve01 (bug 440536).

I just filed those two bugs for these issues, but I'm mentioning it here as well, per this message on the Tinderbox page:
> ... but if you see weirdness, not caused by checkins
> please add details to bug bug 435134 .

xserve01 is a physical machine, so I doubt that one is related...
Product: mozilla.org → mozilla.org Graveyard