Closed Bug 435134 Opened 16 years ago Closed 16 years ago

Connection problems to netapps causing corruption of tinderbox file systems

Categories

(mozilla.org Graveyard :: Server Operations, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: justin)

References

Details

Attachments

(2 files)

On http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla1.8, the following machines are burning. Interestingly, they all started burning around 05:57am-06:10am this morning. Even though they all give different errors, the timing makes me wonder if these machines are failing for a related reason?

 balsa-18branch
  http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211416560.1211416689.25547.gz&fulltext=1 ends with:
  cvs checkout: cannot open directory /cvsroot/:ext:ffxbld@cvs.mozilla.org:/cvs: No such file or directory
  cvs checkout: skipping directory mozilla/extensions/webdav/tests


 crazyhorse 
  http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211415900.1211416021.23668.gz&fulltext=1 ends with:
   cc1plus: internal compiler error: Segmentation fault


 prod-pacifica-vm
  http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211415600.1211416259.24280.gz&fulltext=1 ends with:
  ../../dist/lib/gkxtfbase_s.lib : fatal error LNK1136: invalid or corrupt file
Severity: normal → blocker
Priority: -- → P1
To add to the fun of "random" failures, crazyhorse actually had five failures earlier, from 02:17 to 03:03, claiming rather massive CVS troubles with its tree, though the nightly clobber cleared it up, and prod-pacifica-vm failed on its first try at a nightly, dying with "make[5]: *** read jobs pipe: No such file or directory.  Stop.", either or both of which may or may not be connected.
production-prometheus-vm just started burning with the following error:

lib/libxulapp_s.a  -L../../dist/lib -lmozpng -L../../dist/lib -lmozjpeg -L../../dist/lib -lmozz  -L-L../../dist/bin -L../../dist/lib -lcrmf -lsmime3 -lssl3 -lnss3 -lsoftokn3   -lmozcairo -lmozlibpixman   -L/usr/X11R6/lib -lXrender -lX11  -lfontconfig -lfreetype -L/usr/X11R6/lib -lXt -L/usr/X11R6/lib   -lXft -lX11 -lfreetype -lXrender -lfontconfig   -L../../dist/lib -lxpcom_compat    
../../dist/lib/components/libgklayout.a: could not read symbols: Memory exhausted
collect2: ld returned 1 exit status


Summary: 3 machines start burning at same time on 1.8 waterfall page → 4 machines start burning at same time on 1.8 waterfall page
I note there have been no checkins for days on that branch, so it's not clear what is causing all these machines to start burning now.
Argh... and a bunch of talos machines off also? Is this bug a dup of bug#435052?
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → justin
Assignee: server-ops → mrz
Not entirely sure what the fix action is here.  I don't see anything network-wise to indicate a problem.

What else would cause this error?

  cvs checkout: cannot open directory
/cvsroot/:ext:ffxbld@cvs.mozilla.org:/cvs: No such file or directory
  cvs checkout: skipping directory mozilla/extensions/webdav/test

Does that mean cvs.mozilla.org couldn't find the file to serve?
I see nothing in these errors to indicate it has anything to do with the network.  Bug 435052, on the other hand, clearly points to a possible network hiccup, but at totally different times and unrelated (I think).  Can this go back to releng due to the lack of anything pointing at the network layer?
Note: I've tried restarting tinderbox on crazyhorse around 8pm. It's done a few builds at this point, and is still burning.
kicking back to releng for troubleshooting.  feel free to pull us in if you have any evidence this might be infra related.
Assignee: mrz → joduinn
Component: Server Operations → Release Engineering
so I may be wrong here - per John, all of these are vm's with network storage and all had issues at the same time.  given that, this may very well be a network issue if the storage fell out from under the host.  still grasping for a reason or direction to diagnose though.
Assignee: joduinn → mrz
Component: Release Engineering → Server Operations
The cvs error is the only one that stands out as not local to the host - the others could all be SAN related (corrupt file, memory exhaustion (if it's the page file?)).  But I don't see anything on any switches or any switch ports going to any of the filers that shows anything alarming.

Still looking...
fyi, all of these VMs are on netapp-c.
I've just rebooted crazyhorse VM. It's booting and rechecking disks (very slow!). Maybe this reboot and a clobber build will do the trick?
mrz just notified me that fx-linux-tbox is also hung, trying to do a build since 16:04 this afternoon.
Dunno if it's related or not, but qm-centos5-01, qm-centos5-03, qm-xserve01, and (a bit less likely) qm-win2k3-pgo01 are failing in what doesn't look to me like a familiar way - the Linux and Mac ones since late morning, the Win-pgo... hard to tell, since it takes forever, and seems to have done one burning build and one "green" with no results build today.
(In reply to comment #12)
> I've just rebooted crazyhorse VM. It's booting and rechecking disks (very
> slow!). Maybe this reboot and a clobber build will do the trick?
> 
Justin pushed crazyhorse through two unhappy fscks while I was driving home.
Crazyhorse is now cleanly rebooted. I've restarted tinderbox, but don't have
privs to trigger a clobber build. Let's see if the depend build goes ok on crazyhorse.
(In reply to comment #13)
> mrz just notified me that fx-linux-tbox is also hung, trying to do a build
> since 16:04 this afternoon.
> 
After mrz killed two "Bug Buddy" instances and rebooted the VM, fx-linux-tbox is green again.
(In reply to comment #15)
> (In reply to comment #12)
> > I've just rebooted crazyhorse VM. It's booting and rechecking disks (very
> > slow!). Maybe this reboot and a clobber build will do the trick?
> > 
> Justin pushed crazyhorse through two unhappy fscks while I was driving home.
> Crazyhorse is now cleanly rebooted. I've restarted tinderbox, but don't have
> privs to trigger a clobber build. Let's see if the depend build goes ok on crazyhorse.
> 
Success - crazyhorse just turned green!
After the initial errors balsa-18branch hit in comment#0, balsa-18branch continues to fail, but now fails with:

C mozilla/intl/unicharutil/util/.cvsignore
C mozilla/intl/unicharutil/util/Makefile.in
C mozilla/intl/unicharutil/util/nsCompressedCharMap.cpp
C mozilla/intl/unicharutil/util/nsCompressedCharMap.h
C mozilla/intl/unicharutil/util/nsUnicharUtils.cpp
C mozilla/intl/unicharutil/util/nsUnicharUtils.h
cvs [checkout aborted]: /case.dat/1.1/Fri Jan  8 00:19:24 199/CVSROOT: No such file or directory
gmake: *** Conflicts during checkout.
C mozilla/intl/unicharutil/util/.cvsignore
C mozilla/intl/unicharutil/util/Makefile.in
C mozilla/intl/unicharutil/util/nsCompressedCharMap.cpp
C mozilla/intl/unicharutil/util/nsCompressedCharMap.h
C mozilla/intl/unicharutil/util/nsUnicharUtils.cpp
C mozilla/intl/unicharutil/util/nsUnicharUtils.h
gmake[1]: *** [real_checkout] Error 1
gmake[1]: Leaving directory `/builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend'
gmake: *** [checkout] Error 2
Error: CVS checkout failed.


I did the following:
- stop multi tinderbox
- renamed /builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend/mozilla to /builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend/mozilla.bad, 
- start multi tinderbox
...and it's taking longer to fail out, which is I guess progress. Stay tuned.
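(For reference, a minimal sketch of that cleanup as shell commands; the path comes from the log above, but the exact way multi tinderbox is stopped and started isn't shown in this bug, so those steps are left as comments rather than real invocations:)

$ # stop the multi tinderbox client first (exact command depends on how it was started)
$ cd /builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend
$ mv mozilla mozilla.bad    # keep the corrupt checkout around for later inspection
$ # start multi tinderbox again; the next cycle does a fresh CVS checkout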
Success - balsa-18branch is now green.
Here are the scsi errors from the system log of production-prometheus-vm, with the dhclient lines removed. Running fsck on it now, which is finding lots of things to fix.
I think these machines are all fixed up now, even if we don't know the cause. Some details:

production-prometheus-vm (netapp-c-fcal1/bm-vmware11, vmware tools out of date)
* required a bunch of fsck runs and reboots to get the partitions fixed up, and vmware tools updated

production-pacifica-vm (netapp-c-fcal1/bm-vmware11, vmware tools out of date)
* used |chkdsk C:\ /F| to get a check on reboot, no errors found

balsa-18branch (netapp-c-001/bm-vmware07, vmware tools out of date)
* updated vmware tools (switched from rpm to tar.gz install); the reboot from that started running fsck for days-since-last-check
* when restarting this box, you have to open a console using the VI client and run ~/start-X-repeatedly.sh (otherwise it fails the tinderbox tests)

crazyhorse (netapp-c-001/bm-vmware09, vmware tools out of date)
* updated vmware tools automatically, rebooted

We also have a problem with:
moz2-win32-slave (netapp-c-fcal1/bm-vmware06, vmware tools out of date)
* checked the C, D, and E disks
* updated vmware tools


 
(In reply to comment #21)
> We also have a problem with:

This should say "had".
The log isn't super verbose and this is mainly here for timing. The disk error is 
  "The driver detected a controller error on \Device\Harddisk0" 
The symmpi one is
  "The device, \Device\Scsi\symmpi1, is not ready for access yet"
Random guessing time -

Could be a transient network fault, as we guessed yesterday. Or it could be something in the netapp itself. The machines here all seem to be on netapp-c-001 and netapp-c-fcal1 - can I read that as different shelves on the netapp-c backplane/device/host ? Is there anything in the logs for that netapp which indicates a problem ? (eg did some drives die and it's coping as best it can ?)

I had a quick look at /var/log/messages on linux boxes using netapp-d-001 (l10n-linux-tbox, prometheus-vm, moz2-linux-slave1) and don't find any scsi errors like attachment 322081 [details]. I do see them for karma, which is on netapp-c-001 and had a couple of red builds yesterday.

There are lots of scsi messages in bm-vmware09:/var/log/{vmkernel,vmkwarning}, which seem to start on May 9 - possible fallout from the ESX3.5 upgrade on May8 or just a change in logging messages ?

The VI client reports some inaccessible VMs on bm-vmware08:
* bm-symbolfetch01 (netapp-c-fcal1) - should have been on
* try2-linux-slave, try2-win32-slave (netapp-c-fcal1) - should be on, can see other try VM's so probably not a network sandbox thing
* backup-pacifica-vm, backup-prometheus-vm (netapp-c-001) - would have been off
* tb-win32-tbox (netapp-c-001) - would have been off

Let us know if there's anything we can do to help diagnose.
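(The quick look at /var/log/messages above amounts to something like this on each Linux VM; a hedged sketch, not necessarily the exact command used:)

$ grep -iE 'scsi|i/o error' /var/log/messages | tail -n 50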
Really helpful Nick - thanks.  All signs point to a network issue, as we saw this + the disconnects in the morning.  The netapp is just exposing a block device so any corruption would most likely happen as a result of the host losing network access to the netapp, and having issues.  I'll talk with mrz more tomorrow morning and dig into the netapp logs to see if I can find anything else...
from the netapp logs - starting on May 20th, we start seeing a lot of reconnects and abort tasks from the vmware hosts:

Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware09) sent LUN Reset request, aborting all SCSI commands on lun 2
Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware05) sent LUN Reset request, aborting all SCSI commands on lun 1
Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware07) sent LUN Reset request, aborting all SCSI commands on lun 1

Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:pm-vmware01 at IP addr 10.253.0.92
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:bm-vmware10 at IP addr 10.253.0.229
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:bm-vmware03 at IP addr 10.253.0.223
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:bm-vmware06 at IP addr 10.253.0.237
Netapp P1 Case #3101612 opened...
We had some more issues with machines on netapp-c today, slightly different symptoms but most likely related to this (I'm just making a note of it really, while we wait for Netapp to respond).

fx-linux-tbox - tinderbox crashed out at some point after 20:15 PDT yesterday. The symlink from /bin/sh to bash was broken, but was fixed after a reboot (no disk check was done).

balsa-18branch -  stopped cycling at 20:25 PDT yesterday, multi-tinderbox was running but wasn't doing anything. Nothing in /var/log/messages since May 18 (odd given the reboots it had thursday). Rebooted it, it didn't do a disk check, started tinderbox, cycled OK.

karma - CVS conflicts from 20:09 PDT. Rebooted, fsck triggered, still in progress ...
sm-try2-win32-slave has had some trouble too. Lots of errors like this in the system log:
The driver detected a controller error on \Device\Harddisk0.

I forced a check on all of the disks and rebooted. The problem appears to be gone now.
yea - I saw this issue while doing some troubleshooting and I think it's definitely related to netapp-c or its network connection.  I took traces and they have a lot of diag data now.  It's a P1 case with them - hoping to have something back today.
Assignee: mrz → justin
fyi, also trying to find out if I can increase the iscsi timeout on ESX to help cope in the interim.
Updating summary
Summary: 4 machines start burning at same time on 1.8 waterfall page → Connection problems to netapp-c causing corruption of tinderbox file systems
netapp pointed us at a bug in 7.2.2 - upgraded to 7.2.4.  fixed.  Bug details:


Bug ID: 226424
Title: WAFL mismanagement of network buffers results in poor filer performance.
Bug Severity: 2 - System barely usable
Bug Status: Fixed
Product: Data ONTAP
Bug Type: WAFL
Description:

 A filer may exhibit poor performance due to WAFL holding on to too many network
 buffers and not releasing them in a timely fashion. This situation can result
 in the unavailability of these buffers at the network interface, causing the filer
 to drop packets. Clients must then retransmit the packets until they are accepted,
 causing slower access to the filer.

Fixed-In Version:

    * Data ONTAP 7.2.4 (GD) - Fixed
    * Data ONTAP 7.2.4L1 (GD) - Fixed
    * Data ONTAP 7.3RC1 (RC) - Fixed

Related Bugs: 145410, 225471
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Sorry to say that we're still having problems with the tinderbox. 

For example tbnewref-win32-tbox on netapp-c-fcal1/bm-vmware06, which started burning at 2008/05/23 21:28 and today at 13:46. Both times were CVS conflicts when there were no checkins to cause them. After the first one I ran chkdsk on all the disks, updated vmware tools, and the next build was a clobber for the nightly; so either chkdsk sucks (plausible :-)) or there is repeated corruption occurring.

Also had a kernel panic on fx-linux-tbox (netapp-c-fcal1) sometime after 7:45 today. It got a fsck on each disk and a reboot, and still managed to burn around 11:30 after no checkins. Also had panics on l10n-linux-tbox (netapp-d-002) - will try to get a screen grab of the trace if it happens again (or we could add a serial port to the VM and do console logging).

On the Mozilla1.8, the balsa-18branch (netapp-c-001) and production-prometheus-vm machines (netapp-c-fcal1) are troubled.

Could you please take a look at the netapp logs to confirm the connections from the VM hosts are still being lost. Hopefully there will be something obvious there. If not, would we have to wait until Tuesday to get hold of Netapp and/or VMware ? We might be able to live with that given that most of the trees are effectively closed at the moment but a potential Firefox 3.0 RC2 looms early next week.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
seeing the reconnects only on netapp-c now, not on -d.  Talking with netapp...
Another angle of attack - we started having these problems on Tuesday

2008/05/20 17:52 PDT - balsa-18branch - false compile fail - http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211331120.1211331943.21904.gz

2008/05/21 02:17 PDT - crazyhorse - http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211361420.1211362134.7330.gz

Were there any infra changes around then ?
(In reply to comment #34)
> On the Mozilla1.8, the balsa-18branch (netapp-c-001) and
> production-prometheus-vm machines (netapp-c-fcal1) are troubled.
From looking at the logs, both of these machines are failing out just like they were earlier this week. For now, I'm going to try the same cleanup/restart as I did Wednesday night. But it looks like the upgrade didn't fix the problem.


> Could you please take a look at the netapp logs to confirm the connections from
> the VM hosts are still being lost. Hopefully there will be something obvious
> there. If not, would we have to wait until Tuesday to get hold of Netapp and/or
> VMware ? We might be able to live with that given that most of the trees are
> effectively closed at the moment but a potential Firefox 3.0 RC2 looms early
> next week.

Note: People are trying to land the handful of approved patches, in advance of possible FF3.0rc2 builds on Tuesday morning. What else can we do to raise the priority of this problem with netapp/vmware? 
On each of balsa-18branch, production-prometheus-vm, and production-pacifica-vm, I've done the same steps as in comment#18:

- stop multi tinderbox
- renamed existing local cvs mozilla tree /builds/tinderbox/.../mozilla
to /builds/tinderbox/.../mozilla.bad, 
- start multi tinderbox

After restart, balsa-18branch, production-pacifica-vm are taking longer to fail out, which is I guess progress. 

After restart, production-prometheus-vm failed out, same as before. Found that the file /usr/include/bits/sigaction.h was completely corrupted, causing syntax errors whenever it was included in mozilla code. Replacing it with an undamaged copy from staging-prometheus-vm, and looking for other corruption.
balsa-18branch, production-pacifica-vm both now green. :-)

On production-prometheus-vm, in /usr/include/bits, I renamed sigaction.h to sigaction.h.bad, and then copied over sigaction.h from staging-prometheus-vm. I then diff'd the files and saw they were both *fine*, and identical. Original file was no longer corrupted?!?! huh?! Let's see how the next build goes.
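(A hedged sketch of that repair; the cltbld user and the use of scp are assumptions, not details from this bug:)

$ cd /usr/include/bits
$ mv sigaction.h sigaction.h.bad
$ scp cltbld@staging-prometheus-vm:/usr/include/bits/sigaction.h .
$ diff sigaction.h sigaction.h.bad && echo "identical - original no longer corrupt"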
production-prometheus-vm now green.
mrz just pinged me on irc to say that fx-win32-tbox was offline. 

Found no tinderbox running, started it, but watched the build fail out with corrupt cvs files. Same problem again. Cleaned up local files, and restarted tinderbox. 
(In reply to comment #37)
> What else can we do to raise the priority of this problem with netapp/vmware? 
 
Nothing - we have P1 cases with both vendors, working through the weekend.  We'll update here when we have something.

fx-win32-tbox still building, but so far so good. 

balsa-18branch and production-prometheus-vm are burning again. After a couple
of cycles of running green, they hit the same vmware/netapp problem, and are
failing out just like before. :-(
Added console logging to a file on fx-linux-tbox after this box went read-only. For some reason the messages from ext3-FS don't get into /var/log/messages and only appear in the console available through the VI client.

Steps: 
1, in /boot/grub/grub.conf, append "console=ttyS0,115200n8 console=tty0" to the kernel line for 2.6.18-53.1.13.el5
2, Shutdown the VM
3, Edit Settings > Add > Serial Port > To File > [netapp-c-fcal1] fx-linux-tbox/console.log  (would vastly prefer /tmp on the VM host but couldn't figure out how to refer to that)
4, Restart VM

The log is at /vmfs/netapp-c-fcal1/fx-linux-tbox/console.log. There are no timestamps; looks like it's a kernel compile option :-(.
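(A hedged sketch of step 1 as a one-liner; it assumes the stock RHEL/CentOS grub.conf layout, backs the file up first, and should only be run once:)

$ cp /boot/grub/grub.conf /boot/grub/grub.conf.bak
$ sed -i '/kernel .*2\.6\.18-53\.1\.13\.el5/ s/$/ console=ttyS0,115200n8 console=tty0/' /boot/grub/grub.conf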
Summary: Connection problems to netapp-c causing corruption of tinderbox file systems → Connection problems to netapps causing corruption of tinderbox file systems
Both fx-linux-tbox and fx-win32-tbox have been up and down all day, so I think the resets are getting worse. The windows box has taken to rebooting itself (or perhaps that's someone else). This is the tip of the iceberg, but these are two machines that we have to keep running.

I've given them one last go with a reboot, fsck/chkdsk all the partitions, remove objdir, & start tinderbox, but I suspect they'll die/burn/kill-a-kitten after as few as 3 builds. Am I missing something profoundly obvious in resuscitating these machines ? Perhaps disk checks are dangerous when connectivity is intermittent, but I've not seen any of the scsi errors while actually using a linux console.

Can we use the netapp/vmware logs to figure out if a particular box is causing this ? Or if one (or more) is broken and making it worse ? If not, it's time to close the Firefox tree.
(In reply to comment #36)
> Another angle of attack - we started having these problems on Tuesday
[snip]
> Were there any infra changes around then ?
> 
Really good question, Nick. I don't know of any changes.

Justin? mrz?
Status: REOPENED → ASSIGNED
nothing except the ontap upgrade.  still working with netapp/vmware - support is less than stellar given the holiday.  if there are a few critical boxes, they should be storage-vmotioned to netapp-d or -a.  note this will only work for a few as there is not a lot of spare storage.
(In reply to comment #45)
> Can we use the netapp/vmware logs to figure out if a particular box is causing
> this ? Or if one (or more) is broken and making it worse ? If not, it's time to
> close the Firefox tree.

On irc, Nick and I decided we might as well close the tree now. Machines start burning again as fast as we can repair them. Over the last few days, it's been kinda impossible to find a non-burning time to land anyway, so closing the tree seems academically honest. If things stabilize, we'll happily reopen.

Note: this problem is impacting machines on Mozilla1.8, Mozilla1.9 and Mozilla2. I just closed Mozilla1.9 now. I would have had to close Mozilla1.8 and Mozilla2 also, but they were already closed for other reasons. 
so, think I have stabilized the situation for the moment.  we failed over to the secondary head after a lot of diagnosis, so all the arrays are being served off the one head, which is fine as load is not an issue.  we haven't seen the resets/reconnects for over an hour (used to happen every 10 min), so we'll monitor through the night.

action plan for tomorrow is to take a core of the head that was having issues, and diagnose once we are sure the issue is isolated to the single head.  in the meantime, we need normal production load thrown at the array, and I don't have any idea what state the machines are in.

build - can you please bring up all the tinderboxen and monitor for crashes?  please list out which machines crash, if any do. 
still not a single error - need build to verify vm's are up and running with no issues.  please verify asap.
Assignee: justin → joduinn
Status: ASSIGNED → NEW
I'm now home, and will start looking through them all. If I can get them all up and running ok, I'll reopen the tree to see how it goes. Stay tuned...
balsa-18branch
production-prometheus-vm
production-pacifica-vm
fx-linux-tbox 

...were all burning because of the same disk corruption errors from above. I've now cleaned up the filesystems for these 4 VMs, and restarted the slaves. Let's see how these builds go.
These VMs are all now green, so I've reopened the tree. Let's see if the work in comment#49 holds up to the load. (John crosses his fingers)
Starting to bring up other VMs now.
Back in action:
l10n-win32-tbox (also updated vmware tools)
l10n-linux-tbox
tbnewref-win32-tbox
xr-win32-tbox

Checking disks now:
crazyhorse
karma
cerberus-vm
netapp ran clean all night - can you guys confirm no crashes?

now, next step in the investigation, I need to fail back to netapp-c (no downtime), wait for the issue, and get a core so they can figure out what is going on - then we'll go back to a steady state on -d while they analyze.  I know this is quite disruptive, so we can do this either today when one of you has some time to make sure things come back or on Tuesday - let me know what works best for your schedule and the build schedule.

sorry for the hassle here, but I think we are getting close to a resolution.
VMs stayed up and stable overnight, so let's leave well enough alone for now. 
Thanks for all the help over the holiday weekend!

Tomorrow (Tuesday) morning, we'll have a go/nogo decision for rc2. If we do not need rc2, we can move back to netapp-c immediately. If we do need rc2, we'll need to wait until after builds&updates are handed to QA before we can move back to netapp-c, so likely Thursday. 
Per Firefox3 meeting just now, this is on hold until after we produce FF3.0rc2 builds.
Priority: P1 → P3
I stumbled on this while looking for an update on FF3... My org went through a string of very similar issues last September on ESX 3.0.1/3.0.2 connecting to a NetApp.  It affected us using both iSCSI and NFS, from the ESX kernel and from within the VMs.  The problem with 3.0.2 was MUCH worse than with 3.0.1... we've rolled back to 3.0.1.  What version were you at pre-3.5?  I'm 99% sure this is going to turn out to be a problem with VMware and/or other hardware, not the NetApp (we had the same issue with a CoRAID device and with a NetApp at the same time).  The May 8th ESX upgrade & May 9th error messages are most likely not a coincidence.

Be careful putting your investigations on hold for a long period of time with the VM's running.  Disk errors at the VM level _can_ end up corrupting the VM's OS--sometimes to the point where the OS can't boot anymore (particularly if you're dropping iSCSI connections altogether).

A few things to look for:

- Timing: clock drift, NTP problems, kerberos problems (can impact iSCSI connections), incorrect time/date etc. on VM's and on ESX servers.

- Network: 
  . any recent changes? upgrades? re-configurations? firewall policies?
  . switch/router health/load
  . MTU settings/capabilities
  . duplex settings
  . cables (don't overlook this)
  . trace routes from VMs to NetApp, from ESX to NetApp, etc.  Anything unusual?

- Web servers: incomplete/truncated downloads (particularly on larger files or when the network is under heavy load).

- DB Servers: check for anything unusual in the DB logs.

- ESX Servers:
  . try copying a large file from one of the iSCSI mount points to another location a few times (e.g., a large .vmdk file). Look for messages like "cp: reading `vm_name-flat.vmdk': Input/output error"
  . how and where are you mounting your iSCSI connections?  With VMware's software iSCSI Initiator?  hardware initiator?  Within VM's?

- Linux VMs:
  . filesystems unmounting and re-mounting as read-only?
  . files disappearing/re-appearing/changing sizes sporadically, particularly on network-based file systems (NFS in our case). 

- Windows VMs (Event Log Viewer):
  . symmpi:15:"The device, \Device\Scsi\symmpi1, is not ready for access yet." 
  . Disk:11:"The driver detected a controller error on \Device\Harddisk0."
  . Application Popup:333:"An I/O operation initiated by the Registry failed unrecoverably. The Registry could not read in, or write out, or flush, one of the files that contain the system's image of the Registry."
  . iScsiPrt:7:"The initiator could not send an iSCSI PDU. Error status is given in the dump data." (if you are making iSCSI connections from within a VM)
  . iScsiPrt:20:Connection to the target was lost.  The initiator will attempt to retry the connection.

Our attempted solutions:
- Rollback to 3.0.1
- Test/replace network cables
- Upgrade network equipment (jumbo frame support, dedicated gigabit ports)
- Completely isolate storage-related networks on their own dedicated switches (we're still in the process of doing this)
- Switch to fiber (ideally) or NFS from iSCSI where possible.

We're not really sure why the problem went away.  We fixed as many things as we could identify as possible causes.  The biggest impact we could observe was when we rolled back to ESX 3.0.1.

Hope this helps & sorry to read this.  I'm curious to see how things work out for you... Hopefully it won't be as painful an ordeal for you as it was for us.
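(A hedged sketch of the ESX copy test suggested above, run from the service console; the datastore and vmdk names here are placeholders, not machines from this bug:)

$ SRC=/vmfs/volumes/netapp-c-fcal1/example-vm/example-vm-flat.vmdk
$ for i in 1 2 3; do
>   cp "$SRC" /vmfs/volumes/other-datastore/copytest.$i.vmdk || echo "copy $i failed"
> done
$ # watch /var/log/vmkernel and dmesg for I/O errors while the copies run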
(In reply to comment #59)
> Per Firefox3 meeting just now, this is on hold until after we produce FF3.0rc2
> builds.
> 
ok, builds and updates are now generated. Justin, sorry today's release took longer than expected; it's now ok to try those NetApp diagnostic tests.
Assignee: joduinn → justin
(In reply to comment #60)
hi Scott;

> I stumbled on this while looking for an update on FF3... My org went through a
> string of very similar issues...
[snip]
> Be careful putting your investigations on hold for a long period of time with
> the VM's running.  Disk errors at the VM level _can_ end up corrupting the VM's
> OS--sometimes to the point where the OS can't boot anymore (particularly if
> you're dropping iSCSI connections altogether).
Once we got things stable over the weekend, we decided to hold off on investigations until this evening, because it was being so disruptive to the VMs on tinderbox... hence disruptive to developers trying to land the last few patches for FF3.0rc2.

Now that FF3.0rc2 is fully built and handed off to QA, we're back looking at this again. It's quite a worry.


> A few things to look for:
[snip]
> - Linux VMs:
>   . filesystems unmounting and re-mounting as read-only?
>   . files disappearing/re-appearing/changing sizes sporadically, particularly
> on network-based file systems (NFS in our case). 
We have been hit by this for a while now. The VMware response was to have us do kernel updates. See bug#407796 for details. That might or might not have solved the intermittent problem; hard to tell with the recent mischief from NetApp.

[snip]
> Hope this helps & sorry to read this.  I'm curious to see how things work out
> for you... Hopefully it won't be as painful an ordeal for you as it was for us.
Your comments were very helpful, thanks for all the info. Sounds like an ugly experience, I also hope its not as painful for us!! 

tc
John.
Priority: P3 → P1
Justin, what's the current status on this ? Is netapp-c in normal mode or failed over to -d at the moment ? We're seeing VMs with problems over the last day or so, eg bug 407796 comment #86 and #87 but also some compile errors which look like corruption.
Escalated this more and got more information.  Netapp believes it's an issue with the LUN.  Current plan of action is to storage vmotion all VMs off a LUN (without downtime) to another array we have, rebuild the LUN, then move the VMs back.  Storage vmotion takes some time so this is a slow process.  If there are any VMs that could stand an hour of downtime, please let us know so we can move them faster.

Given the issue is with the LUN itself, netapp seems confident failing over to netapp-d won't solve the issues (as we still saw them in the failed-over state).  I'll ping you guys today on IRC to go over the plan more.
Approx 10.05 this morning, staging-pacifica-vm02 hit:

remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process 

...which looks like it lost connection to staging-1.8-master. The next few attempts to build failed out with different errors.

By approx 10.30 this morning, staging-pacifica-vm02 was back running normally again. 

Is either of these VMs on the suspect LUN?
Yes.  Any VM with a datastore name containing "netapp" in it is.
Discovered qm-centos5-moz2-01 down just now, with a linux scsi kernel panic. Unclear how long it's been down.

Justin restarted VM.
qm-win2k3-moz2-01 was broken, first with:

remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process

...and then with file permission errors trying to delete files, causing all builds to fail. 

I've restarted the VM.
qm-win2k3-moz2-01 = netapp-c-fcal1, expected.  Thanks John.
Looks like prod-pacifica-vm is hitting this again?
prod-pacifica-vm, balsa-18branch and crazyhorse all hit by continued NetApp woes. 

I'll start to repair and restart these VMs right now, but unclear how long they will remain up before being hit by this problem again.
(In reply to comment #74)
> prod-pacifica-vm, balsa-18branch and crazyhorse all hit by continued NetApp
> woes. 
> 
> I'll start to repair and restart these VMs right now, but unclear how long they
> will remain up before being hit by this problem again.
> 

We are blocked on making progress on the core netapp issue as we can't move the last two machines on the shelf per build.  On hold till Monday.
prod-pacifica-vm and balsa-18branch now repaired and restarted. They're past where they were failing before, and I hope are now ok. I'll keep watching over the next few hours.

crazyhorse is still going through fsck problems with corrupted superblocks.
(In reply to comment #75)
> (In reply to comment #74)
> > prod-pacifica-vm, balsa-18branch and crazyhorse all hit by continued NetApp
> > woes. 
> > 
> > I'll start to repair and restart these VMs right now, but unclear how long they
> > will remain up before being hit by this problem again.
> > 
> 
> We are blocked on making progress on the core netapp issue as we can't move the
> last two machines on the shelf per build.  On hold till Monday.

Errr... I thought the only VMs not being moved were the buildbot masters controlling all the talos and unittest machines. 

What happened to the other VMs that were in transition Friday - are they now moved? And what shelf are these failing slaves (crazyhorse, prod-pacifica-vm and balsa-18branch) on - and can these be moved?
balsa-18branch failed out again. It made it further than comment#74, but failed out with similar errors. While repairing again, I hit this weirdness:


$ pwd
/builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend
$ mv mozilla mozilla.bad.002 
$ mkdir mozilla       
mkdir: cannot create directory `mozilla': File exists
$
$ ls -la 
...
drwxrwxrwx   11 cltbld   users        4096 Jun  7 19:38 mozilla
drwxrwxrwx   53 cltbld   users        4096 Jun  7 18:46 mozilla.bad.001
drwxr-xr-x   58 cltbld   users        4096 Jun  7 19:38 mozilla.bad.002
...

After 30+ seconds the "mozilla" directory disappeared?!?! Since when are "mv" operations not atomic?


I've now re-rebooted balsa-18branch and re-restarted tinderbox on it. Fingers crossed.
crazyhorse is now up and building. It took a few fsck-with-repairs and reboot cycles by Trevor, and also two fsck-with-repairs and reboot cycles by me. 

Build started, let's see how it goes. Fingers crossed.
> Errr... I thought the only VMs not being moved were the buildbot masters
> controlling all the talos and unittest machines. 
> 
> What happened to the other VMs that were in transition Friday - are they now
> moved? And what shelf are these failing slaves (crazyhorse, prod-pacifica-vm
> and balsa-18branch) on - and can these be moved?

They were all done Saturday morning - just waiting on those last two to re-build the LUN. 

production-prometheus-vm.build.m.o started failing out at 00:07 with file access problems and then cvs refresh problems, one of the common symptoms we've seen recently with this NetApp issue.

Without any action from me, production-prometheus-vm is now passing the point where it would usually fail. It looks like it might have self-recovered somehow, so I'm not going to kill/restart it.
balsa-18branch, crazyhorse, production-prometheus-vm, and prod-pacifica-vm are all red right now. Three of those build nightlies.
(In reply to comment #80)
> > Errr... I thought the only VMs not being moved were the buildbot masters
> > controlling all the talos and unittest machines. 
> > 
> > What happened to the other VMs that were in transition Friday - are they now
> > moved? 
> They were all done Saturday morning - just waiting on those last two to
> re-build the LUN. 
Great. 

> > And what shelf are these failing slaves (crazyhorse, prod-pacifica-vm
> > and balsa-18branch) on - and can these be moved?
These failed again. Question remains... can these be moved also or do we have to leave them where they are until after LUN repair?
(In reply to comment #83)
> These failed again. Question remains... can these be moved also or do we have
> to leave them where they are until after LUN repair?

If that question is for me, my question back would be: How long of a down time are we talking about to move them? If it's less than six or so hours, move them now so we can get nightlies tomorrow and hopefully give a go to release.
(In reply to comment #85)
> *** Bug 437893 has been marked as a duplicate of this bug. ***
Loss of qm-rhel02 in bug#437893 caused Talos machines to go offline, and close moz2 and mozilla1.9 branches. 


(In reply to comment #84)
> (In reply to comment #83)
> > These failed again. Question remains... can these be moved also or do we have
> > to leave them where they are until after LUN repair?
> 
> If that question is for me, my question back would be: How long of a down time
> are we talking about to move them? If it's less than six or so hours, move them
> now so we can get nightlies tomorrow and hopefully give a go to release.
Question was for Justin. 

I've since talked to Justin on phone. Once he's finished moving the masters (above), he'll start moving these slaves also. Because of the sick LUN, there are limits on how many VMs he can migrate at a time. Justin will update this bug with info as the VMs are migrated.
Specifically, qm-rhel02 and production-master are both now down, so they can be moved to a healthy NetApp.
both of these have been moved - rebuilding the shelf now.  After this, one more to go.
Fixed up balsa-18branch this morning, crazyhorse has a fsck in progress. 

Attempted to move production-pacifica-vm to d-sata-build-003 but it failed with "The virtual disk is either corrupted or not a supported format." Ran chkdsk and got the same error again after getting to a few tens of percent. Is this really a network error during the copy ?

Also attempted to "Complete migration" on production-prometheus-vm; it failed out with similar messages to Administrator's attempts (=justin?).
Blocks: 437257
We just powered off all VMs on netapp-c-fcal, see list below. The trees are closed anyway because of failing VMs, so this seems worth trying. In theory this should make it faster to move them off this LUN. 

 bm-symbolfetch01
 bm-win2k3-pgo01
 build-console
 fx-linux-tbox
 fx-win32-tbox
 l10n-win32-tbox
 mobile-linux-slave1
 moz2-master
 moz2-win32-slave1
 production-pacifica-vm
 production-prometheus-vm
 production-trunk-automation
 qm-centos5-moz2-01
 qm-moz2-unittest01
 qm-win2k3-moz2-01
 qm-win2k3-stage-pgo01
 staging-build-console
 staging-pacifica-vm
 staging-trunk-automation
 tbnewref-win32-tbox
 try2-linux-slave
 try2-win32-slave 
Justin and mrz think they've nailed it (as of 11pm-ish). A few remaining VMs still need to be moved from netapp-c-fcal1:
l10n-win32-tbox
moz2-win32-slave1
qm-win2k3-stage-pgo01
staging-pacifica-vm
staging-1.9-master
tbnewref-win32-tbox
try2-linux-slave
try2-win32-slave

There are known issues with karma and prometheus that need to be fixed (from irc with justin).

The production-1.9-master and fx-linux-1.9-slave2 VMs are both confirmed ok, and now in use.
We'll now start bringing back up all the various build/unittest/talos machines, repairing corrupted file-systems as needed. Depending on how many VMs need how much repair, we expect to start reopening the 1.8/1.9/moz2 trees sometime Tuesday. 

Watch this space.
(In reply to comment #91)
> Justin and mrz think they've nailed it (as of 11pm-ish). A few remaining VMs
> still need to be moved from netapp-c-fcal1:
> l10n-win32-tbox
> moz2-win32-slave1
> qm-win2k3-stage-pgo01
> staging-pacifica-vm
> staging-1.9-master
> tbnewref-win32-tbox
> try2-linux-slave
> try2-win32-slave
> 

Looks like these have all been moved. l10n-win32-tbox is still shutdown. Can we start it again?
tb-linux-tbox is in a kernel panic right now. It's still at least partly on netapp-c-001 so I'll wait for an OK before restarting it.
The only remaining misconfigured LUN is netapp-c-fcal1.  netapp-c-001 should be fine.

Oh, and it's only on netapp-c-001 in the sense that its cdrom drive is some ISO image off that datastore which I can't seem to change while the host is running (or perhaps because vmware tools isn't yet running).
At this point, all machines/VMs are back up, and staying up, *except*:

1.9 and moz2
============
qm-rhel02 (runs talos, unittest)

1.8
===
crazyhorse
karma
production-prometheus-vm

We're making progress, but will continue to keep all these trees closed for now. As we get VMs repaired, we'll update this bug. 
We seem to have sufficient box coverage on the Mozilla2 tinderbox now. Do we need to keep the tree closed?
It is of a lesser priority since it is a staging machine, but I'm still missing qm-buildbot01.
(In reply to comment #96)
> At this point, all machines/VMs are back up, and staying up, *except*:
> 
> 1.9 and moz2
> ============
> qm-rhel02 (runs talos, unittest)

...has been working without going read-only since mid-morning. Looks like the
recurring read-only-ness might also be resolved by last night's fix? We've let a few unittest and talos runs go through to verify that all are connected and working again. It all looks good, so I've now reopened the tree for mozilla-central.


Work continues on repairing the VMs needed for 1.8.
(In reply to comment #98)
> It is of a lesser priority since it is a staging machine, but I'm still missing
> qm-buildbot01.

Totally left off the RADAR - it's on a classic QA ESX server.  Might as well take the opportunity to move the VM over to Build land.
I've disabled qm-centos5-01 from the tinderbox waterfall (Firefox) because it's misbehaving. I was unable to resurrect it after the event this morning.
(In reply to comment #100)
> (In reply to comment #98)
> > It is of a lesser priority since it is a staging machine, but I'm still missing
> > qm-buildbot01.
> 
> Totally left off the RADAR - it's on a classic QA ESX server.  Might as well
> take the opportunity to move the VM over to Build land.

It lives on bm-vmware02 now and it's up.
(In reply to comment #96)
> 1.8
> ===
> crazyhorse
tracked in bug#437798

> karma
> production-prometheus-vm
tracked in bug#438386


(In reply to comment #101)
> I've disabled qm-centos5-01 from the tinderbox waterfall (Firefox) because it's
> misbehaving. I was unable to resurrect it after the event this morning.
Is qm-centos5-01 still a problem? Can it be repaired or do we need to create new clone?
(In reply to comment #103)
> (In reply to comment #101)
> > I've disabled qm-centos5-01 from the tinderbox waterfall (Firefox) because it's
> > misbehaving. I was unable to resurrect it after the event this morning.
> Is qm-centos5-01 still a problem? Can it be repaired or do we need to create
> new clone?

Based on my efforts with it yesterday, I think it will need to be rebuilt. Maybe mrz can resurrect it where I have failed though.
qm-vista02 on qm-vmware01 is dead in the water. Unable to power it on or access it via the VMWare console or RDP.
Does it say (inaccessible) after it in Virtual Infrastructure client?  Try to remove it from inventory, rename the directory the VM files are in, and manually re-add it to the inventory.  If it won't re-add, see if you can recover or move the VM's folder to another location and re-add it from a different ESX server/cluster.  You're probably in better shape if it won't even start up than you are if the VM starts with BIOS or OS errors about the disk.
https://bugzilla.mozilla.org/show_bug.cgi?id=438664

looking to re-image qm-centos5-02 as it is being erratic and not building well at all.
(In reply to comment #105)
> qm-vista02 on qm-vmware01 is dead in the water. Unable to power it on or access
> it via the VMWare console or RDP.

thought that was in another bug - should be up.
No longer blocks: 438664
Depends on: 438664
Depends on: 437798
karma is up and running again
From bugs and irc, here's the list of VMs reported as problems and being repaired today. Most are now fixed, and confirmed working ok. 


[ ] crazyhorse
[+] karma
[+] production-prometheus-vm
[+] qm-buildbot01
[+] qm-vista02
[+] qm-centos5-01
[ ] qm-centos5-02

We just have qm-centos5-02 and crazyhorse being worked on.
crazyhorse is working.

[+] crazyhorse
[+] karma
[+] production-prometheus-vm
[+] qm-buildbot01
[+] qm-vista02
[+] qm-centos5-01
[ ] qm-centos5-02

qm-centos5-02 is still being worked on, it looks like an application problem, but still investigating.
qm-centos5-02 is working.

[+] crazyhorse
[+] karma
[+] production-prometheus-vm
[+] qm-buildbot01
[+] qm-vista02
[+] qm-centos5-01
[+] qm-centos5-02

Closing!
Status: NEW → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
For the past ~24 hours, there seem to have been issues with qm-win2k3-pgo01 (bug 440531) and qm-xserve01 (bug 440536).

I just filed those two bugs for these issues, but I'm mentioning it here as well, per this message on the Tinderbox page:
> ... but if you see weirdness, not caused by checkins
> please add details to bug bug 435134 .

xserve01 is a physical machine, so I doubt that one is related...
Product: mozilla.org → mozilla.org Graveyard