Bug 435134
Opened 16 years ago
Closed 16 years ago
Connection problems to netapp's causing corruption of tinderbox file systems
Categories
(mozilla.org Graveyard :: Server Operations, task, P1)
mozilla.org Graveyard
Server Operations
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: joduinn, Assigned: justin)
Attachments
(2 files)
On http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla1.8, the following machines are burning. Interestingly, they all started burning around 05:57am-06:10am this morning. Even though they all give different errors, the timing makes me wonder if these machines are failing for a related reason?

balsa-18branch
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211416560.1211416689.25547.gz&fulltext=1
ends with:
cvs checkout: cannot open directory /cvsroot/:ext:ffxbld@cvs.mozilla.org:/cvs: No such file or directory
cvs checkout: skipping directory mozilla/extensions/webdav/tests

crazyhorse
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211415900.1211416021.23668.gz&fulltext=1
ends with:
cc1plus: internal compiler error: Segmentation fault

prod-pacifica-vm
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211415600.1211416259.24280.gz&fulltext=1
ends with:
../../dist/lib/gkxtfbase_s.lib : fatal error LNK1136: invalid or corrupt file
Reporter
Updated•16 years ago
Severity: normal → blocker
Priority: -- → P1
Comment 1•16 years ago
To add to the fun of "random" failures: crazyhorse actually had five failures earlier, from 02:17 to 03:03, claiming rather massive CVS trouble with its tree, though the nightly clobber cleared it up. And prod-pacifica-vm failed on its first try at a nightly, dying with "make[5]: *** read jobs pipe: No such file or directory. Stop." Either or both of these may or may not be connected.
Reporter
Comment 2•16 years ago
production-prometheus-vm just started burning with the following error:
lib/libxulapp_s.a -L../../dist/lib -lmozpng -L../../dist/lib -lmozjpeg -L../../dist/lib -lmozz -L-L../../dist/bin -L../../dist/lib -lcrmf -lsmime3 -lssl3 -lnss3 -lsoftokn3 -lmozcairo -lmozlibpixman -L/usr/X11R6/lib -lXrender -lX11 -lfontconfig -lfreetype -L/usr/X11R6/lib -lXt -L/usr/X11R6/lib -lXft -lX11 -lfreetype -lXrender -lfontconfig -L../../dist/lib -lxpcom_compat
../../dist/lib/components/libgklayout.a: could not read symbols: Memory exhausted
collect2: ld returned 1 exit status
Summary: 3 machines start burning at same time on 1.8 waterfall page → 4 machines start burning at same time on 1.8 waterfall page
Reporter
Comment 3•16 years ago
I note there have been no checkins for days on that branch, so it's not clear what is causing all these machines to start burning now.
Reporter
Comment 4•16 years ago
Argh... and a bunch of talos machines off also? Is this bug a dup of bug#435052?
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → justin
Updated•16 years ago
Assignee: server-ops → mrz
Comment 5•16 years ago
Not entirely sure what the action to fix this is. I don't see anything network-wise to indicate a problem. What else would cause this error?
cvs checkout: cannot open directory /cvsroot/:ext:ffxbld@cvs.mozilla.org:/cvs: No such file or directory
cvs checkout: skipping directory mozilla/extensions/webdav/test
Does that mean cvs.mozilla.org couldn't find the file to serve?
Assignee
Comment 6•16 years ago
I see nothing in these errors to indicate this has anything to do with the network. Bug 435052 on the other hand clearly points to a possible network hiccup, but at totally different times, and is unrelated (I think). Can this go back to releng due to the lack of anything pointing at the network layer?
Reporter
Comment 7•16 years ago
Note: I've tried restarting tinderbox on crazyhorse around 8pm. It's done a few builds at this point, and is still burning.
Assignee
Comment 8•16 years ago
kicking back to releng for troubleshooting. feel free to pull us in if you have any evidence this might be infra related.
Assignee: mrz → joduinn
Component: Server Operations → Release Engineering
Assignee
Comment 9•16 years ago
so I may be wrong here - per John, all of these are VMs with network storage, and all had issues at the same time. Given that, this may very well be a network issue, if the storage fell out from under the host. Still grasping for a reason or direction to diagnose, though.
Assignee: joduinn → mrz
Component: Release Engineering → Server Operations
Comment 10•16 years ago
The cvs error is the only one that stands out as not local to the host - the others could all be SAN related (corrupt file; memory exhaustion, if it's the page file?). But I don't see anything on any switches, or any switch ports going to any of the filers, that shows anything alarming. Still looking...
Comment 11•16 years ago
fyi, all of these VMs are on netapp-c.
Reporter
Comment 12•16 years ago
I've just rebooted the crazyhorse VM. It's booting and rechecking disks (very slow!). Maybe this reboot and a clobber build will do the trick?
Reporter
Comment 13•16 years ago
mrz just notified me that fx-linux-tbox is also hung, trying to do a build since 16:04 this afternoon.
Comment 14•16 years ago
Dunno if it's related or not, but qm-centos5-01, qm-centos5-03, qm-xserve01, and (a bit less likely) qm-win2k3-pgo01 are failing in what doesn't look to me like a familiar way - the Linux and Mac ones since late morning, the Win-pgo... hard to tell, since it takes forever, and seems to have done one burning build and one "green" with no results build today.
Reporter
Comment 15•16 years ago
(In reply to comment #12)
> I've just rebooted crazyhorse VM. Its booting and rechecking disks (very
> slow!). Maybe this reboot and a clobber build will do the trick?
Justin pushed crazyhorse through two unhappy fsck's while I was driving home. Crazyhorse is now cleanly rebooted. I've restarted tinderbox, but don't have privs to trigger a clobber build. Let's see if a depend build goes ok on crazyhorse.
Reporter
Comment 16•16 years ago
(In reply to comment #13)
> mrz just notified me that fx-linux-tbox is also hung, trying to do a build
> since 16:04 this afternoon.
After mrz killed two "Bug Buddy" instances and rebooted the VM, fx-linux-tbox is now green again.
Reporter
Comment 17•16 years ago
(In reply to comment #15)
> Justin pushed crazyhorse through two unhappy fsck's while I was driving home.
> Crazyhorse is now cleanly rebooted. I've restarted tinderbox, but dont have
> privs to trigger clobber build. Lets see if depend build goes ok on crazyhorse.
Success - crazyhorse just turned green!
Reporter
Comment 18•16 years ago
After the initial errors balsa-18branch hit in comment#0, balsa-18branch continues to fail, but now fails with:
C mozilla/intl/unicharutil/util/.cvsignore
C mozilla/intl/unicharutil/util/Makefile.in
C mozilla/intl/unicharutil/util/nsCompressedCharMap.cpp
C mozilla/intl/unicharutil/util/nsCompressedCharMap.h
C mozilla/intl/unicharutil/util/nsUnicharUtils.cpp
C mozilla/intl/unicharutil/util/nsUnicharUtils.h
cvs [checkout aborted]: /case.dat/1.1/Fri Jan 8 00:19:24 199/CVSROOT: No such file or directory
gmake: *** Conflicts during checkout.
C mozilla/intl/unicharutil/util/.cvsignore
C mozilla/intl/unicharutil/util/Makefile.in
C mozilla/intl/unicharutil/util/nsCompressedCharMap.cpp
C mozilla/intl/unicharutil/util/nsCompressedCharMap.h
C mozilla/intl/unicharutil/util/nsUnicharUtils.cpp
C mozilla/intl/unicharutil/util/nsUnicharUtils.h
gmake[1]: *** [real_checkout] Error 1
gmake[1]: Leaving directory `/builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend'
gmake: *** [checkout] Error 2
Error: CVS checkout failed.
I did the following:
- stop multi tinderbox
- renamed /builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend/mozilla to /builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend/mozilla.bad
- start multi tinderbox
...and it's taking longer to fail out, which is I guess progress. Stay tuned.
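The cleanup above is a generic quarantine-and-restart pattern: move the suspect checkout aside so the next cycle starts from a fresh CVS checkout, while keeping the corrupt tree around for inspection. A minimal sketch, assuming the builder re-creates the `mozilla` directory on its next cycle; the temp-dir setup and helper name are illustrative stand-ins, not the real tinderbox layout:

```python
import os
import tempfile

def quarantine_checkout(builddir, name="mozilla"):
    """Move the suspect checkout aside (hypothetical helper, not from the bug).

    The corrupt tree is kept as <name>.bad for post-mortem diffing; the
    builder's next cycle then performs a fresh checkout into <name>.
    """
    src = os.path.join(builddir, name)
    dst = src + ".bad"
    os.rename(src, dst)
    return dst

# Demo against a temp-dir stand-in for /builds/tinderbox/.../Linux_2.4.7-10_Depend
builddir = tempfile.mkdtemp()
os.makedirs(os.path.join(builddir, "mozilla"))
bad = quarantine_checkout(builddir)
print(os.path.basename(bad))  # mozilla.bad
```

Stopping and restarting the tinderbox client around the rename (the "stop multi tinderbox" / "start multi tinderbox" steps) is deliberately left out of the sketch.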
Reporter
Comment 19•16 years ago
Success - balsa-18branch is now green.
Comment 20•16 years ago
Here are the scsi errors from the system log of production-prometheus-vm, with the dhclient lines removed. Running fsck on it now, which is finding lots of things to fix.
Comment 21•16 years ago
I think these machines are all fixed up now, even if we don't know the cause. Some details:
production-prometheus-vm (netapp-c-fcal1/bm-vmware11, vmware tools out of date)
* required a bunch of fsck runs and reboots to get the partitions fixed up, and vmware tools updated
production-pacifica-vm (netapp-c-fcal1/bm-vmware11, vmware tools out of date)
* used |chkdsk C:\ /F| to get a check on reboot, no errors found
balsa-18branch (netapp-c-001/bm-vmware07, vmware tools out of date)
* updated vmware tools (switched from rpm to tar.gz install); the reboot from that started running fsck for days-since-last-check
* when restarting this box, you have to open a console using the VI client and run ~/start-X-repeatedly.sh (otherwise it fails the tinderbox tests)
crazyhorse (netapp-c-001/bm-vmware09, vmware tools out of date)
* updated vmware tools automatically, rebooted
We also have a problem with:
moz2-win32-slave (netapp-c-fcal1/bm-vmware06, vmware tools out of date)
* checked C, D, and E disks
* updated vmware tools
Comment 22•16 years ago
(In reply to comment #21)
> We also have a problem with:
This should say "had".
Comment 23•16 years ago
The log isn't super verbose and this is mainly here for timing. The disk error is:
The driver detected a controller error on \Device\Harddisk0
The symmpi one is:
The device, \Device\Scsi\symmpi1, is not ready for access yet
Comment 24•16 years ago
Random guessing time -
Could be a transient network fault, as we guessed yesterday. Or it could be something in the netapp itself. The machines here all seem to be on netapp-c-001 and netapp-c-fcal1 - can I read that as different shelves on the netapp-c backplane/device/host? Is there anything in the logs for that netapp which indicates a problem? (eg did some drives die and it's coping as best it can?)
I had a quick look at /var/log/messages on linux boxes using netapp-d-001 (l10n-linux-tbox, prometheus-vm, moz2-linux-slave1) and don't find any scsi errors like attachment 322081 [details]. Do see them for karma, which is on netapp-c-001 and had a couple of red builds yesterday.
There are lots of scsi messages in bm-vmware09:/var/log/{vmkernel,vmkwarning}, which seem to start on May 9 - possible fallout from the ESX3.5 upgrade on May 8, or just a change in logging messages?
The VI client reports some inaccessible VMs on bm-vmware08:
* bm-symbolfetch01 (netapp-c-fcal1) - should have been on
* try2-linux-slave, try2-win32-slave (netapp-c-fcal1) - should be on, can see other try VM's so probably not a network sandbox thing
* backup-pacifica-vm, backup-prometheus-vm (netapp-c-001) - would have been off
* tb-win32-tbox (netapp-c-001) - would have been off
Let us know if there's anything we can do to help diagnose.
Assignee
Comment 25•16 years ago
Really helpful Nick - thanks. All signs point to a network issue, as we saw this plus the disconnects in the morning. The netapp is just exposing a block device, so any corruption would most likely happen as a result of the host losing network access to the netapp and having issues. I'll talk with mrz more tomorrow morning and dig into the netapp logs to see if I can find anything else...
Assignee
Comment 26•16 years ago
from the netapp logs - starting on May 20th, we start seeing a lot of reconnects and Abort Task requests from the vmware hosts:
Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware09) sent LUN Reset request, aborting all SCSI commands on lun 2
Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware05) sent LUN Reset request, aborting all SCSI commands on lun 1
Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware07) sent LUN Reset request, aborting all SCSI commands on lun 1
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:pm-vmware01 at IP addr 10.253.0.92
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:bm-vmware10 at IP addr 10.253.0.229
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:bm-vmware03 at IP addr 10.253.0.223
Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1998-01.com.vmware:bm-vmware06 at IP addr 10.253.0.237
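For logs like these, a small script can summarize how often each ESX host sent a LUN Reset. This is a hypothetical helper, not something used in the bug; the regex is written against the exact message format quoted above:

```python
import re
from collections import Counter

# Matches lines like:
#   ... ISCSI: Initiator (iqn.1998-01.com.vmware:bm-vmware09) sent LUN Reset request ...
LUN_RESET = re.compile(
    r"Initiator \(iqn\.[\w.:-]+:(?P<host>[\w-]+)\) sent LUN Reset request"
)

def count_lun_resets(lines):
    """Return a Counter mapping initiator host name -> number of LUN resets."""
    counts = Counter()
    for line in lines:
        m = LUN_RESET.search(line)
        if m:
            counts[m.group("host")] += 1
    return counts

log = [
    "Tue May 20 22:05:46 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: "
    "Initiator (iqn.1998-01.com.vmware:bm-vmware09) sent LUN Reset request, "
    "aborting all SCSI commands on lun 2",
    "Wed May 21 03:58:42 GMT [mpt-netapp-c: iscsi.notice:notice]: ISCSI: "
    "New session from initiator iqn.1998-01.com.vmware:bm-vmware10 at IP addr 10.253.0.229",
]
print(count_lun_resets(log))  # Counter({'bm-vmware09': 1})
```

A spike in one host's count would point at that ESX box (or its path to the filer) rather than the filer itself.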
Assignee
Comment 27•16 years ago
Netapp P1 Case #3101612 opened...
Comment 28•16 years ago
We had some more issues with machines on netapp-c today, slightly different symptoms but most likely related to this (I'm just making a note of it really, while we wait for Netapp to respond).
fx-linux-tbox - tinderbox crashed out at some point after 20:15 PDT yesterday. The symlink from /bin/sh to bash was broken, but was fixed after a reboot (no disk check was done).
balsa-18branch - stopped cycling at 20:25 PDT yesterday; multi-tinderbox was running but wasn't doing anything. Nothing in /var/log/messages since May 18 (odd given the reboots it had Thursday). Rebooted it, it didn't do a disk check, started tinderbox, cycled OK.
karma - CVS conflicts from 20:09 PDT. Rebooted, fsck triggered, still in progress ...
Comment 29•16 years ago
sm-try2-win32-slave has had some trouble too. Lots of errors like this in the system log:
The driver detected a controller error on \Device\Harddisk0.
I forced a check on all of the disks and rebooted. The problem appears to be gone now.
Assignee
Comment 30•16 years ago
yea - I saw this issue while doing some troubleshooting, and I think it's definitely related to netapp-c or its network connection. I took traces and they have a lot of diag data now. It's a P1 case with them - hoping to have something back today.
Assignee: mrz → justin
Comment 31•16 years ago
fyi, also trying to find out if I can increase the iscsi timeout on ESX to help cope in the interim.
Comment 32•16 years ago
Updating summary
Summary: 4 machines start burning at same time on 1.8 waterfall page → Connection problems to netapp-c causing corruption of tinderbox file systems
Assignee
Comment 33•16 years ago
netapp pointed us at a bug in 7.2.2 - upgraded to 7.2.4. Fixed. Bug details:
Bug ID: 226424
Title: WAFL mismanagement of network buffers results in poor filer performance.
Severity: 2 - System barely usable
Bug Status: Fixed
Product: Data ONTAP
Bug Type: WAFL
Description: A filer may exhibit poor performance due to WAFL holding on to too many network buffers and not releasing them in a timely fashion. This situation can result in the unavailability of these buffers at the network interface, causing the filer to drop packets. Clients must then retransmit the packets until they are accepted, causing slower access to the filer.
Fixed-In Version:
* Data ONTAP 7.2.4 (GD) - Fixed
* Data ONTAP 7.2.4L1 (GD) - Fixed
* Data ONTAP 7.3RC1 (RC) - Fixed
Related Bugs: 145410, 225471
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 34•16 years ago
Sorry to say that we're still having problems with the tinderboxes. For example tbnewref-win32-tbox on netapp-c-fcal1/bm-vmware06, which started burning at 2008/05/23 21:28 and today at 13:46. Both times were CVS conflicts when there were no checkins to cause them. After the first one I ran chkdsk on all the disks, updated vmware tools, and the next build was a clobber for the nightly; so either chkdsk sucks (plausible :-)) or there is repeated corruption occurring.

Also had a kernel panic on fx-linux-tbox (netapp-c-fcal1) sometime after 7:45 today. It got a fsck on each disk and a reboot, and still managed to burn around 11:30 after no checkins. Also had panics on l10n-linux-tbox (netapp-d-002) - will try to get a screen grab of the trace if it happens again (or we could add a serial port to the VM and do console logging).

On Mozilla1.8, the balsa-18branch (netapp-c-001) and production-prometheus-vm (netapp-c-fcal1) machines are troubled.

Could you please take a look at the netapp logs to confirm the connections from the VM hosts are still being lost. Hopefully there will be something obvious there. If not, would we have to wait until Tuesday to get hold of Netapp and/or VMware? We might be able to live with that given that most of the trees are effectively closed at the moment, but a potential Firefox 3.0 RC2 looms early next week.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 35•16 years ago
seeing the reconnects only on netapp-c now, not on -d. Talking with netapp...
Comment 36•16 years ago
Another angle of attack - we started having these problems on Tuesday:
2008/05/20 17:52 PDT - balsa-18branch - false compile fail - http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211331120.1211331943.21904.gz
2008/05/21 02:17 PDT - crazyhorse - http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1211361420.1211362134.7330.gz
Were there any infra changes around then?
Reporter
Comment 37•16 years ago
(In reply to comment #34)
> On the Mozilla1.8, the balsa-18branch (netapp-c-001) and
> production-prometheus-vm machines (netapp-c-fcal1) are troubled.
From looking at the logs, both of these machines are failing out just like they were earlier this week. For now, I'm going to try the same cleanup/restart as I did Wednesday night. But it looks like the upgrade didn't fix the problem.
> Could you please take a look at the netapp logs to confirm the connections from
> the VM hosts are still being lost. Hopefully there will be something obvious
> there. If not, would we have to wait until Tuesday to get hold of Netapp and/or
> VMware ? We might be able to live with that given that most of the trees are
> effectively closed at the moment but a potential Firefox 3.0 RC2 looms early
> next week.
Note: People are trying to land the handful of approved patches, in advance of possible FF3.0rc2 builds on Tuesday morning. What else can we do to raise the priority of this problem with netapp/vmware?
Reporter
Comment 38•16 years ago
On each of balsa-18branch, production-prometheus-vm, and production-pacifica-vm, I've done the same steps as in comment#18:
- stop multi tinderbox
- renamed the existing local cvs mozilla tree /builds/tinderbox/.../mozilla to /builds/tinderbox/.../mozilla.bad
- start multi tinderbox
After restart, balsa-18branch and production-pacifica-vm are taking longer to fail out, which is I guess progress. After restart, production-prometheus-vm failed out, same as before. Found that the file /usr/include/bits/sigaction.h was completely corrupted, causing syntax errors whenever it was included in mozilla code. Replacing with the undamaged file from staging-prometheus-vm, and looking for other corruptions.
Reporter
Comment 39•16 years ago
balsa-18branch and production-pacifica-vm are both now green. :-) On production-prometheus-vm, in /usr/include/bits, I renamed sigaction.h to sigaction.h.bad, and then copied over sigaction.h from staging-prometheus-vm. I then diff'd the files and saw they were both *fine*, and identical. The original file was no longer corrupted?!?! huh?! Let's see how the next build goes.
Reporter
Comment 40•16 years ago
production-prometheus-vm now green.
Reporter
Comment 41•16 years ago
mrz just pinged me on irc to say that fx-win32-tbox was offline. Found no tinderbox running, started it, but watched the build fail out with corrupt cvs files. Same problem again. Cleaned up local files, and restarted tinderbox.
Assignee
Comment 42•16 years ago
(In reply to comment #37)
> What else can we do to raise the priority of this problem with netapp/vmware?
Nothing - we have P1 cases with both vendors, working through the weekend. We'll update here when we have something.
Reporter
Comment 43•16 years ago
fx-win32-tbox still building, but so far so good. balsa-18branch and production-prometheus-vm are burning again. After a couple of cycles of running green, they hit the same vmware/netapp problem, and are failing out just like before. :-(
Comment 44•16 years ago
Added console logging to a file on fx-linux-tbox after this box went read-only. For some reason the messages from ext3-FS don't get into /var/log/messages and only appear in the console available through the VI client. Steps:
1. in /boot/grub/grub.conf, append "console=ttyS0,115200n8 console=tty0" to the kernel line for 2.6.18-53.1.13.el5
2. Shut down the VM
3. Edit Settings > Add > Serial Port > To File > [netapp-c-fcal1] fx-linux-tbox/console.log (would vastly prefer /tmp on the VM host but couldn't figure out how to refer to that)
4. Restart the VM
The log is at /vmfs/netapp-c-fcal1/fx-linux-tbox/console.log. There are no timestamps; looks like it's a kernel compile option :-(.
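For reference, step 1 ends up producing a grub entry along these lines. Only the two console= arguments come from the steps above; the title, root device, and other kernel arguments here are illustrative, not copied from the box:

```
# /boot/grub/grub.conf (sketch; only the console= arguments are from step 1)
title CentOS (2.6.18-53.1.13.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-53.1.13.el5 ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200n8 console=tty0
        initrd /initrd-2.6.18-53.1.13.el5.img
```

With both console= arguments present, kernel messages go to the serial port (captured by the VMware serial-port-to-file device) while still appearing on the normal virtual console.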
Summary: Connection problems to netapp-c causing corruption of tinderbox file systems → Connection problems to netapp's causing corruption of tinderbox file systems
Comment 45•16 years ago
Both fx-linux-tbox and fx-win32-tbox have been up and down all day, so I think the resets are getting worse. The windows box has taken to rebooting itself (or perhaps that's someone else). This is the tip of the iceberg, but these are two machines that we have to keep running. I've given them one last go with a reboot, fsck/chkdsk all the partitions, remove objdir, & start tinderbox, but I suspect they'll die/burn/kill-a-kitten after as few as 3 builds. Am I missing something profoundly obvious in resuscitating these machines ? Perhaps disk checks are dangerous when connectivity is intermittent, but I've not seen any of the scsi errors while actually using a linux console. Can we use the netapp/vmware logs to figure out if a particular box is causing this ? Or if one (or more) is broken and making it worse ? If not, it's time to close the Firefox tree.
Reporter
Comment 46•16 years ago
(In reply to comment #36)
> Another angle of attack - we started having these problems on Tuesday
[snip]
> Were there any infra changes around then ?
Really good question, Nick. I don't know of any changes. Justin? mrz?
Status: REOPENED → ASSIGNED
Assignee
Comment 47•16 years ago
nothing except the ontap upgrade. still working with netapp/vmware - support is less than stellar given the holiday. if there are a few critical boxes, they should be storage-vmotioned to netapp-d or -a. note this will only work for a few as there is not a lot of spare storage.
Reporter
Comment 48•16 years ago
(In reply to comment #45)
> Can we use the netapp/vmware logs to figure out if a particular box is causing
> this ? Or if one (or more) is broken and making it worse ? If not, it's time to
> close the Firefox tree.
On irc, Nick and I decided we might as well close the tree now. Machines start burning again as fast as we can repair them. Over the last few days, it's been kinda impossible to find a non-burning time to land anyway, so closing the tree seems academically honest. If things stabilize, we'll happily reopen.
Note: this problem is impacting machines on Mozilla1.8, Mozilla1.9 and Mozilla2. I just closed Mozilla1.9 now. I would have had to close Mozilla1.8 and Mozilla2 also, but they were already closed for other reasons.
Assignee
Comment 49•16 years ago
so, I think I have stabilized the situation for the moment. We failed over to the secondary head after a lot of diagnosis, so all the arrays are being served off the one head, which is fine as load is not an issue. We haven't seen the resets/reconnects for over an hour (they used to happen every 10 min), so we'll monitor through the night. The action plan for tomorrow is to take a core of the head that was having issues, and diagnose once we are sure the issue is isolated to the single head. In the meantime, we need normal production load thrown at the array, and I don't have any idea what state the machines are in. build - can you please bring up all the tinderboxen and monitor for crashes? please list out which machines crash, if any do.
Assignee
Comment 50•16 years ago
still not a single error - need build to verify the VMs are up and running with no issues. please verify asap.
Assignee: justin → joduinn
Status: ASSIGNED → NEW
Reporter
Comment 51•16 years ago
I'm now home, and will start looking through them all. If I can get them all up and running ok, I'll reopen the tree to see how it goes. Stay tuned...
Reporter
Comment 52•16 years ago
balsa-18branch
production-prometheus-vm
production-pacifica-vm
fx-linux-tbox
...were all burning because of the same disk corruption errors from above. I've now cleaned up the filesystems for these 4 VMs, and restarted the slaves. Let's see how these builds go.
Reporter
Comment 53•16 years ago
These VMs are all now green, so I've reopened the tree. Let's see if the work in comment#49 holds up to the load. (John crosses his fingers)
Comment 54•16 years ago
Starting to bring up other VMs now.
Comment 56•16 years ago
Back in action:
l10n-win32-tbox (also updated vmware tools)
l10n-linux-tbox
tbnewref-win32-tbox
xr-win32-tbox
Checking disks now:
crazyhorse
karma
cerberus-vm
Assignee
Comment 57•16 years ago
netapp ran clean all night - can you guys confirm no crashes? Now, for the next step in the investigation, I need to fail back to netapp-c (no downtime), wait for the issue, and get a core so they can figure out what is going on - then we'll go back to a steady state on -d while they analyze. I know this is quite disruptive, so we can do this either today, when one of you has some time to make sure things come back, or on Tuesday - let me know what works best for your schedule and the build schedule. Sorry for the hassle here, but I think we are getting close to a resolution.
Reporter
Comment 58•16 years ago
VMs stayed up and stable overnight, so let's leave well enough alone for now. Thanks for all the help over the holiday weekend! Tomorrow (Tuesday) morning, we'll have a go/no-go decision for rc2. If we do not need rc2, we can move back to netapp-c immediately. If we do need rc2, we'll need to wait until after builds & updates are handed to QA before we can move back to netapp-c, so likely Thursday.
Reporter
Comment 59•16 years ago
Per Firefox3 meeting just now, this is on hold until after we produce FF3.0rc2 builds.
Priority: P1 → P3
Comment 60•16 years ago
I stumbled on this while looking for an update on FF3... My org went through a string of very similar issues last September on ESX 3.0.1/3.0.2 connecting to a NetApp. It affected us using both iSCSI and NFS, from the ESX kernel and from within the VMs. The problem with 3.0.2 was MUCH worse than with 3.0.1... we've rolled back to 3.0.1. What version were you at pre-3.5? I'm 99% sure this is going to turn out to be a problem with VMware and/or other hardware, not the NetApp (we had the same issue with a CoRAID device and with a NetApp at the same time). The May 8th ESX upgrade & May 9th error messages are most likely not a coincidence.

Be careful putting your investigations on hold for a long period of time with the VMs running. Disk errors at the VM level _can_ end up corrupting the VM's OS - sometimes to the point where the OS can't boot anymore (particularly if you're dropping iSCSI connections altogether).

A few things to look for:
- Timing: clock drift, NTP problems, kerberos problems (can impact iSCSI connections), incorrect time/date etc. on VMs and on ESX servers.
- Network:
  . any recent changes? upgrades? re-configurations? firewall policies?
  . switch/router health/load
  . MTU settings/capabilities
  . duplex settings
  . cables (don't overlook this)
  . trace routes from VMs to NetApp, from ESX to NetApp, etc. Anything unusual?
- Web servers: incomplete/truncated downloads (particularly on larger files or when the network is under heavy load).
- DB servers: check for anything unusual in the DB logs.
- ESX servers:
  . try copying a large file from one of the iSCSI mount points to another location a few times (i.e., a large .vmdk file). Look for messages like "cp: reading `vm_name-flat.vmdk': Input/output error"
  . how and where are you mounting your iSCSI connections? With VMware's software iSCSI initiator? hardware initiator? Within VMs?
- Linux VMs:
  . filesystems unmounting and re-mounting as read-only?
  . files disappearing/re-appearing/changing sizes sporadically, particularly on network-based file systems (NFS in our case).
- Windows VMs (Event Log Viewer):
  . symmpi:15: "The device, \Device\Scsi\symmpi1, is not ready for access yet."
  . Disk:11: "The driver detected a controller error on \Device\Harddisk0."
  . Application Popup:333: "An I/O operation initiated by the Registry failed unrecoverably. The Registry could not read in, or write out, or flush, one of the files that contain the system's image of the Registry."
  . iScsiPrt:7: "The initiator could not send an iSCSI PDU. Error status is given in the dump data." (if you are making iSCSI connections from within a VM)
  . iScsiPrt:20: "Connection to the target was lost. The initiator will attempt to retry the connection."

Our attempted solutions:
- Rollback to 3.0.1
- Test/replace network cables
- Upgrade network equipment (jumbo frame support, dedicated gigabit ports)
- Completely isolate storage-related networks on their own dedicated switches (we're still in the process of doing this)
- Switch to fiber (ideally) or NFS from iSCSI where possible.

We're not really sure why the problem went away. We fixed as many things as we could identify as possible causes. The biggest impact we could observe was when we rolled back to ESX 3.0.1.

Hope this helps & sorry to read this. I'm curious to see how things work out for you... Hopefully it won't be as painful an ordeal for you as it was for us.
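Several of the Windows and Linux error signatures in the checklist above are plain substrings, so a throwaway script can triage a batch of exported logs for them. A hypothetical sketch (the signature list, sample lines, and helper name are illustrative, not from the comment):

```python
# Signatures taken from the checklist above; each is matched as a substring.
SIGNATURES = [
    "The driver detected a controller error on \\Device\\Harddisk0",
    "is not ready for access yet",
    "Input/output error",
    "Connection to the target was lost",
]

def flag_suspect_lines(lines):
    """Return (line_number, matched_signature) pairs for lines worth a look."""
    hits = []
    for n, line in enumerate(lines, start=1):
        for sig in SIGNATURES:
            if sig in line:
                hits.append((n, sig))
                break  # one hit per line is enough for triage
    return hits

sample = [
    "The device, \\Device\\Scsi\\symmpi1, is not ready for access yet.",
    "build completed ok",
    "cp: reading `vm-flat.vmdk': Input/output error",
]
print(flag_suspect_lines(sample))
# [(1, 'is not ready for access yet'), (3, 'Input/output error')]
```

Correlating the hit timestamps across several VMs is what distinguishes a shared-storage problem (all hosts at once) from a single sick guest.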
Reporter
Comment 61•16 years ago
(In reply to comment #59)
> Per Firefox3 meeting just now, this is on hold until after we produce FF3.0rc2
> builds.
ok, builds and updates are now generated. Justin, sorry today's release took longer than expected; it's now ok to try those NetApp diagnostic tests.
Assignee: joduinn → justin
Reporter
Comment 62•16 years ago
(In reply to comment #60)
hi Scott;
> I stumbled on this while looking for an update on FF3... My org went through a
> string of very similar issues...
[snip]
> Be careful putting your investigations on hold for a long period of time with
> the VM's running. Disk errors at the VM level _can_ end up corrupting the VM's
> OS--sometimes to the point where the OS can't boot anymore (particularly if
> you're dropping iSCSI connections altogether).
Once we got things stable over the weekend, we decided to hold off investigations until this evening, because it was being so disruptive to VMs on tinderbox... hence disruptive to developers trying to land the last few patches for FF3.0rc2. Now that FF3.0rc2 is fully built and handed off to QA, we're back looking at this again. It's quite a worry.
> A few things to look for:
[snip]
> - Linux VMs:
> . filesystems unmounting and re-mounting as read-only?
> . files disappearing/re-appearing/changing sizes sporadically, particularly
> on network-based file systems (NFS in our case).
We have been hit by this for a while now. The VMware response was to have us do kernel updates. See bug#407796 for details. That might or might not have solved the intermittent problem; hard to tell with the recent mischief from NetApp.
[snip]
> Hope this helps & sorry to read this. I'm curious to see how things work out
> for you... Hopefully it won't be as painful an ordeal for you as it was for us.
Your comments were very helpful, thanks for all the info. Sounds like an ugly experience; I also hope it's not as painful for us!!
tc
John.
Priority: P3 → P1
Comment 63•16 years ago
Justin, what's the current status on this? Is netapp-c in normal mode or
failed over to -d at the moment? We're seeing VMs with problems over the last
day or so, e.g. bug 407796 comment #86 and #87, but also some compile errors
which look like corruption.
Assignee | ||
Comment 64•16 years ago
Escalated this more and got more information. Netapp believes it's an issue
with the LUN. The current plan of action is to storage-vmotion all VMs off the
LUN (without downtime) to another array we have, rebuild the LUN, then move
the VMs back. Storage vmotion takes some time, so this is a slow process. If
there are any VMs that could stand an hour of downtime, please let us know so
we can move them faster.

Given the issue is with the LUN itself, netapp seems confident failing over to
netapp-d won't solve the issues (as we still saw them in the failed-over
state). I'll ping you guys today in IRC to go over the plan more.
Reporter | ||
Comment 65•16 years ago
Approx 10.05 this morning, staging-pacifica-vm02 hit:

remoteFailed: [Failure instance: Traceback from remote host -- Traceback
(most recent call last): Failure: buildbot.slave.commands.TimeoutError:
SIGKILL failed to kill process

...which looks like it lost connection to staging-1.8-master. The next few
attempts to build failed out with different errors. By approx 10.30 this
morning, staging-pacifica-vm02 was back running normally again.

Are either of these VMs on the suspect LUN?
Comment 66•16 years ago
Yes. Any VM with a datastore name containing "netapp" in it is.
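That datastore-name rule of thumb can be turned into a quick filter. A sketch
under the assumption that something like the ESX 3.x service-console command
`vmware-cmd -l` supplies one .vmx path per line; the function name is invented:

```shell
# vms_on_netapp: read .vmx paths on stdin (e.g. from "vmware-cmd -l") and
# print the names of VMs whose datastore path mentions "netapp".
vms_on_netapp() {
  grep -i 'netapp' | sed 's|.*/\([^/]*\)\.vmx$|\1|'
}
# Example: vmware-cmd -l | vms_on_netapp
```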
Reporter | ||
Comment 67•16 years ago
Discovered qm-centos5-moz2-01 down just now, with a Linux SCSI kernel panic.
Unclear how long it's been down. Justin restarted the VM.
Reporter | ||
Comment 68•16 years ago
qm-win2k3-moz2-01 was broken, first with:

remoteFailed: [Failure instance: Traceback from remote host -- Traceback
(most recent call last): Failure: buildbot.slave.commands.TimeoutError:
SIGKILL failed to kill process

...and then with file permission errors trying to delete files, causing all
builds to fail. I've restarted the VM.
Assignee | ||
Comment 69•16 years ago
qm-win2k3-moz2-01 = netapp-c-fcal1, expected. Thanks John.
Looks like prod-pacifica-vm is hitting this again?
Reporter | ||
Comment 74•16 years ago
prod-pacifica-vm, balsa-18branch and crazyhorse all hit by continued NetApp woes. I'll start to repair and restart these VMs right now, but unclear how long they will remain up before being hit by this problem again.
Assignee | ||
Comment 75•16 years ago
(In reply to comment #74)
> prod-pacifica-vm, balsa-18branch and crazyhorse all hit by continued NetApp
> woes.
>
> I'll start to repair and restart these VMs right now, but unclear how long
> they will remain up before being hit by this problem again.

We are blocked on making progress on the core netapp issue as we can't move
the last two machines on the shelf per build. On hold till Monday.
Reporter | ||
Comment 76•16 years ago
prod-pacifica-vm and balsa-18branch are now repaired and restarted. They're
past where they were failing before, and I hope they are now ok. I'll keep
watching over the next few hours.

crazyhorse is still going through fsck problems with corrupted superblocks.
Reporter | ||
Comment 77•16 years ago
(In reply to comment #75)
> (In reply to comment #74)
> > prod-pacifica-vm, balsa-18branch and crazyhorse all hit by continued
> > NetApp woes.
> >
> > I'll start to repair and restart these VMs right now, but unclear how long
> > they will remain up before being hit by this problem again.
>
> We are blocked on making progress on the core netapp issue as we can't move
> the last two machines on the shelf per build. On hold till Monday.

Errr... I thought the only VMs not being moved were the buildbot masters
controlling all the talos and unittest machines.

What happened to the other VMs that were in transition Friday - are they now
moved? And what shelf are these failing slaves (crazyhorse, prod-pacifica-vm
and balsa-18branch) on - and can these be moved?
Reporter | ||
Comment 78•16 years ago
balsa-18branch failed out again. It made it further than comment #74, but
failed out with similar errors. While repairing again, I hit this weirdness:

$ pwd
/builds/tinderbox/Fx-Mozilla1.8-gcc3.4/Linux_2.4.7-10_Depend
$ mv mozilla mozilla.bad.002
$ mkdir mozilla
mkdir: cannot create directory `mozilla': File exists
$ ls -la
...
drwxrwxrwx 11 cltbld users 4096 Jun  7 19:38 mozilla
drwxrwxrwx 53 cltbld users 4096 Jun  7 18:46 mozilla.bad.001
drwxr-xr-x 58 cltbld users 4096 Jun  7 19:38 mozilla.bad.002
...

After 30+ seconds the "mozilla" directory disappeared?!?! Since when are "mv"
operations not atomic?

I've now re-rebooted balsa-18branch and re-restarted tinderbox on it. Fingers
crossed.
Reporter | ||
Comment 79•16 years ago
crazyhorse is now up and building. It took a few fsck-with-repair and reboot
cycles by Trevor, and also two fsck-with-repair and reboot cycles by me. Build
started; let's see how it goes. Fingers crossed.
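For corrupted-superblock fscks like crazyhorse's, ext2/ext3 keeps backup
superblocks that e2fsck can fall back to. A sketch that only prints the
commands to run (device path is hypothetical; never run e2fsck against a
mounted filesystem):

```shell
# repair_with_backup_sb: print the usual recovery sequence for an ext2/ext3
# filesystem whose primary superblock is corrupt. Echoes the commands rather
# than executing them, so nothing on disk is touched.
repair_with_backup_sb() {
  dev="$1"
  echo "mke2fs -n $dev           # dry run: list backup superblock locations"
  echo "e2fsck -b 32768 -y $dev  # repair using a backup superblock (4k-block fs)"
}
```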
Assignee | ||
Comment 80•16 years ago
> Errr... I thought the only VMs not being moved were the buildbot masters
> controlling all the talos and unittest machines.
>
> What happened to the other VMs that were in transition Friday - are they now
> moved? And what shelf are these failing slaves (crazyhorse, prod-pacifica-vm
> and balsa-18branch) on - and can these be moved?
They were all done Saturday morning - just waiting on those last two to re-build the LUN.
Reporter | ||
Comment 81•16 years ago
production-prometheus-vm.build.m.o started failing out at 00:07 with file
access problems and then cvs refresh problems, one of the common symptoms
we've seen recently with this NetApp issue.

Without any action from me, production-prometheus-vm is now passing the point
where it would usually fail. It looks like it might have self-recovered
somehow, so I'm not going to kill/restart it.
Comment 82•16 years ago
balsa-18branch, crazyhorse, production-prometheus-vm, and prod-pacifica-vm are all red right now. Three of those build nightlies.
Reporter | ||
Comment 83•16 years ago
(In reply to comment #80)
> > Errr... I thought the only VMs not being moved were the buildbot masters
> > controlling all the talos and unittest machines.
> >
> > What happened to the other VMs that were in transition Friday - are they
> > now moved?
>
> They were all done Saturday morning - just waiting on those last two to
> re-build the LUN.

Great.

> > And what shelf are these failing slaves (crazyhorse, prod-pacifica-vm
> > and balsa-18branch) on - and can these be moved?

These failed again. Question remains... can these be moved also, or do we have
to leave them where they are until after the LUN repair?
Comment 84•16 years ago
(In reply to comment #83)
> These failed again. Question remains... can these be moved also, or do we
> have to leave them where they are until after the LUN repair?

If that question is for me, my question back would be: how long of a downtime
are we talking about to move them? If it's less than six or so hours, move
them now so we can get nightlies tomorrow and hopefully give a go to release.
Reporter | ||
Comment 86•16 years ago
(In reply to comment #85)
> *** Bug 437893 has been marked as a duplicate of this bug. ***

Loss of qm-rhel02 in bug#437893 caused Talos machines to go offline, and
closed the moz2 and mozilla1.9 branches.

(In reply to comment #84)
> If that question is for me, my question back would be: how long of a
> downtime are we talking about to move them? If it's less than six or so
> hours, move them now so we can get nightlies tomorrow and hopefully give a
> go to release.

Question was for Justin. I've since talked to Justin on the phone. Once he's
finished moving the masters (above), he'll start moving these slaves also.
Because of the sick LUN, there are limits on how many VMs he can migrate at a
time. Justin will update this bug with info as the VMs are migrated.
Reporter | ||
Comment 87•16 years ago
Specifically qm-rhel02 and production-master are both now down, so they can be moved to healthy NetApp.
Assignee | ||
Comment 88•16 years ago
Both of these have been moved - rebuilding the shelf now. After this, one more
to go.
Comment 89•16 years ago
Fixed up balsa-18branch this morning; crazyhorse has a fsck in progress.

Attempted to move production-pacifica-vm to d-sata-build-003, but it failed
with "The virtual disk is either corrupted or not a supported format." Ran
chkdsk and got the same error again after getting to a few tens of percent. Is
this really a network error during the copy?

Also attempted to "Complete migration" on production-prometheus-vm; it failed
out with similar messages to Administrator's attempts (=justin?).
Reporter | ||
Comment 90•16 years ago
We just powered off all VMs on netapp-c-fcal, see list below. The trees are
closed anyway because of failing VMs, so this seems worth trying. In theory
this should make it faster to move them off this LUN.

bm-symbolfetch01
bm-win2k3-pgo01
build-console
fx-linux-tbox
fx-win32-tbox
l10n-win32-tbox
mobile-linux-slave1
moz2-master
moz2-win32-slave1
production-pacifica-vm
production-prometheus-vm
production-trunk-automation
qm-centos5-moz2-01
qm-moz2-unittest01
qm-win2k3-moz2-01
qm-win2k3-stage-pgo01
staging-build-console
staging-pacifica-vm
staging-trunk-automation
tbnewref-win32-tbox
try2-linux-slave
try2-win32-slave
Reporter | ||
Comment 91•16 years ago
Justin and mrz think they've nailed it (as of 11pm-ish). A few remaining VMs
still need to be moved from netapp-c-fcal1:

l10n-win32-tbox
moz2-win32-slave1
qm-win2k3-stage-pgo01
staging-pacifica-vm
staging-1.9-master
tbnewref-win32-tbox
try2-linux-slave
try2-win32-slave

There are known issues with karma and prometheus that need to be fixed (from
irc with justin). The production-1.9-master and fx-linux-1.9-slave2 VMs are
both confirmed ok, and now in use.
Reporter | ||
Comment 92•16 years ago
We'll now start bringing back up all the various build/unittest/talos machines, repairing corrupted file-systems as needed. Depending on how many VMs need how much repair, we expect to start reopening the 1.8/1.9/moz2 trees sometime Tuesday. Watch this space.
Comment 93•16 years ago
(In reply to comment #91)
> Justin and mrz think they've nailed it (as of 11pm-ish). A few remaining VMs
> still need to be moved from netapp-c-fcal1:
> l10n-win32-tbox
> moz2-win32-slave1
> qm-win2k3-stage-pgo01
> staging-pacifica-vm
> staging-1.9-master
> tbnewref-win32-tbox
> try2-linux-slave
> try2-win32-slave

Looks like these have all been moved. l10n-win32-tbox is still shut down - can
we start it again?
Comment 94•16 years ago
tb-linux-tbox is in a kernel panic right now. It's still at least partly on netapp-c-001 so I'll wait for an OK before restarting it.
Comment 95•16 years ago
The only remaining misconfigured LUN is netapp-c-fcal1; netapp-c-001 should be
fine.

Oh, and it's only on netapp-c-001 in the sense that its CD-ROM drive is some
ISO image off that datastore, which I can't seem to change while the host is
running (or perhaps because vmware tools isn't yet running).
Reporter | ||
Comment 96•16 years ago
At this point, all machines/VMs are back up, and staying up, *except*:

1.9 and moz2
============
qm-rhel02 (runs talos, unittest)

1.8
===
crazyhorse
karma
production-prometheus-vm

We're making progress, but will continue to keep all these trees closed for
now. As we get VMs repaired, we'll update this bug.
We seem to have sufficient box coverage on the Mozilla2 tinderbox now. Do we need to keep the tree closed?
Comment 98•16 years ago
It is of a lesser priority since it is a staging machine, but I'm still missing qm-buildbot01.
Reporter | ||
Comment 99•16 years ago
(In reply to comment #96)
> At this point, all machines/VMs are back up, and staying up, *except*:
>
> 1.9 and moz2
> ============
> qm-rhel02 (runs talos, unittest)

...has been working without going read-only since mid-morning. Looks like the
recurring read-only-ness might be resolved also by last night's fix?

We've let a few unittest and talos runs go through to verify that all are
connected and working again. It all looks good, so I've now reopened the tree
for mozilla-central. Work continues on repairing the VMs needed for 1.8.
Comment 100•16 years ago
(In reply to comment #98)
> It is of a lesser priority since it is a staging machine, but I'm still
> missing qm-buildbot01.

Totally left off the radar - it's on a classic QA ESX server. Might as well
take the opportunity to move the VM over to Build land.
Comment 101•16 years ago
I've disabled qm-centos5-01 from the tinderbox waterfall (Firefox) because it's misbehaving. I was unable to resurrect it after the event this morning.
Comment 102•16 years ago
(In reply to comment #100)
> Totally left off the radar - it's on a classic QA ESX server. Might as well
> take the opportunity to move the VM over to Build land.

It lives on bm-vmware02 now and it's up.
Reporter | ||
Comment 103•16 years ago
(In reply to comment #96)
> 1.8
> ===
> crazyhorse

tracked in bug#437798

> karma
> production-prometheus-vm

tracked in bug#438386

(In reply to comment #101)
> I've disabled qm-centos5-01 from the tinderbox waterfall (Firefox) because
> it's misbehaving. I was unable to resurrect it after the event this morning.

Is qm-centos5-01 still a problem? Can it be repaired or do we need to create a
new clone?
Comment 104•16 years ago
(In reply to comment #103)
> Is qm-centos5-01 still a problem? Can it be repaired or do we need to create
> a new clone?

Based on my efforts with it yesterday, I think it will need to be rebuilt.
Maybe mrz can resurrect it where I have failed, though.
Comment 105•16 years ago
qm-vista02 on qm-vmware01 is dead in the water. Unable to power it on or access it via the VMWare console or RDP.
Comment 106•16 years ago
Does it say (inaccessible) after it in the Virtual Infrastructure client? Try
to remove it from inventory, rename the directory the VM files are in, and
manually re-add it to the inventory. If it won't re-add, see if you can
recover or move the VM's folder to another location and re-add it from a
different ESX server/cluster. You're probably in better shape if it won't even
start up than you are if the VM starts with BIOS or OS errors about the disk.
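On ESX 3.x the remove/re-add sequence above can also be done from the service
console. A sketch that echoes the `vmware-cmd` invocations instead of running
them; the config path placeholder is hypothetical:

```shell
# readd_vm: print the unregister/register sequence for a VM config file,
# mirroring the "remove from inventory, re-add" advice above. Nothing is
# actually executed against the ESX host.
readd_vm() {
  cfg="$1"   # e.g. /vmfs/volumes/<datastore>/<vm>/<vm>.vmx
  echo "vmware-cmd -s unregister $cfg   # remove from inventory"
  echo "vmware-cmd -s register $cfg     # re-add, possibly from another ESX host"
}
```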
Comment 107•16 years ago
https://bugzilla.mozilla.org/show_bug.cgi?id=438664 - looking to re-image
qm-centos5-02 as it is being erratic and not building well at all.
Comment 108•16 years ago
(In reply to comment #105)
> qm-vista02 on qm-vmware01 is dead in the water. Unable to power it on or
> access it via the VMWare console or RDP.

Thought that was in another bug - it should be up.
Assignee | ||
Updated•16 years ago
Comment 109•16 years ago
karma is up and running again
Reporter | ||
Comment 110•16 years ago
From bugs and irc, here's the list of VMs reported as problems and being
repaired today. Most are now fixed, and confirmed working ok.

[ ] crazyhorse
[+] karma
[+] production-prometheus-vm
[+] qm-buildbot01
[+] qm-vista02
[+] qm-centos5-01
[ ] qm-centos5-02

We just have qm-centos5-02 and crazyhorse being worked on.
Reporter | ||
Comment 111•16 years ago
crazyhorse is working.

[+] crazyhorse
[+] karma
[+] production-prometheus-vm
[+] qm-buildbot01
[+] qm-vista02
[+] qm-centos5-01
[ ] qm-centos5-02

qm-centos5-02 is still being worked on; it looks like an application problem,
but still investigating.
Reporter | ||
Comment 112•16 years ago
qm-centos5-02 is working.

[+] crazyhorse
[+] karma
[+] production-prometheus-vm
[+] qm-buildbot01
[+] qm-vista02
[+] qm-centos5-01
[+] qm-centos5-02

Closing!
Status: NEW → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Comment 113•16 years ago
For the past ~24 hours, there seem to have been issues with qm-win2k3-pgo01
(bug 440531) and qm-xserve01 (bug 440536). I just filed those two bugs for
these issues, but I'm mentioning it here as well, per this message on the
Tinderbox page:

> ... but if you see weirdness, not caused by checkins
> please add details to bug 435134 .
Assignee | ||
Comment 114•16 years ago
xserve01 is a physical machine, so I doubt that one is related...
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard