Closed
Bug 407796
(read-only)
Opened 17 years ago
Closed 15 years ago
update linux VM kernels to same new stable version
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: joduinn)
References
Details
It refuses to mount rw:

[root@production-trunk-automation ~]# pwd
/root
[root@production-trunk-automation ~]# mount -o remount,rw /
mount: block device /dev/sda1 is write-protected, mounting read-only
[root@production-trunk-automation ~]# touch foo
touch: cannot touch `foo': Read-only file system
[root@production-trunk-automation ~]#

This is blocking Firefox 3.0 Beta 2.
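An editorial aside, not part of the original report: the symptom above can be detected mechanically by scanning mount options. A minimal sketch (the check_ro helper is hypothetical), exercised against sample /proc/mounts-format lines so it runs anywhere:

```shell
# Hypothetical helper: classify a /proc/mounts-format line as read-only or
# read-write by inspecting its options field (4th column).
check_ro() {
  opts=$(printf '%s\n' "$1" | awk '{print $4}')
  case ",$opts," in
    *,ro,*) echo "read-only" ;;
    *)      echo "read-write" ;;
  esac
}

# Sample lines mirroring the failure above: / remounted ro, /builds still rw.
check_ro "/dev/sda1 / ext3 ro,data=ordered 0 0"
check_ro "/dev/sda2 /builds ext3 rw,data=ordered 0 0"
```

On a live box the same check would read /proc/mounts directly instead of sample strings.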
Comment 1•17 years ago
From dmesg after reboot:

EXT3-fs error (device sda1): ext3_free_blocks_sb: bit already cleared for block 1963514
Aborting journal on device sda1.
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda1) in ext3_truncate: Journal has aborted
EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda1) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
ext3_abort called.
EXT3-fs error (device sda1): ext3_remount: Abort forced by user
ext3_abort called.
EXT3-fs error (device sda1): ext3_remount: Abort forced by user
Comment 2•17 years ago
There is nothing named production-trunk-automation in VI, so I can't troubleshoot this - what's it called in VI?
Assignee
Comment 3•17 years ago
justin and I confirmed it's production-trunk-master on bm-vmware07, before justin got dragged off to something else.
Assignee
Comment 4•17 years ago
we also confirmed that after it failed out the first time, we rebooted the vm and hit the same error again. Aravind is looking at it now, while Justin is in a meeting.
Updated•17 years ago
Assignee: server-ops → aravind
Comment 5•17 years ago
Box is up now. There are some other reports on lkml about that kernel causing problems like this, but no one seems to have confirmed the actual bug.
Comment 6•17 years ago
Folks tell me we can't upgrade kernels on these boxes willy-nilly, so we should probably figure out a good time to ensure that these boxes have the latest and greatest kernels for that centos/rhel release.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Reporter
Comment 7•17 years ago
This happened again this morning. Slightly different errors this time:

sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 17843020
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 5767279
Buffer I/O error on device sda1, logical block 720902
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 5767663
Buffer I/O error on device sda1, logical block 720950
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 17843036
Buffer I/O error on device sda2, logical block 5377
lost page write due to I/O error on sda2
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 18537636
Buffer I/O error on device sda2, logical block 92202
lost page write due to I/O error on sda2
Aborting journal on device sda2.
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 5767279
EXT3-fs error (device sda1): ext3_get_inode_loc: unable to read inode block - inode=720331, block=720902
Aborting journal on device sda1.
EXT3-fs error (device sda1) in ext3_reserve_inode_write: IO failure
EXT3-fs error (device sda1) in ext3_dirty_inode: IO failure
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
Severity: blocker → critical
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 8•17 years ago
Rebooted and it fscked again. It's OK now. Do these I/O errors indicate hardware problems on the netapp?
Comment 9•17 years ago
No - if they did, all the VMs on the netapp would be having issues, and in this case, none of them are. Aravind found links to this error which referenced issues with the specific kernel you are using, but John asked us to wait to upgrade until after beta.
Severity: critical → major
Comment 10•17 years ago
Please re-open this when we can get together and figure out how to go about upgrading kernels on these buildbot boxes.
Status: REOPENED → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → INCOMPLETE
Assignee
Comment 11•17 years ago
Reopening. We have shipped beta2, and we have a quiet period between releases, so now would be a good time to do this.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Comment 12•17 years ago
Is this going to be just a kernel update? If the nightly and release boxes are diverging (by updating the latter from the reference platform) we ought to consciously OK any tool-chain changes.
Comment 13•17 years ago
(In reply to comment #12)
> Is this going to be just a kernel update? If the nightly and release boxes are
> diverging (by updating the latter from the reference platform) we ought to
> consciously OK any tool-chain changes.

This bug is only about the consoles, right? (build-console, production-trunk-automation, and the respective staging boxes)
Comment 14•17 years ago
So it is; I was remembering the similar problems we've occasionally had with xr-linux-tbox and l10n-linux-tbox.
Comment 15•17 years ago
Yup, to begin with I just want a kernel upgrade. I would like to upgrade to the latest and greatest stable 2.6 kernel (if it works). Which boxes can I do this on? Or would you guys want to handle these upgrades yourselves?
Comment 16•17 years ago
Guys, any update on this?
Reporter
Comment 17•17 years ago
staging-trunk-automation needs this soon; it got mounted ro over the weekend again. If you're around tomorrow, that would be a good time. If not, let's do it Wednesday. It's OK to shut down/reboot this machine when you're ready; please give me a day/time, though. Other machines need this too, but maybe we should wait to see how this one goes before doing them?
Assignee
Comment 18•17 years ago
Sorry Aravind; I forgot to update this bug with the machine names after our chat-walking-to-lunch. The four buildmasters which should get the new kernel upgrade are:

staging-trunk-automation.build.mozilla.org
production-trunk-automation.build.mozilla.org
staging-build-console.build.mozilla.org
build-console.build.mozilla.org

How about starting with the two staging VMs first, and if all that goes ok, then we can do the other two VMs? Anytime that suits you is fine with us; just ping us on IRC before you start, so we can stop any work while you upgrade, ok?
Comment 19•17 years ago
I can do the upgrade sometime today, if that's okay with you guys. I agree with bhearsum that we should probably just do this one to see if it helps before we upgrade others. But if you guys want, I can upgrade the rest of them as well.
Assignee
Comment 20•17 years ago
staging-build-console is free; if you want to upgrade that now, go for it! Please hold off on upgrading staging-trunk-automation right now, as it's currently running a test which will take another 8 hours to complete. We'll update this bug as soon as it's done. After the two staging VMs are upgraded, how about we test them? If all tests look ok, we'll give the go-ahead to update the production VMs in a day or two. Does that sound ok to you?
Comment 21•17 years ago
staging-build-console upgraded from 2.6.9-42.ELsmp to 2.6.9-67.0.1.ELsmp.
Reporter
Comment 22•17 years ago
Looks like fx-linux-tbox had this problem recently, bug 410386. We should get it upgraded at some point, too.
Comment 23•17 years ago
(In reply to comment #22)
> Looks like fx-linux-tbox had this problem recently, bug 410386. We should get
> it upgraded at some point, too.

justdave tried a new kernel, and it had the same problem.
Comment 24•17 years ago
Given build owns the OS/build environment, I'd rather you guys take over the kernel updates (given we don't handle any security updates for these machines) to avoid delays and the back-and-forth of coordinating schedules. Also, I've been told having a consistent and reliable build environment is paramount, so I think build should handle any major changes such as kernel updates, rather than us changing things. Aravind - can you document the process for updating the kernels in this bug? Once done, we'll move the bug to the build queue. If the new kernel proves not to be the solution, server-ops should keep it until a solution is found, then hand it off to build for the mass changes.
Comment 25•17 years ago
(In reply to comment #24)
> Aravind - can you document the process for updating the kernels in this bug?
> Once done, we'll move the bug to the build queue. If the new kernel proves not
> to be the solution, server-ops should keep it until a solution is found, then
> hand it off to build for the mass changes.

Sure, all I did was run "yum update kernel kernel-smp". That will install the new kernels. If you are running vmware guest tools, then you'd need to console in and run vmware-config-tools.pl, which will re-do your interfaces for you. Let me know if you need more information.
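The procedure in this comment can be collected into a small script. This is a sketch under stated assumptions: the DRY_RUN guard and the run() helper are additions for illustration, while the yum and vmware-config-tools.pl steps are the ones named above; with DRY_RUN=1 nothing is actually executed.

```shell
DRY_RUN=1

# Print commands instead of executing them when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run yum update kernel kernel-smp   # installs the new kernels
run vmware-config-tools.pl         # re-does the guest interfaces after a kernel change
run shutdown -r now                # boot into the new kernel
```

Set DRY_RUN=0 (as root) to actually perform the steps.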
Comment 26•17 years ago
fx-linux-tbox has a newer kernel available (2.6.18-53.1.4.el5). I can upgrade this box whenever you guys are ready.
Reporter
Comment 27•17 years ago
I don't think it's too urgent to do this. Whenever the next downtime/tree closure is seems like a good time to take care of it.
Reporter
Comment 28•17 years ago
/var on staging-trunk-automation got mounted read-only after the kernel upgrade was done. I _think_ the kernel upgrade helped a bit, but it's hard to measure this.
Assignee
Comment 29•17 years ago
Given that this one has already recreated the failing read-only case, I'm not sure it's worth rolling out to other machines. Is there another/newer kernel patch that might be worth trying?
Comment 30•17 years ago
(In reply to comment #29)
> Given that this one has already recreated the failing read-only case, not sure
> it's worth rolling out to other machines.
>
> Is there another/newer kernel patch that might be worth trying?

Which box are you talking about? I upgraded staging-build-console.
Comment 31•17 years ago
For record keeping, tb-linux-tbox had /builds go ro today (bug 412412).
Assignee
Comment 32•17 years ago
(In reply to comment #28)
> /var on staging-trunk-automation got mounted read-only after the kernel upgrade
> was done. I _think_ the kernel upgrade helped a bit, but it's hard to measure
> this.

Bhearsum and I talked about this last night, and we cannot confirm whether the read-only problem was actually seen on staging-trunk-automation (kernel not upgraded) or on staging-build-console (kernel upgraded). Therefore, it's too early to rule out whether the new kernel fixed the problem. Let's watch staging-build-console to see if it happens again.
Assignee
Comment 33•17 years ago
(In reply to comment #31)
> For record keeping, tb-linux-tbox had /builds go ro today (bug 412412).

fyi: Aravind confirmed last night that tb-linux-tbox did not get this kernel update.
Comment 34•17 years ago
(In reply to comment #33)
> (In reply to comment #31)
> > For record keeping, tb-linux-tbox had /builds go ro today (bug 412412).
> fyi: Aravind confirmed last night that tb-linux-tbox did not get this kernel
> update.

Given the instructions are in the bug - can you upgrade and see if you still see the r/o issue? Let's track which machines have the new kernel here; so far staging-build-console is the only one, with no confirmation yet of whether the issue still exists.
Comment 35•17 years ago
Talked to John about this. I am moving this back to Build and Release, since at this point this is just a waiting game. B&R has the instructions on how to update the kernels. If they find that the new kernels don't fix the problems, please move it back to us and I will look into compiling or getting pre-built newer kernels.
Assignee: aravind → nobody
Status: REOPENED → NEW
Component: Server Operations → Build & Release
QA Contact: justin → build
Assignee
Updated•17 years ago
Assignee: nobody → joduinn
Assignee
Updated•17 years ago
Priority: -- → P2
Assignee
Comment 36•17 years ago
staging-build-console.build.mozilla.org was renamed to staging-1.8-master (already updated by Aravind).
staging-trunk-automation.build.mozilla.org was renamed to staging-1.9-master
production-trunk-automation.build.mozilla.org was renamed to production-1.9-master
build-console.build.mozilla.org was renamed to production-1.8-master

I'll start doing these 3 today.
Status: NEW → ASSIGNED
Reporter
Comment 37•17 years ago
I've updated staging-1.9-master to 2.6.18-53.1.6.el5.
Assignee
Comment 38•17 years ago
staging-trunk-automation.build.mozilla.org / staging-1.9-master is now updated to 2.6.18-53.1.4.el5, and rebooted. Note: we also had to do "yum install kernel-headers-2.6.18-53.1.4.el5". Thanks to bhearsum for helping with updating the kernel-source-headers, needed for rebuilding vmware packages, as part of this kernel update.
Reporter
Comment 39•17 years ago
(In reply to comment #38)
> Note: we also had to do "yum install kernel-headers-2.6.18-53.1.4.el5". Thanks
> to bhearsum for helping with updating the kernel-source-headers, needed for
> rebuilding vmware packages, as part of this kernel update.

Actually, I installed kernel-headers.i386 and kernel-devel.i686. kernel-devel seems to be the necessary one for vmware tools.
Assignee
Comment 40•17 years ago
To clarify:

joduinn installed kernel-headers-2.6.18-53.1.4.el5
bhearsum installed kernel-headers.i386 and kernel-devel.i686

No longer sure which one was "needed", but wanted to clarify what exactly was done.
Assignee
Comment 41•16 years ago
For record keeping, fx-linux-tbox had /var turn "ro" today; see details in bug#18471.
Assignee
Comment 42•16 years ago
As "su", I did the following:

# yum update kernel
# yum update kernel-headers
# yum update kernel-devel
# shutdown -r now

...updating each of the following:

staging-1.9-master (centos5) now at 2.6.18-53.1.13.el5
production-1.9-master (centos5) now at 2.6.18-53.1.13.el5
staging-1.8-master (centos4) now at 2.6.9-67.0.1.EL
production-1.8-master (centos4) now at 2.6.9-67.0.1.EL

...and confirmed each buildbot master seems to be up and pinging to slaves ok.
Assignee
Comment 43•16 years ago
Updated the following:

fx-linux-1.9-slave1 (centos5) now at 2.6.18-53.1.13.el5
fx-linux-1.9-slave2 (centos5) now at 2.6.18-53.1.13.el5

...and confirmed each buildbot slave is up and pinging the master ok.
Assignee
Comment 44•16 years ago
As "su", I did the following:

# up2date -f kernel
# shutdown -r now

For this older version of RedHat, there were no "kernel-headers" or "kernel-devel". I updated the following VMs:

staging-prometheus-vm (RedHat Ent3update8) now at 2.4.21-27.0.4.EL
production-prometheus-vm (RedHat Ent3update8) now at 2.4.21-27.0.4.EL

...and confirmed each buildbot slave is up and pinging the master ok.
Assignee
Comment 45•16 years ago
As "su", I did the following:

# yum update kernel
# yum update kernel-headers
# yum update kernel-devel
# shutdown -r now

...updating each of the following:

fx-linux-tbox (centos5) now at 2.6.18-53.1.13.el5
fxdbug-linux-tbox (centos5) now at 2.6.18-53.1.13.el5

...and confirmed tinderbox processes are up and running ok.

Note: during the fxdbug-linux-tbox update, it initially refused to reboot. After help from justdave, we discovered two things:

1) The instructions for this kernel update missed needing to install vmware tools on the VMs. I will have to go back to each upgraded machine and confirm this is in place.

2) The setup of fxdbug-linux-tbox is different from all the other machines in how rc.local was set up. This caused problems when trying to reboot after the kernel update. It's unclear why this machine is different from all the others, and it's unclear what the "right" setting is. See details in bug#420007.
Comment 46•16 years ago
1) Yes they did. The only ones I see in this bug do, anyway. See comment 25.

2) Do the others have that rc.local hack already? If not, there's nothing any different. The rc.local hack wouldn't work in your case anyway because the new kernel was built with a different gcc than the one that's on the system, so vmware-config-tools.pl will refuse to install without a manual override.
Comment 47•16 years ago
xr-linux-tbox needed restarting for r/o partitions today.
Assignee
Updated•16 years ago
Assignee: joduinn → nobody
Status: ASSIGNED → NEW
Component: Build & Release → Release Engineering
QA Contact: build → release
Summary: production-trunk-automation VM's / drive is being mounted ro → update linux VM kernel to fix intermittent "nfs drive mounted ro" problem
Assignee
Updated•16 years ago
Assignee: nobody → joduinn
Comment 48•16 years ago
Just ran into this on the newly minted fx-linux-1.9-slave1; we need to get this kernel upgrade into the ref platform (if indeed it helps this problem).

(In reply to comment #47)
> xr-linux-tbox needed restarting for r/o partitions today.

Was this one upgraded?
Assignee
Comment 49•16 years ago
No; we stopped after hitting the VMtools problem in comment#45 and comment#46. We'll need to figure out the VMtools problem before we can resume.
Component: Release Engineering: Talos → Release Engineering
Comment 50•16 years ago
production-1.8-master had /data (/dev/sdd1) go r/o today. It has an updated kernel per comment #42.
Reporter
Comment 51•16 years ago
sm-try2-linux-slave was also hit. I updated the kernel and rebooted it.
Alias: read-only
Summary: update linux VM kernel to fix intermittent "nfs drive mounted ro" problem → update linux VM kernel to fix intermittent "drive mounted ro" problem
Reporter
Comment 52•16 years ago
fx-linux-1.9-slave1 (the new one) was hit today. Upgraded the kernel and rebooted it.
Reporter
Comment 53•16 years ago
sm-try1-linux-slave was hit too.
Comment 54•16 years ago
fyi, these were all on netapp-c-fcal1.
Assignee
Comment 55•16 years ago
robcee hit the same r/o problem this morning with:

qm-moz2-unittest01
qm-moz2-centos5-01

...so he upgraded the kernel on both of these.
Assignee
Comment 56•16 years ago
(In reply to comment #50)
> production-1.8-master had /data (/dev/sdd1) go r/o today. It has an updated
> kernel per comment #42.

Nick: yikes.

Aravind, mrz, Justin: The production-1.8-master VM was already updated with the kernel that supposedly fixed these read-only problems. Now what? Should I continue rolling out this kernel update, or is there something else causing this problem? Is production-1.8-master also on netapp-c-fcal1, and if so, could there be a problem with that netapp?
Assignee
Comment 57•16 years ago
Justin and I talked earlier this week; I didn't get a chance to update this bug until now.

1) I should continue with the kernel rollout. It's fixing known kernel bugs anyway, and is a good habit; for comparison, IT does this approximately once a quarter across all their own systems as proactive maintenance.

2) We should file a separate bug for IT to track this intermittent read-only issue. It's possibly not just a VM-kernel issue; it might also be a problem with a specific ESX host or a specific shelf. I've now filed bug#430821.
Comment 58•16 years ago
Two r/o failings today:

bm-centos5-moz2-01
qm-centos05-02

I'm updating the kernel on both now, and will update qm-centos5-03 at the same time as a preventative measure.
Assignee
Comment 59•16 years ago
qm-centos5-02 failed to boot after coop's kernel update this afternoon. Possible compiler differences while rebuilding vmtools? Kernel update was reverted by IT to stop qm-centos5-02 burning on tinderbox. For details, see bug#430820.
Depends on: 430820
Reporter
Comment 60•16 years ago
production-master was hit over the weekend - it was not running the new kernel. See bug 431136.
Comment 61•16 years ago
Rebuilt the vmware modules on fx-linux-1.9-slave2 and rebooted, as the clock was drifting without the time sync. As per usual, ignored the warning about the kernel being compiled with gcc 4.1.2 while 4.1.1 is installed, and the failure to create the vmhgfs module (for the Shared folders feature that we don't use).
Assignee
Comment 62•16 years ago
This morning, qm-centos5-02 was in readonly mode. Restarted. See Bug#432012 for details.
Assignee
Comment 63•16 years ago
qm-centos5-02 is read-only again this morning.
Assignee
Comment 64•16 years ago
Updated the kernel on qm-centos5-02. Before:

$ uname -a
Linux qm-centos5-02.mozilla.org 2.6.18-8.el5 #1 SMP Thu Mar 15 19:57:35 EDT 2007 i686 i686 i386 GNU/Linux

I updated the kernel using the following steps, as "su":

# yum update kernel kernel-headers kernel-devel
# shutdown -r now

...and after reboot see:

$ uname -a
Linux qm-centos5-02.mozilla.org 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:36:49 EST 2008 i686 i686 i386 GNU/Linux

Finally, in the VI client, I noticed "VMware Tools: out of date" for qm-centos5-02. With mrz's supervision, I went into the VI client, right-clicked on the hostname, picked "Install/Upgrade VMware tools", clicked ok, and waited a couple of minutes for it to complete. Once completed, I rebooted the VM and confirmed that it now shows "VMware Tools: OK".
Comment 65•16 years ago
I didn't think the kernel version in the yum repo included the fix we needed for this?
Comment 66•16 years ago
According to vmware, it was backported to centos/rhel 5.1, which is the kernel rev you are pulling, so you should be set.
Comment 67•16 years ago
Sorry for the bug noise. We're tracking these kernel upgrades in a few bugs now, so it's a bit confusing. Seems to have fixed it for now on qm-centos5-02, though it's still failing for some other reason.
Assignee
Updated•16 years ago
Whiteboard: needs scheduled downtime
Assignee
Comment 68•16 years ago
After all the back-and-forth, along with updating kernels, we're now restarting this.

For any centos4 VM, we're holding for dependent bug#432933.

For any centos5 VM, we'll see if it's running kernel 2.6.18-53.1.14.el5 (the newest available right now), which has the read-only fix. If so, great. Otherwise, if the VM is running an older kernel, we'll update it to 2.6.18-53.1.14.el5 using the instructions in comment#64.
Assignee
Comment 69•16 years ago
staging-master.build.mozilla.org was on kernel 2.6.18-53.1.6.el5, and is now updated to kernel 2.6.18-53.1.19.el5. VM rebooted, and VMware tools updated.
Comment 70•16 years ago
qm-centos5-01 was 2.6.18-8.el5
qm-centos5-02 was 2.6.18-53.1.14.el5
qm-centos5-03 was 2.6.18-53.1.14.el5

All three are now 2.6.18-53.1.19.el5; also updated vmware tools on qm-centos5-02.

I used

yum update {kernel,kernel-devel,kernel-headers}-2.6.18-53.1.19.el5

to force the package version (so that we get the same version everywhere and aren't vulnerable to other kernel updates being released while completing our rollout).
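The pin-to-one-version approach above implies a pre-check: skip boxes already at or past the target kernel. A minimal sketch (the kernel_is_at_least helper is an assumption for illustration, not something used in this bug), relying on GNU sort -V for version ordering:

```shell
# True (exit 0) when version $1 sorts at or after target $2.
kernel_is_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Versions taken from this bug's rollout:
kernel_is_at_least "2.6.18-53.1.19.el5" "2.6.18-53.1.14.el5" && echo "already current"
kernel_is_at_least "2.6.18-8.el5" "2.6.18-53.1.19.el5" || echo "needs update"
```

On a live box, the first argument would come from `uname -r`.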
Comment 71•16 years ago
qm-moz2-unittest01 was 2.6.18-8.el5, now 2.6.18-53.1.19.el5, as per the method above. Updated vmware tools as well.
Comment 72•16 years ago
... not true.
Comment 73•16 years ago
qm-moz2-unittest01, a cloned copy of qm-centos5-moz2-01, is at 2.6.18-53.1.14.el5.
Reporter
Comment 74•16 years ago
staging-1.8-master went read-only on /, /builds, and /var this morning. When I was looking for a new kernel (before rebooting) I got a bus error:

[root@staging-build-console /]# yum update kernel kernel-smp
Bus error

When I rebooted it, I had to manually fsck /builds before it would boot up. This is a CentOS 4.4 machine, so we only got up to 2.6.9-67.0.15.EL (CentOS 4.4). Nick tells me we need CentOS 4.5 to fix the problem here.
Comment 75•16 years ago
I don't think we (at least IT) have verified which kernel rev the vmware fix is in on the 4.x train. 5.x has been well documented - I'll see if I can't get an answer out of vmware today, given aravind is out.
Comment 76•16 years ago
I was poking around on a Centos mirror and found http://www.centos.org/modules/smartfaq/faq.php?faqid=34 So it looks like updating the kernel to the "latest" will get us the latest-for-4.x rather than 4.4. They seem to be up to Centos 4.6 now, so we should be beyond the Centos 4 Update 5 requirement in the VMWare doc.
Comment 77•16 years ago
Sounds good - looks like we are good to go with the kernel from 4.6. If possible, please post the kernel rev you settle on so we have docs on what all the machines should be at.
Assignee
Comment 78•16 years ago
Tried to update staging-1.9-master, following the steps in comment#70. However, as a new kernel (.1.21) is now available, those instructions don't work anymore.

Updated staging-1.9-master with instructions that justdave came up with; these worked just great:

$ su -
# yum install yum-utils
# yumdownloader kernel-2.6.18-53.1.19.el5
# yumdownloader kernel-devel-2.6.18-53.1.19.el5
# yumdownloader kernel-headers-2.6.18-53.1.19.el5
# rpm -ivh kernel-2.6.18-53.1.19.el5.i686.rpm
# rpm -ivh kernel-devel-2.6.18-53.1.19.el5.i686.rpm
# rpm -ivh kernel-headers-2.6.18-53.1.19.el5.i386.rpm

...and to confirm success, did:

# yum list installed | grep kernel

Then reboot, install new VMtools in the VI client, and restart the buildbot master & slave running on this machine.
Assignee
Comment 79•16 years ago
(ps: by copying around the 3 rpm files, we can avoid installing yum-utils, and running yumdownloader on each machine.)
Comment 80•16 years ago
Updated l10n-linux-tbox using the steps in comment #78.
Reporter
Comment 81•16 years ago
qm-rhel02 had /build go read-only today; it appears to be on netapp-d-fcal1. Rebooted to fix it, and had to manually fsck /build. Updated the kernel to 2.6.9-67.0.15.EL. Restarted the masters, hoping the slaves will bring themselves back up. Added init scripts for auto-restart of the masters.
Comment 82•16 years ago
Updated CentOS-5.0-ref-tools-vm to kernel-2.6.18-53.1.19.el5 using comment #78, except "rpm -Uvh" was needed for the kernel-headers.
Comment 83•16 years ago
See my comments on https://bugzilla.mozilla.org/show_bug.cgi?id=435134

I didn't see this ticket before I posted that comment, but you are having all of the symptoms we had. This is not a problem at the VM OS level; it has to do with the back-end storage between ESX and your NetApp.

When the VM's OS issues a disk read/write out to the [virtual] disk controller, eventually that operation is handed off to the ESX kernel. If the ESX kernel fails to handle that request in time, or sends back garbage data for a long enough period of time, the read/write operation at the virtual OS level fails and generally gets logged as a 'hardware' error in the VM's logs. In your case, it looks like this is showing up as a journal write failure on your VM's file system. Your VMs are re-mounting the volume as RO to reduce the risk of corrupting the disk as a result of journal write failures.

If you can identify which of your ESX servers was running the VM at the time of the read-only mount, you can match up the VM's logs against the ESX server's logs. Both will report an I/O failure at the same time ~90% of the time (see vmkernel logs on the ESX servers). I don't have the iSCSI version of the error message handy, but here is what it looks like with NFS. On Aug 24 at 05:17:02, we had a Disk and Symmpi error in the Windows Event log on one of our VMs. In /var/log/vmkernel:

Aug 24 05:17:35 xyz-vmsrv2 vmkernel: 13:13:12:38.762 cpu3:1123)NFSLock: 514: Stop accessing fd 0x621a4fc 4 xyz-vmsrv2
Aug 24 05:17:35 xyz-vmsrv2 vmkernel: 13:13:12:38.762 cpu3:1123)NFSLock: 514: Stop accessing fd 0x621c32c 4 xyz-vmsrv2

There is a hacky pseudo-workaround for this at the VM level (on Linux and on Windows), which is to increase the timeout before the OS concludes that the read or write operation has failed. It will reduce the incidence of the error messages in the logs and prevent the read-only re-mount in most cases, but you're really just covering up the symptom and accepting incredibly poor disk performance. Make sure that this isn't the fix that you're applying. If it is, be aware that the term 'fix' should be used loosely. See my other comment for more info on troubleshooting/possible solutions.
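For reference, the timeout workaround dismissed above is typically applied by raising the per-disk SCSI command timeout. One common form is a udev rule setting the sysfs timeout attribute; this fragment is an illustration only (the 180-second value and the file path are assumptions, and per the comment it merely masks the symptom rather than fixing the storage):

```
# /etc/udev/rules.d/99-scsi-timeout.rules (hypothetical path and value)
# Sets the same knob as /sys/block/sdX/device/timeout for each SCSI disk.
ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0", ATTR{timeout}="180"
```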
Comment 84•16 years ago
Updated tb-linux-tbox from 2.6.18-53.1.4.el5 to 2.6.18-53.1.19.el5.
Reporter
Comment 85•16 years ago
Updated fxdbug-linux-tbox to 2.6.18-53.1.21.el5 after it hit a build error ('encountered NUL character...') today.
Comment 86•16 years ago
l10n-linux-tbox went r/o in bug 437176, despite being updated to the 2.6.18-53.1.19 kernel (comment #78). The first glitch was at

Jun 2 19:32:40 l10n-linux-tbox kernel: mptscsih: ioc0: attempting task abort! (sc=e1a43800)

according to the system log.
Comment 87•16 years ago
xr-linux-tbox pulled a similar trick; it's using 2.6.18-53.1.14.el5.

Jun 2 19:32:32 xr-linux-tbox kernel: mptscsih: ioc0: attempting task abort! (sc=c2f200c0)
Jun 2 19:32:32 xr-linux-tbox kernel: sd 0:0:0:0:
Jun 2 19:32:32 xr-linux-tbox kernel: command: Write(10): 2a 00 00 20 00 4f 00 00 18 00
Jun 2 19:32:32 xr-linux-tbox kernel: mptscsih: ioc0: task abort: SUCCESS (sc=c2f200c0)

Repeated several times, then later:

Jun 3 14:54:10 xr-linux-tbox kernel: EXT3-fs error (device sdb1): htree_dirblock_to_tree: bad entry in directory #314816: rec_len is smaller than minimal - offset=0, inode=2628, rec_len=0, name_len=0
Jun 3 14:54:10 xr-linux-tbox kernel: Aborting journal on device sdb1.
Jun 3 14:54:11 xr-linux-tbox kernel: ext3_abort called.
Jun 3 14:54:11 xr-linux-tbox kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Jun 3 14:54:11 xr-linux-tbox kernel: Remounting filesystem read-only
Reporter
Comment 88•16 years ago
moz2-linux-slave02 had /builds go read-only today. upgraded kernel from 2.6.18-53.1.14.el5 to 2.6.18-53.1.21.el5 and rebooted.
Comment 89•16 years ago
Do we want to take the latest kernel at install time, or pick a specific version for consistency across the farm?
Reporter
Comment 90•16 years ago
I installed the latest.
Assignee
Comment 91•16 years ago
qm-rhel02 had its drive go r/o this morning, and it already has kernel 2.6.9-67.0.15.ELsmp. Justin suspects this is not a kernel problem, and is instead fallout from the ongoing netapp-c woes in bug#435134.
Comment 92•16 years ago
qm-rhel02 went read only sometime around 11:55, 2008-06-09. Restarted qm-rhel02 and started buildbot masters.
Comment 93•16 years ago
qm-rhel02 is r/o again - appears to have happened not long after the last reboot/master restart.
Comment 94•16 years ago
I believe we have a fix implemented. Have had no errors on the netapp or esx hosts for the last 1.5 hours and all storage migrations are working. Would like to monitor overnight to ensure we truly have a fix. Let's bring things back up and watch closely for issues. If this is a fix, we may have to do some minor cleanup, but nothing urgent and can be scheduled.
Assignee
Comment 95•16 years ago
Noted that staging slave "moz2-linux-slave04" has a newer kernel than all production slaves "moz2-linux-slave*". Upgraded moz2-linux-slave06 to see if that explains the http proxy problem we are hitting in bug#430200.

$ su -
# yum install yum-utils
# yumdownloader kernel-2.6.18-92.1.10.el5
# yumdownloader kernel-devel-2.6.18-92.1.10.el5
# yumdownloader kernel-headers-2.6.18-92.1.10.el5
# rpm -ivh kernel-2.6.18-92.1.10.el5.i686.rpm
# rpm -ivh kernel-devel-2.6.18-92.1.10.el5.i686.rpm
# rpm -ivh kernel-headers-2.6.18-92.1.10.el5.i386.rpm

...and to confirm success, did:

# yum list installed | grep kernel
kernel.i686             2.6.18-92.1.10.el5    installed
kernel.i686             2.6.18-53.1.19.el5    installed
kernel-devel.i686       2.6.18-92.1.10.el5    installed
kernel-devel.i686       2.6.18-53.1.19.el5    installed
kernel-headers.i386     2.6.18-53.1.19.el5    installed

Then rebooted the VM, installed new VMtools in the VI client, and restarted the buildbot slave running on this machine.
Whiteboard: needs scheduled downtime
Assignee
Comment 96•16 years ago
Updating summary, as we haven't hit r/o drive problems since the ESX host problems in bug#435134 were fixed.
Severity: major → normal
Depends on: 435134
Priority: P2 → P3
Summary: update linux VM kernel to fix intermittent "drive mounted ro" problem → update linux VM kernels to same new stable version
Reporter
Comment 97•15 years ago
I found this bug while looking for another one... it looks like this is all done. All of the moz2-* and try-* linux VMs are on 2.6.18-53.1.19.el5. Going to close this as FIXED.
Status: NEW → RESOLVED
Closed: 17 years ago → 15 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Release Engineering