Closed Bug 407796 (read-only) Opened 17 years ago Closed 15 years ago

update linux VM kernels to same new stable version

Categories

(Release Engineering :: General, defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: joduinn)

References

Details

It refuses to mount rw:
[root@production-trunk-automation ~]# pwd
/root
[root@production-trunk-automation ~]# mount -o remount,rw /
mount: block device /dev/sda1 is write-protected, mounting read-only
[root@production-trunk-automation ~]# touch foo
touch: cannot touch `foo': Read-only file system
[root@production-trunk-automation ~]# 


This is blocking Firefox 3.0 Beta 2
From dmesg after reboot:

EXT3-fs error (device sda1): ext3_free_blocks_sb: bit already cleared for block 1963514
Aborting journal on device sda1.
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda1) in ext3_truncate: Journal has aborted
EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda1) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
ext3_abort called.
EXT3-fs error (device sda1): ext3_remount: Abort forced by user
ext3_abort called.
EXT3-fs error (device sda1): ext3_remount: Abort forced by user
There is nothing named production-trunk-automation in VI, so I can't
troubleshoot this - what's it called in VI?
Justin and I confirmed it's production-trunk-master on bm-vmware07, before Justin got dragged off to something else.
we also confirmed that after it failed out the first time, we rebooted the vm and hit the same error again. Aravind is looking at it now, while Justin is in a meeting.
Assignee: server-ops → aravind
Box is up now. There are some other reports on lkml about that kernel causing problems like this, but no one seems to have confirmed the actual bug.
Folks tell me we can't upgrade kernels on these boxes willy-nilly, so we should probably figure out a good time to ensure that these boxes have the latest and greatest kernels for that centos/rhel release.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
This happened again this morning. Slightly different errors this time:
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 17843020
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 5767279
Buffer I/O error on device sda1, logical block 720902
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 5767663
Buffer I/O error on device sda1, logical block 720950
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 17843036
Buffer I/O error on device sda2, logical block 5377
lost page write due to I/O error on sda2
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 18537636
Buffer I/O error on device sda2, logical block 92202
lost page write due to I/O error on sda2
Aborting journal on device sda2.
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 5767279
EXT3-fs error (device sda1): ext3_get_inode_loc: unable to read inode block - inode=720331, block=720902
Aborting journal on device sda1.
EXT3-fs error (device sda1) in ext3_reserve_inode_write: IO failure
EXT3-fs error (device sda1) in ext3_dirty_inode: IO failure
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
Severity: blocker → critical
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Rebooted and it fscked again. It's OK now. Do these I/O errors indicate hardware problems on the netapp?
No - if they did, all the VMs on the netapp would be having issues, and in this case, none of them are. Aravind found links to this error which referenced issues with the specific kernel you are using, but John asked us to wait to upgrade until after beta.
Severity: critical → major
Please re-open this when we can get together and figure out how to go about upgrading kernels on these buildbot boxes.
Status: REOPENED → RESOLVED
Closed: 17 years ago
Resolution: --- → INCOMPLETE
Reopening.

We have shipped beta2, and we have a quiet period between releases, so now would be a good time to do this.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Is this going to be just a kernel update? If the nightly and release boxes are diverging (by updating the latter from the reference platform) we ought to consciously OK any tool-chain changes.
(In reply to comment #12)
> Is this going to be just a kernel update? If the nightly and release boxes are
> diverging (by updating the latter from the reference platform) we ought to
> consciously OK any tool-chain changes.

This bug is only about the consoles, right? (build-console, production-trunk-automation, and the respective staging boxes)
So it is, I was remembering the similar problems we've occasionally had with xr-linux-tbox and l10n-linux-tbox.
Yup, to begin with I just want a kernel upgrade.  I would like to upgrade to the latest and greatest stable 2.6 kernel (if it works).  Which boxes can I do this on?

Or would you guys want to handle these upgrades yourself?
Guys, any update on this?
staging-trunk-automation needs this soon; it got mounted ro over the weekend again. If you're around tomorrow that would be a good time. If not, let's do it Wednesday.

It's OK to shut down/reboot this machine when you're ready; please give me a day/time, though.

Other machines need this too, but maybe we should wait until we see how this one goes to do them?
Sorry Aravind; I forgot to update this bug with the machine names after our chat-walking-to-lunch. The four buildmasters which should get the new kernel upgrade are:

staging-trunk-automation.build.mozilla.org
production-trunk-automation.build.mozilla.org
staging-build-console.build.mozilla.org 
build-console.build.mozilla.org 

How about starting with the two staging VMs first, and if all that goes ok, then we can do the other two VMs? Anytime that suits you is fine with us, just ping us on IRC before you start, so we can stop any work while you upgrade, ok?
I can do the upgrade, sometime today if that's okay with you guys. I agree with bhearsum that we should probably just do this one to see if it helps before we upgrade others.

But, if you guys want, I can upgrade the rest of them as well.
staging-build-console is free, so if you want to upgrade that now, go for it!

Please hold off upgrading staging-trunk-automation right now, as it's currently running a test which will take another 8 hours to complete. We'll update this bug as soon as it's done.

After the two staging VMs are upgraded, how about we test them - if all tests look ok, we'll give the go-ahead to update the production VMs in a day or two?

Does that sound ok to you?
staging-build-console upgraded from 2.6.9-42.ELsmp to 2.6.9-67.0.1.ELsmp.
Looks like fx-linux-tbox had this problem recently, bug 410386. We should get it upgraded at some point, too.
(In reply to comment #22)
> Looks like fx-linux-tbox had this problem recently, bug 410386. We should get
> it upgraded at some point, too.

justdave tried a new kernel, and it had the same problem.
Given build owns the OS/build environment, I'd rather you guys take over the kernel updates (given we don't handle any security updates for these machines) to avoid delays and the back-and-forth of coordinating schedules. Also, I've been told having a consistent and reliable build environment is paramount, so I think build should handle any major changes such as kernel updates rather than us changing things...

Aravind - can you document the process for updating the kernels in this bug?  Once done, we'll move the bug to the build queue.  If the new kernel proves not to be the solution, server-ops should keep it until a solution is found, then hand it off to build for the mass changes.

(In reply to comment #24)
> Aravind - can you document the process for updating the kernels in this bug? 
> Once done, we'll move the bug to the build queue.  If the new kernel proves not
> to be the solution, server-ops should keep it until a solution is found, then
> hand it off to build for the mass changes.
> 

Sure, all I did was to run "yum update kernel kernel-smp". That will install the new kernels. If you are running vmware guest tools, then you'd need to console in and run vmware-config-tools.pl, which will re-do your interfaces for you. Let me know if you need more information.
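
Put together, one pass over a CentOS 4 guest looks roughly like this (a sketch of the steps described above, not a verbatim transcript; run as root):

# yum update kernel kernel-smp
# shutdown -r now
...then, from the VI console after the reboot, if the guest runs vmware tools:
# vmware-config-tools.pl

The last step is what re-does the network interfaces after the kernel change.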
fx-linux-tbox has a newer kernel available (2.6.18-53.1.4.el5).  I can upgrade this box whenever you guys are ready.
I don't think it's too urgent to do this. Whenever the next downtime/tree closure is seems like a good time to take care of it.
/var on staging-trunk-automation got mounted read-only after the kernel upgrade was done. I _think_ the kernel upgrade helped a bit, but it's hard to measure this.
Given that this one has already recreated the failing read-only case, not sure it's worth rolling out to other machines. 

Is there another/newer kernel patch that might be worth trying? 
(In reply to comment #29)
> Given that this one has already recreated the failing read-only case, not sure
> it's worth rolling out to other machines. 
> 
> Is there another/newer kernel patch that might be worth trying? 
> 

Which box are you talking about?  I upgraded staging-build-console.
For record keeping, tb-linux-tbox had /builds go ro today (bug 412412).
(In reply to comment #28)
> /var on staging-trunk-automation got mounted read-only after the kernel upgrade
> was done. I _think_ the kernel upgrade helped a bit, but it's hard to measure
> this.
Bhearsum and I talked about this last night, and cannot confirm whether the read-only problem was actually seen on staging-trunk-automation (kernel not upgraded) or on staging-build-console (kernel upgraded). Therefore, it's too early to tell whether the new kernel fixed the problem.

Let's watch staging-build-console to see if it happens again.
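
While we watch it, a quick way to spot an affected filesystem without waiting for a build to fail (a standard /proc/mounts check, noted here as an aside):

$ awk '$4 ~ /(^|,)ro(,|$)/ {print $2, $4}' /proc/mounts

Any mountpoint it prints is currently mounted read-only.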
(In reply to comment #31)
> For record keeping, tb-linux-tbox had /builds go ro today (bug 412412).
fyi: Aravind confirmed last night that tb-linux-tbox did not get this kernel update.
(In reply to comment #33)
> (In reply to comment #31)
> > For record keeping, tb-linux-tbox had /builds go ro today (bug 412412).
> fyi: Aravind confirmed last night that tb-linux-tbox did not get this kernel
> update.
> 

Given the instructions are in the bug - can you upgrade and see if you still see the r/o issue? Let's track which machines have the new kernel here - so far staging-build-console is the only one, and there's no confirmation yet of whether the issue still exists.
Talked to John about this.  I am moving this back to build and release, since at this point this is just a waiting game.  B&R has the instructions on how to update the kernels.

If they find that the new kernels don't fix the problems, please move it back to us and I will look into compiling or getting pre-built newer kernels.
Assignee: aravind → nobody
Status: REOPENED → NEW
Component: Server Operations → Build & Release
QA Contact: justin → build
Assignee: nobody → joduinn
Priority: -- → P2
staging-build-console.build.mozilla.org was renamed to staging-1.8-master (already updated by Aravind).

staging-trunk-automation.build.mozilla.org was renamed to staging-1.9-master
production-trunk-automation.build.mozilla.org was renamed to production-1.9-master
build-console.build.mozilla.org was renamed to production-1.8-master

I'll start doing these 3 today.
Status: NEW → ASSIGNED
I've updated staging-1.9-master to 2.6.18-53.1.6.el5.
staging-trunk-automation.build.mozilla.org / staging-1.9-master is now updated to 2.6.18-53.1.4.el5, and rebooted.

Note: we also had to do "yum install kernel-headers-2.6.18-53.1.4.el5". Thanks to bhearsum for helping with updating the kernel-source-headers, needed for rebuilding vmware packages, as part of this kernel update.
(In reply to comment #38)
> Note: we also had to do "yum install kernel-headers-2.6.18-53.1.4.el5". Thanks
> to bhearsum for helping with updating the kernel-source-headers, needed for
> rebuilding vmware packages, as part of this kernel update.
> 

Actually, I installed kernel-headers.i386 and kernel-devel.i686. kernel-devel seems to be the necessary one for vmware tools.
to clarify: 

joduinn installed kernel-headers-2.6.18-53.1.4.el5
bhearsum installed kernel-headers.i386 and kernel-devel.i686

No longer sure which one was "needed", but wanted to clarify what exactly was done.
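
For anyone retracing this later, a standard rpm query shows which of those packages actually landed on a given box (an aside, not one of the original steps):

# rpm -q kernel kernel-headers kernel-devel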
For record keeping, fx-linux-tbox had /var turn "ro" today... see details in bug#18471.
As "su", I did the following:
# yum update kernel
# yum update kernel-headers
# yum update kernel-devel
# shutdown -r now

...updating each of the following:
staging-1.9-master (centos5) now at 2.6.18-53.1.13.el5
production-1.9-master (centos5) now at 2.6.18-53.1.13.el5
staging-1.8-master (centos4) now at 2.6.9-67.0.1.EL
production-1.8-master (centos4) now at 2.6.9-67.0.1.EL

...and confirmed each buildbot master seems to be up and pinging its slaves ok.
Updated the following:

fx-linux-1.9-slave1 (centos5) now at 2.6.18-53.1.13.el5
fx-linux-1.9-slave2 (centos5) now at 2.6.18-53.1.13.el5

...and confirmed each buildbot slave is up and pinging the master ok. 
As "su", I did the following:
# up2date -f kernel
# shutdown -r now

For this older version of RedHat, there were no "kernel-headers" or "kernel-devel". I updated the following VMs:

staging-prometheus-vm (RedHat Ent3update8) now at  2.4.21-27.0.4.EL 
production-prometheus-vm (RedHat Ent3update8) now at  2.4.21-27.0.4.EL 

...and confirmed each buildbot slave is up and pinging master ok. 
As "su", I did the following:
# yum update kernel
# yum update kernel-headers
# yum update kernel-devel
# shutdown -r now

...updating each of the following:
fx-linux-tbox (centos5) now at 2.6.18-53.1.13.el5
fxdbug-linux-tbox (centos5) now at 2.6.18-53.1.13.el5

...and confirmed tinderbox processes are up and running ok. 

Note: during the fxdbug-linux-tbox update, it initially refused to reboot. After help from justdave, we discovered two things:
1) the instructions for this kernel update missed needing to install vmware tools on the VMs. I will have to go back to each upgraded machine and confirm this is in place.
2) the setup of fxdbug-linux-tbox is different from all the other machines in how rc.local was set up. This caused problems when trying to reboot after the kernel update. It's unclear why this machine is different from all the others, and it's unclear what the "right" setting is. See details in bug#420007.
1) Yes they did.  The only ones I see in this bug do anyway.  See comment 25.

2) Do the others have that rc.local hack already?  If not, there's nothing any different.  The rc.local hack wouldn't work in your case anyway because the new kernel was built with a different gcc than the one that's on the system, so vmware-config-tools.pl will refuse to install anyway without a manual override.
xr-linux-tbox needed restarting for r/o partitions today.
Assignee: joduinn → nobody
Status: ASSIGNED → NEW
Component: Build & Release → Release Engineering
QA Contact: build → release
Summary: production-trunk-automation VM's / drive is being mounted ro → update linux VM kernel to fix intermittent "nfs drive mounted ro" problem
Assignee: nobody → joduinn
Just ran into this on the newly minted fx-linux-1.9-slave1, we need to get this kernel upgrade into the ref platform (if indeed it helps this problem).

(In reply to comment #47)
> xr-linux-tbox needed restarting for r/o partitions today.

Was this one upgraded?
No, we stopped after hitting the VMtools problem in comment#45 and comment#46. We'll need to figure the VMtools problem out before we can resume.
Component: Release Engineering: Talos → Release Engineering
production-1.8-master had /data (/dev/sdd1) go r/o today. It has an updated kernel per comment #42.
sm-try2-linux-slave was also hit. I updated the kernel and rebooted it.
Alias: read-only
Summary: update linux VM kernel to fix intermittent "nfs drive mounted ro" problem → update linux VM kernel to fix intermittent "drive mounted ro" problem
fx-linux-1.9-slave1 (the new one) was hit today. Upgraded the kernel and rebooted it.
sm-try1-linux-slave was hit too.
fyi, these were all on netapp-c-fcal1.
robcee hit the same r/o problem this morning with:

qm-moz2-unittest01
qm-moz2-centos5-01

...so he upgraded the kernel on both of these.
(In reply to comment #50)
> production-1.8-master had /data (/dev/sdd1) go r/o today. It has an updated
> kernel per comment #42.

Nick: yikes.

Aravind, mrz, Justin: The production-1.8-master VM was already updated with the kernel that supposedly fixed these read-only problems. Now what? Should I continue rolling out this kernel update, or is there something else causing this problem? Is production-1.8-master also on netapp-c-fcal1, and if so, could there be a problem with that netapp?
Justin and I talked earlier this week; I didn't get a chance to update this bug until now. 

1) I should continue with the kernel rollout. It's fixing known kernel bugs anyway, and it's a good habit. For comparison, IT does this approximately once a quarter across all their own systems as proactive maintenance.

2) We should file a separate bug for IT to track this intermittent read-only issue. It's possibly not just a VM-kernel issue; it might also be a problem with a specific ESX host or a specific shelf. I've now filed bug#430821.
Two r/o failures today:

bm-centos5-moz2-01
qm-centos5-02

I'm updating the kernel on both now, and will update qm-centos5-03 at the same time as a preventative measure.
qm-centos5-02 failed to boot after coop's kernel update this afternoon. Possible compiler differences while rebuilding vmtools? Kernel update was reverted by IT to stop qm-centos5-02 burning on tinderbox. For details, see bug#430820.
Depends on: 430820
production-master was hit over the weekend - it was not running the new kernel. bug 431136
Rebuilt the vmware modules on fx-linux-1.9-slave2 and rebooted, as the clock was drifting without the time sync. As per usual, ignored the warning about the kernel being compiled with gcc 4.1.2 while 4.1.1 is installed, and the failure to create the vmhgfs module (for the Shared folders feature that we don't use).
This morning, qm-centos5-02 was in read-only mode. Restarted. See bug#432012 for details.
qm-centos5-02 is read-only again this morning.
Updated the kernel on qm-centos5-02 as follows:

$ uname -a 
Linux qm-centos5-02.mozilla.org 2.6.18-8.el5 #1 SMP Thu Mar 15 19:57:35 EDT 2007 i686 i686 i386 GNU/Linux

I've now updated the kernel using the following steps:
As "su", I did the following:
# yum update kernel kernel-headers kernel-devel
# shutdown -r now

...and after reboot see:
$ uname -a
Linux qm-centos5-02.mozilla.org 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:36:49 EST 2008 i686 i686 i386 GNU/Linux

Finally, in the VI client, I noticed "VMware Tools: out of date" for qm-centos5-02. With mrz's supervision, I went into the VI client, right-clicked the hostname, picked "Install/Upgrade VMware tools", clicked OK, and waited a couple of minutes for it to complete. Once completed, I rebooted the VM and confirmed that it now shows "VMware Tools: OK". 
I didn't think the kernel version in the yum repo included the fix we needed for this?
According to vmware, it was backported to centos/rhel 5.1, which is the kernel rev you are pulling, so you should be set.
Sorry for the bug noise. We're tracking these kernel upgrades in a few bugs now, so it's a bit confusing. Seems to have fixed it for now on qm-centos5-02, though it's still failing for some other reason.
Whiteboard: needs scheduled downtime
After all the back-and-forth, along with updating kernels, we're now restarting this. 

For any centos4 VM, we're holding for dependent bug#432933.

For any centos5 VM, we'll see if it's running kernel 2.6.18-53.1.14.el5 (the newest available right now), which has the read-only fix. If so, great. Otherwise, if the VM is running an older kernel, we'll update it to 2.6.18-53.1.14.el5, using the instructions in comment#64.
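
Per VM, that check-then-update pass is roughly the following (a sketch of the plan above; the version check was done by hand, not scripted):

# uname -r
...if that already shows 2.6.18-53.1.14.el5, skip the box; otherwise:
# yum update kernel kernel-headers kernel-devel
# shutdown -r now

(Note: an unpinned "yum update" pulls whatever is newest in the repo at the time; comment #70 and comment #78 below show ways to pin a specific version.)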
staging-master.build.mozilla.org was on kernel 2.6.18-53.1.6.el5, and is now updated to kernel 2.6.18-53.1.19.el5. VM rebooted, and VMware tools updated. 
qm-centos5-01 was 2.6.18-8.el5
qm-centos5-02 was 2.6.18-53.1.14.el5
qm-centos5-03 was 2.6.18-53.1.14.el5

All three now 2.6.18-53.1.19.el5 - also updated vmware tools on qm-centos5-02.

I used
 yum update {kernel,kernel-devel,kernel-headers}-2.6.18-53.1.19.el5
to force the package version (so that we get the same version everywhere and aren't vulnerable to other kernel updates being released while completing our rollout).

qm-moz2-unittest01 was 2.6.18-8.el5

now 2.6.18-53.1.19.el5

as per method above. updated vmware tools as well.
... not true. As a cloned copy of qm-centos5-moz2-01, qm-moz2-unittest01 is at 2.6.18-53.1.14.el5.
staging-1.8-master went read-only on /, /builds, and /var this morning. When I was looking for a new kernel (before rebooting) I got a bus error:
[root@staging-build-console /]# yum update kernel kernel-smp
Bus error

When I rebooted it, I had to manually fsck /builds before it would boot up.

This is a CentOS 4.4 machine, so we only got up to 2.6.9-67.0.15.EL (CentOS 4.4). Nick tells me we need CentOS 4.5 to fix the problem here.
I don't think we (at least IT) have verified which kernel rev the vmware fix is in on the 4.x train.  5.x has been well documented - I'll see if I can't get an answer out of vmware today given aravind is out.
I was poking around on a Centos mirror and found 
  http://www.centos.org/modules/smartfaq/faq.php?faqid=34

So it looks like updating the kernel to the "latest" will get us the latest-for-4.x rather than 4.4. They seem to be up to Centos 4.6 now, so we should be beyond the Centos 4 Update 5 requirement in the VMWare doc.
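
As an aside, which point release a given 4.x box has ended up on can be confirmed from the standard release file:

# cat /etc/redhat-release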
sounds good - looks like we are good to go with the kernel from 4.6.  if possible, please post the kernel rev you settle on so we have docs on what all the machines should be at.
Tried to update staging-1.9-master, following steps in comment#70. However, as a new kernel (.1.21) is now available, these instructions don't work anymore. 

Updated staging-1.9-master with instructions that justdave came up with; these worked just great:

$ su -
# yum install yum-utils
# yumdownloader kernel-2.6.18-53.1.19.el5
# yumdownloader kernel-devel-2.6.18-53.1.19.el5
# yumdownloader kernel-headers-2.6.18-53.1.19.el5
# rpm -ivh kernel-2.6.18-53.1.19.el5.i686.rpm
# rpm -ivh kernel-devel-2.6.18-53.1.19.el5.i686.rpm
# rpm -ivh kernel-headers-2.6.18-53.1.19.el5.i386.rpm
...and to confirm success, did:
# yum list installed | grep kernel

reboot, install new VMtools on VI client, and then restart the buildbot master & slave running on this machine.
(ps: by copying the 3 rpm files around, we can avoid installing yum-utils and running yumdownloader on each machine.)
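
That copy-the-rpms shortcut would look something like this (the target host here is a placeholder, shown only to illustrate the point):

$ scp kernel-2.6.18-53.1.19.el5.i686.rpm \
      kernel-devel-2.6.18-53.1.19.el5.i686.rpm \
      kernel-headers-2.6.18-53.1.19.el5.i386.rpm \
      root@some-target-vm:/tmp/
# rpm -ivh /tmp/kernel-*2.6.18-53.1.19.el5*.rpm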
Updated l10n-linux-tbox using the steps in comment #78.
qm-rhel02 had /build go read-only today, it appears to be on netapp-d-fcal1. Rebooted to fix it, had to manually fsck /build.
Updated kernel to 2.6.9-67.0.15.EL
Restarted the masters, hoping the slaves will bring themselves back up.
Added init scripts for auto-restart of masters.
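
For the record, whether it's a proper /etc/init.d script or an rc.local line, the core of such a hook is just starting the master at boot. A minimal sketch, with the user and basedir as placeholders rather than the actual values used:

su - buildbot -c 'buildbot start /builds/buildmaster'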
Updated CentOS-5.0-ref-tools-vm to kernel-2.6.18-53.1.19.el5 using comment #78, except "rpm -Uvh" was needed for the kernel-headers.
See my comments on https://bugzilla.mozilla.org/show_bug.cgi?id=435134
I didn't see this ticket before I posted that comment, but you are having all of the symptoms we had.

This is not a problem at the VM OS level, it has to do with the back-end storage between ESX and your NetApp.  When the VM's OS issues a disk read/write out to the [virtual] disk controller, eventually that operation is handed off to the ESX kernel.  If the ESX kernel fails to handle that request in time or sends back garbage data for a long enough period of time, the read/write operation at the virtual OS level fails and generally gets logged as a 'hardware' error in the VM's logs.  In your case, it looks like this is showing up as a journal write failure on your VM's file system.  Your VMs are re-mounting the volume as RO to reduce the risk of corrupting the disk as a result of journal write failures.

If you can identify which of your ESX servers was running the VM at the time of the read-only mount, you can match up the VM's logs against the ESX server's logs.  Both will report an IO failure at the same time ~90% of the time (see vmkernel logs on the ESX servers).  I don't have the iSCSI version of the error message handy, but here is what it looks like with NFS: 

On Aug 24 at 05:17:02, we had a Disk and Symmpi error in the Windows Event log on one of our VM's.  

In /var/log/vmkernel:
Aug 24 05:17:35 xyz-vmsrv2 vmkernel: 13:13:12:38.762 cpu3:1123)NFSLock: 514: Stop accessing fd 0x621a4fc  4	xyz-vmsrv2
Aug 24 05:17:35 xyz-vmsrv2 vmkernel: 13:13:12:38.762 cpu3:1123)NFSLock: 514: Stop accessing fd 0x621c32c  4	xyz-vmsrv2

There is a hack pseudo-workaround for this at the VM-level (on Linux and on Windows), which is to increase the timeout before the OS concludes that the read or write operation has failed.  It will reduce the incidence of the error messages in the logs and prevent the read-only re-mount in most cases, but you're really just covering up the symptom and accepting incredibly poor disk performance.  Make sure that this isn't the fix that you're applying.  If it is, be aware that the term 'fix' should be used loosely.
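
For reference, on a Linux guest that timeout bump is usually made per-device through sysfs; 180 seconds is the value commonly suggested for VMware guests. Shown here only to illustrate the workaround being described (and it does not persist across reboots unless repeated from an init script or udev rule):

# echo 180 > /sys/block/sda/device/timeout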

See my other comment for more info on troubleshooting/possible solutions.
Updated tb-linux-tbox from 2.6.18-53.1.4.el5 to 2.6.18-53.1.19.el5.
Updated fxdbug-linux-tbox to 2.6.18-53.1.21.el5 after it hit a build error ('encountered NUL character...') today.
l10n-linux-tbox went r/o in bug 437176, despite being updated to the 2.6.18-53.1.19 kernel (comment #78). 

First glitch was at 
Jun  2 19:32:40 l10n-linux-tbox kernel: mptscsih: ioc0: attempting task abort! (sc=e1a43800)
according to the system log.
xr-linux-tbox pulled a similar trick; it's using 2.6.18-53.1.14.el5.

Jun  2 19:32:32 xr-linux-tbox kernel: mptscsih: ioc0: attempting task abort! (sc=c2f200c0)
Jun  2 19:32:32 xr-linux-tbox kernel: sd 0:0:0:0: 
Jun  2 19:32:32 xr-linux-tbox kernel:         command: Write(10): 2a 00 00 20 00 4f 00 00 18 00
Jun  2 19:32:32 xr-linux-tbox kernel: mptscsih: ioc0: task abort: SUCCESS (sc=c2f200c0)

Repeated several times, then later
Jun  3 14:54:10 xr-linux-tbox kernel: EXT3-fs error (device sdb1): htree_dirblock_to_tree: bad entry in directory #314816: rec_len is smaller than minimal - offset=0, inode=2628, rec_len=0, name_len=0
Jun  3 14:54:10 xr-linux-tbox kernel: Aborting journal on device sdb1.
Jun  3 14:54:11 xr-linux-tbox kernel: ext3_abort called.
Jun  3 14:54:11 xr-linux-tbox kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Jun  3 14:54:11 xr-linux-tbox kernel: Remounting filesystem read-only
moz2-linux-slave02 had /builds go read-only today.

upgraded kernel from 2.6.18-53.1.14.el5 to 2.6.18-53.1.21.el5 and rebooted.
Do we want to take the latest kernel at install time, or pick a specific version for consistency across the Farm?
I installed the latest.
qm-rhel02 had its drive go r/o this morning, and it already has kernel:  2.6.9-67.0.15.ELsmp. 

Justin suspects this is not a kernel problem, and instead is fallout from the ongoing netapp-c woes in bug#435134.
qm-rhel02 went read only sometime around 11:55, 2008-06-09. Restarted qm-rhel02 and started buildbot masters.
qm-rhel02 is r/o again - appears to have happened not long after the last reboot/master restart.
I believe we have a fix implemented.  Have had no errors on the netapp or esx hosts for the last 1.5 hours and all storage migrations are working.  Would like to monitor overnight to ensure we truly have a fix.  Let's bring things back up and watch closely for issues.

If this is a fix, we may have to do some minor cleanup, but nothing urgent and can be scheduled.
Noted that staging slave "moz2-linux-slave04" has a newer kernel than all production slaves "moz2-linux-slave*". Upgraded moz2-linux-slave06 to see if that explains the http proxy problem we are hitting in bug#430200.

$ su -
# yum install yum-utils
# yumdownloader kernel-2.6.18-92.1.10.el5
# yumdownloader kernel-devel-2.6.18-92.1.10.el5
# yumdownloader kernel-headers-2.6.18-92.1.10.el5
# rpm -ivh kernel-2.6.18-92.1.10.el5.i686.rpm 
# rpm -ivh kernel-devel-2.6.18-92.1.10.el5.i686.rpm 
# rpm -ivh kernel-headers-2.6.18-92.1.10.el5.i386.rpm 
...and to confirm success, did:
# yum list installed | grep kernel
kernel.i686                              2.6.18-92.1.10.el5     installed       
kernel.i686                              2.6.18-53.1.19.el5     installed       
kernel-devel.i686                        2.6.18-92.1.10.el5     installed       
kernel-devel.i686                        2.6.18-53.1.19.el5     installed       
kernel-headers.i386                      2.6.18-53.1.19.el5     installed  

reboot VM, install new VMtools on VI client, and then restart the buildbot slave running on this machine.
Whiteboard: needs scheduled downtime
Updating summary, as we haven't hit r/o drive problems since the ESX host problems in bug#435134 were fixed.
Severity: major → normal
Depends on: 435134
Priority: P2 → P3
Summary: update linux VM kernel to fix intermittent "drive mounted ro" problem → update linux VM kernels to same new stable version
I found this bug while looking for another one... it looks like this is all done. All of the moz2-* and try-* linux VMs are on 2.6.18-53.1.19.el5. Going to close this as FIXED...
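
For anyone double-checking that later, a loop over the slave list is enough (hosts.txt here is a hypothetical file of hostnames, not something tracked in this bug):

$ for h in $(cat hosts.txt); do echo -n "$h: "; ssh -n "$h" uname -r; done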
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering