Closed Bug 430821 Opened 16 years ago Closed 16 years ago

even after kernel update, centos5 VMs hit intermittent "drive mounted ro" problem

Categories

(Release Engineering :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: joduinn)

Details

(Whiteboard: vmware SR#1111450081)

Attachments

(1 file)

We've been hitting this problem since at least Dec2007, and thought it needed us to update the kernel used by our build VMs. Updating the kernels was done in bug#407796.

However, even after updating the kernel of production-1.8-master, this VM still went r/o. See bug#407796#c50. After talking with Justin, we're continuing the rollout of the kernel updates anyway, as it's still a good thing to do.

How can this be fixed? Each time this happens, it crashes a running production build system.
Is this possibly not just a kernel problem, but instead a problem with an ESX host or a specific shelf?
Probably not - a short sampling of VMs from bug 407796:

production-1.8-master / bm-vmware10 / netapp-c-fcal1
bm-centos5-moz2-01 / bm-vmware06 / netapp-d-fcal1
qm-centos05-02 / qm-vmware01 / netapp-b-vmware

Nothing in common except the network.  I'll file a case with VMware to see if there's anything in the logs that would help. 
Assignee: server-ops → mrz
production-master (/var and /builds) was hit over the weekend. It was running 2.6.18-8.el5 and has now been updated to 2.6.18-53.1.14.el5.
Whiteboard: vmware SR#1111450081
From VMware:

We have a kb regarding this issue. Could you please confirm if the problem
you are facing applies to this KB?

--------------------------------------------------------------------------

RHEL5, RHEL4 U4, RHEL4 U3, SLES10, and SLES9 SP3 File Systems may Become
Read-Only
==================================================================================
Last Modified Date: 04-03-2008
ID: 51306

Products
========	
VMware ESX Server

Details
=======	
VMware has identified a problem with RHEL5, RHEL4 U4, RHEL4 U3, SLES10, and
SLES9 SP3 guest operating systems. Their file systems may become read-only in
the event of busy I/O retry or path failover of the ESX Server's SAN or iSCSI
storage.
 
This issue may affect other Linux distributions based on early 2.6 kernels as
well, such as Ubuntu 7.04.
 
The same behavior is expected even on a native Linux environment, where the
time required for the file system to become read-only depends on the number
of paths available to a particular target, the multi-path software installed
on the operating system, and whether the failing I/O was to an EXT3 Journal.
However, the problem is aggravated in an ESX Server environment because ESX
Server manages multiple paths to the storage target and provides a single
path to the guest operating system, which effectively reduces the number of
retries done by the guest operating system.

Solution
========	
This is not an ESX Server bug. This Linux kernel bug has been fixed as of
version 2.6.22.
 
Note: This article does not supersede the Guest Operating System Installation
Guide. A guest operating system upgrade may require an ESX Server upgrade as
well.
 
For RHEL5, the resolution is to upgrade to Update 1, also referred to as
RHEL5.1.
 
For RHEL4 U3 and RHEL4 U4, the resolution is to upgrade to Update 5, also
referred to as RHEL 4.5.
 
For SLES10, the resolution is to upgrade to SP1. For more information, see
TID 3584352 - Filesystem goes read-only in VMware
 
For SLES9 SP3, the resolution is to upgrade to SP4, or SP3 Maintenance
Release build 2.6.5-7.286.
For more information, see TID 3584352 - Filesystem goes read-only in VMware
 
For Ubuntu 7.04, the resolution is to upgrade to 7.10

Product Versions
================	
VMware ESX Server 2.5.x
VMware ESX Server 3.0.x
VMware ESX Server 3.5.x
We'd want to test that theory to be sure, but that sure sounds like it.
Change log entry: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ad8c31bb69d60c0c6bc6431bccdf67e5a96c0d31;hp=b364fd5081b02fa8a966a29eea2da628913fd4b8

This addresses the issue of VMware guest OS being remounted as read-only
because the underlying device was held busy too long, and at the
same time addresses Engenio MPP driver concerns over infinite retries.
This patch removes the code that snoops the SAM STATUS on busy, which
would be returning DID_BUS_BUSY; instead we return the status as is.
Retry handling seems to be properly handled in scsi_softirq_done,
where a busy SAM status would only occur for the time specified by
(cmd->allowed +1) * cmd->timeout_per_command.
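
To get a feel for the retry window that expression describes, here is a rough worked example in Python. The function name and the sample values (5 retries, a 30 second timeout) are assumptions for illustration only and are not taken from this bug or from our VMs; the timeout is treated as seconds for simplicity.

```python
def busy_retry_window(allowed: int, timeout_per_command: float) -> float:
    """Upper bound, in seconds, on how long a BUSY status keeps being retried.

    Mirrors the expression quoted in the commit message:
        (cmd->allowed + 1) * cmd->timeout_per_command
    """
    if allowed < 0 or timeout_per_command < 0:
        raise ValueError("allowed and timeout_per_command must be non-negative")
    return (allowed + 1) * timeout_per_command


# Illustrative values only (assumptions, not measured on any of our hosts).
window = busy_retry_window(allowed=5, timeout_per_command=30.0)
print(window)  # 180.0
```

With those assumed values, a device that stays busy for longer than about three minutes would exhaust the retries, which is the point at which the guest would see an I/O error.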
Case is closed on VMware's side since it's a known Linux kernel issue.  Passing back to RE.
Assignee: mrz → nobody
Component: Server Operations → Release Engineering
QA Contact: justin → release
Urg. Am I reading this correctly that a busy ESX host will trigger this bug?

Are we able to update centos5 -> centos5.1 at this point in the FF3.0 release? Or should we instead focus on offloading the ESX hosts as a "workaround"?
Priority: -- → P1
Can we get more recent kernels for CentOS 5? I bet we could pull a kernel from a newer release.

Even if that's not the case, I'd favour rolling our own kernel over upgrading the entire OS.
(In reply to comment #9)
> Can we get more recent kernels for CentOS 5? I bet we could pull a kernel from
> a newer release.
In bug#407796, we already updated to kernel 2.6.18-53.1.13.el5, which is the latest available for centos5. Even with this kernel, we still hit the read-only problem.


> Even if that's not the case, I'd favour rolling our own kernel over upgrading
> the entire OS.
ummm... rolling our own kernel sounds scary to me also! :-)

(In reply to comment #10)
> (In reply to comment #9)
> > Can we get more recent kernels for CentOS 5? I bet we could pull a kernel from
> > a newer release.
> In bug#407796, we already updated to kernel 2.6.18-53.1.13.el5, which is the
> latest available for centos5. Even with this kernel, we still hit the read-only
> problem.

That should be expected - 2.6.18 is less than 2.6.22.

CC'ing justdave - he might have some ideas.
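
The comparison above looks only at the upstream base version. A minimal Python sketch of that check follows; the function and constant names are hypothetical, and the 2.6.22 threshold comes from the VMware KB article quoted earlier. Distribution kernels can carry backported patches, so this check alone cannot say whether a particular el5 build contains the fix.

```python
import re
from typing import Tuple

# Upstream version named in the VMware KB article as containing the fix.
FIXED_UPSTREAM = (2, 6, 22)


def base_version(release: str) -> Tuple[int, ...]:
    """Return the upstream base version of a kernel release string.

    For example '2.6.18-53.1.13.el5' yields (2, 6, 18).
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def predates_upstream_fix(release: str) -> bool:
    """True if the base version is older than the upstream fixed version."""
    return base_version(release) < FIXED_UPSTREAM


print(predates_upstream_fix("2.6.18-53.1.13.el5"))  # True
```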
Before we go doing this on all our VMs everywhere, I'm going to see if a centos5.1 VM does actually fix the problem. See bug#431608 for details.
from offline emails:


1) Justin / mrz / aravind are planning to take the Centos5.1 source rpm, and build a 2.6.22 kernel for us which has the fix we need. They'd then try applying it to a volunteer VM to see if it works. The cool part of all this, if it works, is that we can use their kernel, *remain* on Centos5.0, fix the read-only problem, and still be running on a supported configuration.

2) Note: this would not solve the read-only problem for the Centos4 VMs. Will be tackled separately.

3) Justin: we're ok with trying out the new kernel on xr-linux-tbox. Just let us know a few minutes before you want to start, so we can warn folks?



Passing this bug back to IT to track creating new 2.6.22 kernel. Once the new kernel is working on xr-linux-tbox, we'll continue to use bug#407796 to track rolling out of the new kernel on all the build VMs. 
Assignee: nobody → server-ops
Blocks: read-only
Component: Release Engineering → Server Operations
QA Contact: release → justin
over to aravind for compile and testing
Assignee: server-ops → aravind
I am upgrading xr-linux-tbox to the newer kernel.  Let me know if you have any objections to doing this (will do this in about an hour or so).
According to the vmware KB article, the latest rhel kernels should already have the fix.  Updating the box to the latest rhel/centos kernel - 2.6.18-53.1.14.el5.

Steps to upgrade: yum update kernel kernel-headers

The xr-linux-tbox box originally had 2.6.18-8.el5.  Upgraded that to the above kernel.

Over to build and release.
Assignee: aravind → nobody
Component: Server Operations → Release Engineering
QA Contact: justin → release
Pardon the noise, but I'm trying to find the fix we need using this change log from the latest Centos-5.1 kernel-devel package.

Looks like the fix we want is 

* Thu 14 Jun 2007 Don Zickus <dzickus@redhat.com> [2.6.18-26.el5]
...
- [scsi] mpt adds DID_BUS_BUSY host status on scsi BUSY status (Chip Coldwell ) [228108]

The Redhat bug is
  https://bugzilla.redhat.com/show_bug.cgi?id=228108

Also verified by inspection that 
  linux-2.6-scsi-mpt-adds-did_bus_busy-status-on-scsi-busy.patch
is in kernel-2.6.18-53.1.14.el5.src.rpm and matches the patch at git.kernel.org that mrz provided.
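
For anyone repeating this check on another kernel package, the changelog search can be sketched in Python as below. The function name is hypothetical, and the sample text is just the two changelog lines quoted above; on a real box the text could come from something like the output of `rpm -q --changelog kernel`.

```python
import re
from typing import List


def lines_citing_bug(changelog: str, bug_id: str) -> List[str]:
    """Return changelog lines that cite bug_id inside square brackets."""
    hits = []
    for line in changelog.splitlines():
        # Each bracketed group may hold a version, a subsystem tag, or bug ids.
        for group in re.findall(r"\[([^\]]*)\]", line):
            if bug_id in re.split(r"[\s,]+", group.strip()):
                hits.append(line.strip())
                break
    return hits


# Sample taken from the changelog lines quoted in this comment.
SAMPLE = """\
* Thu 14 Jun 2007 Don Zickus <dzickus@redhat.com> [2.6.18-26.el5]
- [scsi] mpt adds DID_BUS_BUSY host status on scsi BUSY status (Chip Coldwell ) [228108]
"""

for hit in lines_citing_bug(SAMPLE, "228108"):
    print(hit)
```

A match in the changelog only shows that the entry is listed; comparing the patch file itself against the upstream commit, as done above, is still the stronger check.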
Taking this, as I'm already working on the other kernel update bug#407796.
Assignee: nobody → joduinn
Moving "vmware SR#1111450081" from whiteboard.
Whiteboard: vmware SR#1111450081 → needs scheduled downtime
sorry for bugspam, updated the wrong bug. Reverted. 
Whiteboard: needs scheduled downtime → vmware SR#1111450081
For centos5, it looks like 2.6.18-53.1.14.el5 does fix the problem. The VM qm-centos5-02 has been up for just over 48 hours since its kernel was updated to 2.6.18-53.1.14.el5. For the few days before the update, this VM was going read-only within about 12 hours. This kernel solution looks good for Centos5, so closing this bug.

bug#407796 will continue to track rolling out this kernel update across all our VMs.

Filing separate bug#432933 to track solving read-only kernel bug in Centos4.
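
While the rollout continues, a quick way to spot a VM that has gone read-only is to look for the `ro` option in the `/proc/mounts` format. A minimal Python sketch is below; the function name and the device names in the sample are hypothetical, and it assumes the affected partitions are ext3.

```python
from typing import List, Sequence


def read_only_mounts(mounts_text: str, fs_types: Sequence[str] = ("ext3",)) -> List[str]:
    """Return mount points whose mount options include 'ro'.

    mounts_text uses the /proc/mounts layout:
        device mount_point fs_type options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in fs_types and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


# Hypothetical sample; on a live VM pass open("/proc/mounts").read() instead.
SAMPLE = """\
/dev/sda1 / ext3 rw,data=ordered 0 0
/dev/sdb1 /builds ext3 ro,data=ordered 0 0
/dev/sdc1 /var ext3 rw,data=ordered 0 0
proc /proc proc rw 0 0
"""

print(read_only_mounts(SAMPLE))  # ['/builds']
```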
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Summary: even after kernel update, some linux VMs hit intermittent "drive mounted ro" problem → even after kernel update, centos5 VMs hit intermittent "drive mounted ro" problem
Product: mozilla.org → Release Engineering