430821 - even after kernel update, centos5 VMs hit intermittent "drive mounted ro" problem

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Description

•

16 years ago

We've been hitting this problem since at least Dec 2007, and thought that fixing it required updating the kernel used by our build VMs. The kernels were updated in bug#407796.

However, even after updating the kernel of production-1.8-master, this VM still went r/o. See bug#407796#c50. After talking with Justin, we're continuing the rollout of the kernel updates anyway, as it's still a good thing to do.

How can this be fixed? Each time this happens, it crashes a running production build system.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Comment 1

•

16 years ago

Is this possibly not just a kernel problem, but instead a problem with an ESX host or a specific shelf?

matthew zeier [:mrz]

Comment 2

•

16 years ago

Probably not - a short sampling of VMs from bug 407796:

production-1.8-master / bm-vmware10 / netapp-c-fcal1
bm-centos5-moz2-01 / bm-vmware06 / netapp-d-fcal1
qm-centos05-02 / qm-vmware01 / netapp-b-vmware

Nothing in common except the network.  I'll file a case with VMware to see if there's anything in the logs that would help.

Assignee: server-ops → mrz

bhearsum@mozilla.com (:bhearsum)

Comment 3

•

16 years ago

production-master (/var and /builds) was hit over the weekend. It was running 2.6.18-8.el5 and is now updated to 2.6.18-53.1.14.el5.

matthew zeier [:mrz]

Updated

•

16 years ago

Whiteboard: vmware SR#1111450081

matthew zeier [:mrz]

Comment 4

•

16 years ago

From VMware:

We have a kb regarding this issue. Could you please confirm if the problem
you are facing applies to this KB?

--------------------------------------------------------------------------

RHEL5, RHEL4 U4, RHEL4 U3, SLES10, and SLES9 SP3 File Systems may Become Read-Only
==================================================================================
Last Modified Date: 04-03-2008
ID: 51306

Products
========	
VMware ESX Server

Details
=======	
VMware has identified a problem with RHEL5, RHEL4 U4, RHEL4 U3, SLES10, and
SLES9 SP3 guest operating systems. Their file systems may become read-only in
the event of busy I/O retry or path failover of the ESX Server's SAN or iSCSI
storage.
 
This issue may affect other Linux distributions based on early 2.6 kernels as
well, such as Ubuntu 7.04.
 
The same behavior is expected even on a native Linux environment, where the
time required for the file system to become read-only depends on the number
of paths available to a particular target, the multi-path software installed
on the operating system, and whether the failing I/O was to an EXT3 Journal.
However, the problem is aggravated in an ESX Server environment because ESX
Server manages multiple paths to the storage target and provides a single
path to the guest operating system, which effectively reduces the number of
retries done by the guest operating system.

Solution
========	
This is not an ESX Server bug. This Linux kernel bug has been fixed as of
version 2.6.22.
 
Note: This article does not supersede the Guest Operating System Installation
Guide. A guest operating system upgrade may require an ESX Server upgrade as
well.
 
For RHEL5, the resolution is to upgrade to Update 1, also referred to as
RHEL 5.1.
 
For RHEL4 U3 and RHEL4 U4, the resolution is to upgrade to Update 5, also
referred to as RHEL 4.5.
 
For SLES10, the resolution is to upgrade to SP1. For more information, see
TID 3584352 - Filesystem goes read-only in VMware.
 
For SLES9 SP3, the resolution is to upgrade to SP4, or SP3 Maintenance
Release build 2.6.5-7.286.
For more information, see TID 3584352 - Filesystem goes read-only in VMware.
 
For Ubuntu 7.04, the resolution is to upgrade to 7.10.

Product Versions
================	
VMware ESX Server 2.5.x
VMware ESX Server 3.0.x
VMware ESX Server 3.5.x

bhearsum@mozilla.com (:bhearsum)

Comment 5

•

16 years ago

We'd want to test that theory to be sure, but that sure sounds like it.

matthew zeier [:mrz]

Updated

•

16 years ago

URL: http://kb.vmware.com/selfservice/dyna...

matthew zeier [:mrz]

Comment 6

•

16 years ago

Change log entry: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ad8c31bb69d60c0c6bc6431bccdf67e5a96c0d31;hp=b364fd5081b02fa8a966a29eea2da628913fd4b8 which says:

This addresses the issue of VMware guest OS being remounted as read-only
because the underlying device was held busy too long and at the
same time addresses Engenio MPP driver concerns over infinite retries.
This patch removes the code that snoops the SAM STATUS on busy, which
would be returning DID_BUS_BUSY; instead we return the status as is.
Retry handling seems to be properly handled in scsi_softirq_done,
where a busy SAM status would only occur for the time specified by
(cmd->allowed + 1) * cmd->timeout_per_command.
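
To make the retry window in that last sentence concrete, here is a small worked example. It only evaluates the expression quoted in the commit message; the function name is hypothetical, and the input values (5 allowed retries, 30-second timeout) are illustrative assumptions, not values read from any of the affected VMs.

```python
def busy_retry_window(allowed: int, timeout_per_command: float) -> float:
    """Upper bound, in seconds, on how long a BUSY status keeps being retried.

    Mirrors the expression from the commit message:
    (cmd->allowed + 1) * cmd->timeout_per_command
    """
    return (allowed + 1) * timeout_per_command


# Illustrative inputs only (assumptions): 5 allowed retries, 30 s per command.
window = busy_retry_window(allowed=5, timeout_per_command=30.0)
print(f"BUSY is retried for at most {window:.0f} s")  # prints 180 s
```

With those assumed inputs, a device that stays busy for longer than three minutes would still surface an I/O error to the filesystem.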

matthew zeier [:mrz]

Comment 7

•

16 years ago

Case is closed on VMware's side since it's a known Linux kernel issue.  Passing back to RE.

Assignee: mrz → nobody

Component: Server Operations → Release Engineering

QA Contact: justin → release

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Comment 8

•

16 years ago

Urg. Am I reading this correctly that a busy ESX host will trigger this bug?

Are we able to update centos5 -> centos5.1 at this point in the FF3.0 release? Or should we instead focus on offloading the ESX hosts as a "workaround"?

Priority: -- → P1

bhearsum@mozilla.com (:bhearsum)

Comment 9

•

16 years ago

Can we get more recent kernels for CentOS 5? I bet we could pull a kernel from a newer release.

Even if that's not the case, I'd favour rolling our own kernel over upgrading the entire OS.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Comment 10

•

16 years ago

(In reply to comment #9)
> Can we get more recent kernels for CentOS 5? I bet we could pull a kernel from
> a newer release.
In bug#407796, we already updated to kernel 2.6.18-53.1.13.el5, which is the latest available for centos5. Even with this kernel, we still hit the read-only problem.


> Even if that's not the case, I'd favour rolling our own kernel over upgrading
> the entire OS.
ummm... rolling our own kernel sounds scary to me also! :-)

matthew zeier [:mrz]

Comment 11

•

16 years ago

(In reply to comment #10)
> (In reply to comment #9)
> > Can we get more recent kernels for CentOS 5? I bet we could pull a kernel from
> > a newer release.
> In bug#407796, we already updated to kernel 2.6.18-53.1.13.el5, which is the
> latest available for centos5. Even with this kernel, we still hit the read-only
> problem.

That should be expected - 2.6.18 is less than 2.6.22.

CC'ing justdave - he might have some ideas.
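
The version comparison above can be sketched as follows. This is a minimal sketch, assuming plain dotted upstream version strings; `parse_upstream` and `has_upstream_fix` are hypothetical helper names, and the 2.6.22 threshold is the one quoted in the VMware KB article. A distribution kernel can carry backported fixes under an older upstream number, so this is only a first approximation.

```python
def parse_upstream(version: str) -> tuple:
    """Turn '2.6.18' into (2, 6, 18) so fields compare numerically."""
    return tuple(int(part) for part in version.split("."))


def has_upstream_fix(version: str, fixed_in: str = "2.6.22") -> bool:
    """True if the upstream version is at or past the release with the fix."""
    return parse_upstream(version) >= parse_upstream(fixed_in)


print(has_upstream_fix("2.6.18"))  # False
print(has_upstream_fix("2.6.22"))  # True

# Comparing the raw strings would give the wrong answer for some versions:
print("2.6.9" > "2.6.22")                                   # True (misleading)
print(parse_upstream("2.6.9") > parse_upstream("2.6.22"))   # False (correct)
```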

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Comment 12

•

16 years ago

Before we go doing this on all our VMs everywhere, I'm going to see if a centos5.1 VM does actually fix the problem. See bug#431608 for details.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Updated

•

16 years ago

Depends on: 431608

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Comment 13

•

16 years ago

from offline emails:


1) Justin / mrz / aravind are planning to take the Centos5.1 source rpm, and build a 2.6.22 kernel for us which has the fix we need. They'd then try applying it to a volunteer VM to see if it works. The cool part of all this, if it works, is that we can use their kernel, *remain* on Centos5.0, fix the read-only problem, and still be running on a supported configuration.

2) Note: this would not solve the read-only problem for the Centos4 VMs. Will be tackled separately.

3) Justin: we're ok with trying out the new kernel on xr-linux-tbox. Just let us know a few minutes before you want to start, so we can warn folks?



Passing this bug back to IT to track creating new 2.6.22 kernel. Once the new kernel is working on xr-linux-tbox, we'll continue to use bug#407796 to track rolling out of the new kernel on all the build VMs.

Assignee: nobody → server-ops

Blocks: read-only

Component: Release Engineering → Server Operations

QA Contact: release → justin

Justin Fitzhugh

Comment 14

•

16 years ago

over to aravind for compile and testing

Assignee: server-ops → aravind

Aravind Gottipati [:aravind]

Comment 15

•

16 years ago

I am upgrading xr-linux-tbox to the newer kernel.  Let me know if you have any objections to doing this (will do this in about an hour or so).

Aravind Gottipati [:aravind]

Comment 16

•

16 years ago

According to the vmware KB article, the latest rhel kernels should already have the fix.  Updating the box to the latest rhel/centos kernel - 2.6.18-53.1.14.el5.

Steps to upgrade: yum update kernel kernel-headers

The xr-linux-tbox box originally had 2.6.18-8.el5.  Upgraded that to the above kernel.

Over to build and release.
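
A quick way to tell whether a VM still needs this update might look like the sketch below. It is not a real RPM version comparison: `release_key` and `needs_update` are hypothetical helpers that compare only the numeric fields before the `.el5` tag, and the target release is the one named above.

```python
import platform
import re

TARGET = "2.6.18-53.1.14.el5"  # kernel release named in this comment


def release_key(release: str) -> tuple:
    """'2.6.18-53.1.14.el5' -> (2, 6, 18, 53, 1, 14)."""
    # Keep only the part before the distro tag, then pull out the numbers.
    head = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", head))


def needs_update(running: str, target: str = TARGET) -> bool:
    """True if the running kernel release sorts before the target release."""
    return release_key(running) < release_key(target)


print(needs_update("2.6.18-8.el5"))        # True
print(needs_update("2.6.18-53.1.14.el5"))  # False

# On a live box, check the kernel that is actually running:
print(platform.release(), needs_update(platform.release()))
```

Note that `yum update` only installs the new kernel; the check on the running release changes only after a reboot into it.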

Assignee: aravind → nobody

Component: Server Operations → Release Engineering

QA Contact: justin → release

Nick Thomas [:nthomas] (UTC+12)

Comment 17

•

16 years ago

Attached file Change log for kernel-devel-2.6.18-53.1.14.el5.i686 — Details

Pardon the noise, but I'm trying to find the fix we need using this change log from the latest Centos-5.1 kernel-devel package.

Looks like the fix we want is 

* Thu 14 Jun 2007 Don Zickus <dzickus@redhat.com> [2.6.18-26.el5]
...
- [scsi] mpt adds DID_BUS_BUSY host status on scsi BUSY status (Chip Coldwell ) [228108]

The Redhat bug is
  https://bugzilla.redhat.com/show_bug.cgi?id=228108

Also verified by inspection that 
  linux-2.6-scsi-mpt-adds-did_bus_busy-status-on-scsi-busy.patch
is in kernel-2.6.18-53.1.14.el5.src.rpm and matches the patch at git.kernel.org that mrz provided.
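
The change log search described above can be sketched roughly as below. This is only an illustration: `find_changelog_items` is a hypothetical helper, it assumes the usual RPM change log layout (entry headers starting with `* `, items starting with `- `), and the sample text is just the lines quoted in this comment. Real input could come from something like `rpm -q --changelog kernel`.

```python
def find_changelog_items(changelog: str, needle: str) -> list:
    """Return (entry header, item line) pairs for items containing needle."""
    matches = []
    header = None
    for line in changelog.splitlines():
        if line.startswith("* "):
            # New change log entry: remember its date/author/version header.
            header = line[2:].strip()
        elif header is not None and needle in line:
            matches.append((header, line.strip()))
    return matches


# Sample taken from the lines quoted in this comment (elision kept as-is).
SAMPLE = """\
* Thu 14 Jun 2007 Don Zickus <dzickus@redhat.com> [2.6.18-26.el5]
...
- [scsi] mpt adds DID_BUS_BUSY host status on scsi BUSY status (Chip Coldwell ) [228108]
"""

for header, item in find_changelog_items(SAMPLE, "228108"):
    print(header)
    print("  " + item)
```

Searching on the Red Hat bug number rather than on free text keeps the match independent of how the patch summary is worded.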

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Comment 18

•

16 years ago

Taking this, as I'm already working on the other kernel update bug#407796.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Updated

•

16 years ago

Assignee: nobody → joduinn

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Comment 19

•

16 years ago

Moving "vmware SR#1111450081" from whiteboard.

Whiteboard: vmware SR#1111450081 → needs scheduled downtime

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Comment 20

•

16 years ago

sorry for bugspam, updated the wrong bug. Reverted.

Whiteboard: needs scheduled downtime → vmware SR#1111450081

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Assignee

Comment 21

•

16 years ago

For centos5, it looks like 2.6.18-53.1.14.el5 does fix the problem. The VM qm-centos5-02 has been up for just over 48 hours since its kernel was updated to 2.6.18-53.1.14.el5. Before this kernel update, the VM had been going read-only within about 12 hours for the last few days. This kernel solution looks good for Centos5, so closing this bug.

bug#407796 will continue to track rolling out this kernel update across all our VMs.

Filing separate bug#432933 to track solving read-only kernel bug in Centos4.
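
For anyone who wants to watch for the read-only condition while the rollout continues, a check along these lines might help. It is a minimal sketch, assuming `/proc/mounts` shows `ro` once the filesystem has been remounted read-only; `read_only_mounts` is a hypothetical helper, and the device names in the sample are made up.

```python
def read_only_mounts(mounts_text: str, fstypes=("ext3",)) -> list:
    """Return mount points of the given filesystem types mounted read-only."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


# Made-up sample in /proc/mounts format (device names are assumptions).
SAMPLE = """\
/dev/sda1 / ext3 rw 0 0
/dev/sdb1 /builds ext3 ro 0 0
proc /proc proc rw 0 0
"""

print(read_only_mounts(SAMPLE))  # ['/builds']
```

On a live VM the input would be `open("/proc/mounts").read()`; an empty list means no ext3 filesystem is currently read-only.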

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Summary: even after kernel update, some linux VMs hit intermittent "drive mounted ro" problem → even after kernel update, centos5 VMs hit intermittent "drive mounted ro" problem

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Release Engineering