Closed Bug 407796 (read-only) Opened 17 years ago Closed 15 years ago

update linux VM kernels to same new stable version

Categories

(Release Engineering :: General, defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: joduinn)

References

Details

It refuses to mount rw:
[root@production-trunk-automation ~]# pwd
/root
[root@production-trunk-automation ~]# mount -o remount,rw /
mount: block device /dev/sda1 is write-protected, mounting read-only
[root@production-trunk-automation ~]# touch foo
touch: cannot touch `foo': Read-only file system
[root@production-trunk-automation ~]# 


This is blocking Firefox 3.0 Beta 2
From dmesg after reboot:

EXT3-fs error (device sda1): ext3_free_blocks_sb: bit already cleared for block 1963514
Aborting journal on device sda1.
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda1) in ext3_truncate: Journal has aborted
EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda1) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
ext3_abort called.
EXT3-fs error (device sda1): ext3_remount: Abort forced by user
ext3_abort called.
EXT3-fs error (device sda1): ext3_remount: Abort forced by user
There is nothing named production-trunk-automation in VI, so I can't
troubleshoot this - what's it called in VI?
Justin and I confirmed it's production-trunk-master on bm-vmware07, before Justin got dragged off to something else.
we also confirmed that after it failed out the first time, we rebooted the vm and hit the same error again. Aravind is looking at it now, while Justin is in a meeting.
Assignee: server-ops → aravind
Box is up now. There are some other reports on lkml about that kernel causing problems like this, but no one seems to have confirmed the actual bug.
Folks tell me we can't upgrade kernels on these boxes willy-nilly, so we should probably figure out a good time to ensure that these boxes have the latest and greatest kernels for that centos/rhel release.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
This happened again this morning. Slightly different errors this time:
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 17843020
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 5767279
Buffer I/O error on device sda1, logical block 720902
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 5767663
Buffer I/O error on device sda1, logical block 720950
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 17843036
Buffer I/O error on device sda2, logical block 5377
lost page write due to I/O error on sda2
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 18537636
Buffer I/O error on device sda2, logical block 92202
lost page write due to I/O error on sda2
Aborting journal on device sda2.
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 5767279
EXT3-fs error (device sda1): ext3_get_inode_loc: unable to read inode block - inode=720331, block=720902
Aborting journal on device sda1.
EXT3-fs error (device sda1) in ext3_reserve_inode_write: IO failure
EXT3-fs error (device sda1) in ext3_dirty_inode: IO failure
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
Severity: blocker → critical
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Rebooted and it fscked again. It's OK now. Do these I/O errors indicate hardware problems on the netapp?
No - if they did, all the VMs on the netapp would be having issues, and in this case, none of them are. Aravind found links to this error which referenced issues with the specific kernel you are using, but John asked us to wait to upgrade until after beta.
Severity: critical → major
Please re-open this when we can get together and figure out how to go about upgrading kernels on these buildbot boxes.
Status: REOPENED → RESOLVED
Closed: 17 years ago
Resolution: --- → INCOMPLETE
Reopening.

We have shipped beta2, and we have a quiet period between releases, so now would be a good time to do this.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Is this going to be just a kernel update? If the nightly and release boxes are diverging (by updating the latter from the reference platform) we ought to consciously OK any tool-chain changes.
(In reply to comment #12)
> Is this going to be just a kernel update? If the nightly and release boxes are
> diverging (by updating the latter from the reference platform) we ought to
> consciously OK any tool-chain changes.

This bug is only about the consoles, right? (build-console, production-trunk-automation, and the respective staging boxes)
So it is, I was remembering the similar problems we've occasionally had with xr-linux-tbox and l10n-linux-tbox.
Yup, to begin with I just want a kernel upgrade.  I would like to upgrade to the latest and greatest stable 2.6 kernel (if it works).  Which boxes can I do this on?

Or would you guys want to handle these upgrades yourself?
Guys, any update on this?
staging-trunk-automation needs this soon; it got mounted ro over the weekend again. If you're around tomorrow that would be a good time. If not, let's do it Wednesday.

It's OK to shut down/reboot this machine when you're ready; please give me a day/time, though.

Other machines need this too, but maybe we should wait until we see how this one goes to do them?
Sorry Aravind; I forgot to update this bug with the machine names after our chat-walking-to-lunch. The four buildmasters which should get the new kernel upgrade are:

staging-trunk-automation.build.mozilla.org
production-trunk-automation.build.mozilla.org
staging-build-console.build.mozilla.org 
build-console.build.mozilla.org 

How about starting with the two staging VMs first, and if all that goes ok, then we can do the other two VMs? Anytime that suits you is fine with us, just ping us on IRC before you start, so we can stop any work while you upgrade, ok?
I can do the upgrade, sometime today if that's okay with you guys. I agree with bhearsum that we should probably just do this one to see if it helps before we upgrade others.

But, if you guys want, I can upgrade the rest of them as well.
staging-build-console is free, so if you want to upgrade that now, go for it!

Please hold off upgrading staging-trunk-automation right now, as it's currently running a test which will take another 8 hours to complete. We'll update this bug as soon as it's done.

After the two staging VMs are upgraded, how about we test them - if all tests look ok, we'll give the go-ahead to update the production VMs in a day or two?

Does that sound ok to you?
staging-build-console upgraded from 2.6.9-42.ELsmp to 2.6.9-67.0.1.ELsmp.
Looks like fx-linux-tbox had this problem recently, bug 410386. We should get it upgraded at some point, too.
(In reply to comment #22)
> Looks like fx-linux-tbox had this problem recently, bug 410386. We should get
> it upgraded at some point, too.

justdave tried a new kernel, and it had the same problem.
Given build owns the OS/build environment, I'd rather you guys take over the kernel updates (given we don't handle any security updates for these machines) to avoid delays and the back-and-forth of coordinating schedules. Also, I've been told having a consistent and reliable build environment is paramount, so I think build should handle any major changes such as kernel updates rather than us changing things...

Aravind - can you document the process for updating the kernels in this bug?  Once done, we'll move the bug to the build queue.  If the new kernel proves not to be the solution, server-ops should keep it until a solution is found, then hand it off to build for the mass changes.

(In reply to comment #24)
> Aravind - can you document the process for updating the kernels in this bug? 
> Once done, we'll move the bug to the build queue.  If the new kernel proves not
> to be the solution, server-ops should keep it until a solution is found, then
> hand it off to build for the mass changes.
> 

Sure, all I did was to run "yum update kernel kernel-smp". That will install the new kernels. If you are running vmware guest tools, then you'd need to console in and run vmware-config-tools.pl, which will re-do your interfaces for you. Let me know if you need more information.
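
Put together, one pass over a CentOS 4 guest looks roughly like this (a sketch of the steps described above, not a verbatim transcript; run as root):

# yum update kernel kernel-smp
# shutdown -r now
...then, from the VI console after the reboot, if the guest runs vmware tools:
# vmware-config-tools.pl

The last step is what re-does the network interfaces after the kernel change.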
fx-linux-tbox has a newer kernel available (2.6.18-53.1.4.el5).  I can upgrade this box whenever you guys are ready.
I don't think it's too urgent to do this. Whenever the next downtime/tree closure is seems like a good time to take care of it.
/var on staging-trunk-automation got mounted read-only after the kernel upgrade was done. I _think_ the kernel upgrade helped a bit, but it's hard to measure this.
Given that this one has already recreated the failing read-only case, not sure it's worth rolling out to other machines. 

Is there another/newer kernel patch that might be worth trying? 
(In reply to comment #29)
> Given that this one has already recreated the failing read-only case, not sure
> it's worth rolling out to other machines. 
> 
> Is there another/newer kernel patch that might be worth trying? 
> 

Which box are you talking about?  I upgraded staging-build-console.
For record keeping, tb-linux-tbox had /builds go ro today (bug 412412).
(In reply to comment #28)
> /var on staging-trunk-automation got mounted read-only after the kernel upgrade
> was done. I _think_ the kernel upgrade helped a bit, but it's hard to measure
> this.
Bhearsum and I talked about this last night, and cannot confirm whether the read-only problem was actually seen on staging-trunk-automation (kernel not upgraded) or on staging-build-console (kernel upgraded). Therefore, it's too early to tell whether the new kernel fixed the problem.

Let's watch staging-build-console to see if it happens again.
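
While we watch it, a quick way to spot an affected filesystem without waiting for a build to fail (a standard /proc/mounts check, noted here as an aside):

$ awk '$4 ~ /(^|,)ro(,|$)/ {print $2, $4}' /proc/mounts

Any mountpoint it prints is currently mounted read-only.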
(In reply to comment #31)
> For record keeping, tb-linux-tbox had /builds go ro today (bug 412412).
fyi: Aravind confirmed last night that tb-linux-tbox did not get this kernel update.
(In reply to comment #33)
> (In reply to comment #31)
> > For record keeping, tb-linux-tbox had /builds go ro today (bug 412412).
> fyi: Aravind confirmed last night that tb-linux-tbox did not get this kernel
> update.
> 

Given the instructions are in the bug - can you upgrade and see if you still see the r/o issue? Let's track which machines have the new kernel here - so far staging-build-console is the only one, and there's no confirmation yet of whether the issue still exists.
Talked to John about this.  I am moving this back to build and release, since at this point this is just a waiting game.  B&R has the instructions on how to update the kernels.

If they find that the new kernels don't fix the problems, please move it back to us and I will look into compiling or getting pre-built newer kernels.
Assignee: aravind → nobody
Status: REOPENED → NEW
Component: Server Operations → Build & Release
QA Contact: justin → build
Assignee: nobody → joduinn
Priority: -- → P2
staging-build-console.build.mozilla.org was renamed to staging-1.8-master (already updated by Aravind).

staging-trunk-automation.build.mozilla.org was renamed to staging-1.9-master
production-trunk-automation.build.mozilla.org was renamed to production-1.9-master
build-console.build.mozilla.org was renamed to production-1.8-master

I'll start doing these 3 today.
Status: NEW → ASSIGNED
I've updated staging-1.9-master to 2.6.18-53.1.6.el5.
staging-trunk-automation.build.mozilla.org / staging-1.9-master is now updated to 2.6.18-53.1.4.el5, and rebooted.

Note: we also had to do "yum install kernel-headers-2.6.18-53.1.4.el5". Thanks to bhearsum for helping with updating the kernel-source-headers, needed for rebuilding vmware packages, as part of this kernel update.
(In reply to comment #38)
> Note: we also had to do "yum install kernel-headers-2.6.18-53.1.4.el5". Thanks
> to bhearsum for helping with updating the kernel-source-headers, needed for
> rebuilding vmware packages, as part of this kernel update.
> 

Actually, I installed kernel-headers.i386 and kernel-devel.i686. kernel-devel seems to be the necessary one for vmware tools.
to clarify: 

joduinn installed kernel-headers-2.6.18-53.1.4.el5
bhearsum installed kernel-headers.i386 and kernel-devel.i686

No longer sure which one was "needed", but wanted to clarify what exactly was done.
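
For anyone retracing this later, a standard rpm query shows which of those packages actually landed on a given box (an aside, not one of the original steps):

# rpm -q kernel kernel-headers kernel-devel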
For record keeping, fx-linux-tbox had /var turn "ro" today... see details in bug#18471.
As "su", I did the following:
# yum update kernel
# yum update kernel-headers
# yum update kernel-devel
# shutdown -r now

...updating each of the following:
staging-1.9-master (centos5) now at 2.6.18-53.1.13.el5
production-1.9-master (centos5) now at 2.6.18-53.1.13.el5
staging-1.8-master (centos4) now at 2.6.9-67.0.1.EL
production-1.8-master (centos4) now at 2.6.9-67.0.1.EL

...and confirmed each buildbot master seems to be up and pinging its slaves ok.
Updated the following:

fx-linux-1.9-slave1 (centos5) now at 2.6.18-53.1.13.el5
fx-linux-1.9-slave2 (centos5) now at 2.6.18-53.1.13.el5

...and confirmed each buildbot slave is up and pinging the master ok. 
As "su", I did the following:
# up2date -f kernel
# shutdown -r now

For this older version of RedHat, there were no "kernel-headers" or "kernel-devel". I updated the following VMs:

staging-prometheus-vm (RedHat Ent3update8) now at  2.4.21-27.0.4.EL 
production-prometheus-vm (RedHat Ent3update8) now at  2.4.21-27.0.4.EL 

...and confirmed each buildbot slave is up and pinging master ok. 
As "su", I did the following:
# yum update kernel
# yum update kernel-headers
# yum update kernel-devel
# shutdown -r now

...updating each of the following:
fx-linux-tbox (centos5) now at 2.6.18-53.1.13.el5
fxdbug-linux-tbox (centos5) now at 2.6.18-53.1.13.el5

...and confirmed tinderbox processes are up and running ok. 

Note: during the fxdbug-linux-tbox update, it initially refused to reboot. After help from justdave, we discovered two things:
1) the instructions for this kernel update missed needing to install vmware tools on the VMs. I will have to go back to each upgraded machine and confirm this is in place.
2) the setup of fxdbug-linux-tbox is different from all the other machines in how rc.local was set up. This caused problems when trying to reboot after the kernel update. It's unclear why this machine is different from all the others, and it's unclear what the "right" setting is. See details in bug#420007.
1) Yes they did.  The only ones I see in this bug do anyway.  See comment 25.

2) Do the others have that rc.local hack already?  If not, there's nothing any different.  The rc.local hack wouldn't work in your case anyway because the new kernel was built with a different gcc than the one that's on the system, so vmware-config-tools.pl will refuse to install anyway without a manual override.
xr-linux-tbox needed restarting for r/o partitions today.
Assignee: joduinn → nobody
Status: ASSIGNED → NEW
Component: Build & Release → Release Engineering
QA Contact: build → release
Summary: production-trunk-automation VM's / drive is being mounted ro → update linux VM kernel to fix intermittent "nfs drive mounted ro" problem
Assignee: nobody → joduinn
Just ran into this on the newly minted fx-linux-1.9-slave1, we need to get this kernel upgrade into the ref platform (if indeed it helps this problem).

(In reply to comment #47)
> xr-linux-tbox needed restarting for r/o partitions today.

Was this one upgraded?
No, we stopped after hitting the VMtools problem in comment#45 and comment#46. We'll need to figure the VMtools problem out before we can resume.
Component: Release Engineering: Talos → Release Engineering
production-1.8-master had /data (/dev/sdd1) go r/o today. It has an updated kernel per comment #42.
sm-try2-linux-slave was also hit. I updated the kernel and rebooted it.
Alias: read-only
Summary: update linux VM kernel to fix intermittent "nfs drive mounted ro" problem → update linux VM kernel to fix intermittent "drive mounted ro" problem
fx-linux-1.9-slave1 (the new one) was hit today. Upgraded the kernel and rebooted it.
sm-try1-linux-slave was hit too.
fyi, these were all on netapp-c-fcal1.
robcee hit the same r/o problem this morning with:

qm-moz2-unittest01
qm-moz2-centos5-01

...so he upgraded the kernel on both of these.
(In reply to comment #50)
> production-1.8-master had /data (/dev/sdd1) go r/o today. It has an updated
> kernel per comment #42.

Nick: yikes.

Aravind, mrz, Justin: The production-1.8-master VM was already updated with the kernel that supposedly fixed these read-only problems. Now what? Should I continue rolling out this kernel update, or is there something else causing this problem? Is production-1.8-master also on netapp-c-fcal1, and if so, could there be a problem with that netapp?
Justin and I talked earlier this week; I didn't get a chance to update this bug until now. 

1) I should continue with the kernel rollout. It's fixing known kernel bugs anyway, and it's a good habit. For comparison, IT does this approximately once a quarter across all their own systems as proactive maintenance.

2) We should file a separate bug for IT to track this intermittent read-only issue. It's possibly not just a VM-kernel issue; it might also be a problem with a specific ESX host or a specific shelf. I've now filed bug#430821.
Two r/o failures today:

bm-centos5-moz2-01
qm-centos5-02

I'm updating the kernel on both now, and will update qm-centos5-03 at the same time as a preventative measure.
qm-centos5-02 failed to boot after coop's kernel update this afternoon. Possible compiler differences while rebuilding vmtools? Kernel update was reverted by IT to stop qm-centos5-02 burning on tinderbox. For details, see bug#430820.
Depends on: 430820
production-master was hit over the weekend - it was not running the new kernel. bug 431136
Rebuilt the vmware modules on fx-linux-1.9-slave2 and rebooted, as the clock was drifting without the time sync. As per usual, ignored the warning about the kernel being compiled with gcc 4.1.2 while 4.1.1 is installed, and the failure to create the vmhgfs module (for the Shared folders feature that we don't use).
This morning, qm-centos5-02 was in read-only mode. Restarted. See bug#432012 for details.
qm-centos5-02 is read-only again this morning.
Updated the kernel on qm-centos5-02 as follows:

$ uname -a 
Linux qm-centos5-02.mozilla.org 2.6.18-8.el5 #1 SMP Thu Mar 15 19:57:35 EDT 2007 i686 i686 i386 GNU/Linux

I've now updated the kernel using the following steps:
As "su", I did the following:
# yum update kernel kernel-headers kernel-devel
# shutdown -r now

...and after reboot see:
$ uname -a
Linux qm-centos5-02.mozilla.org 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:36:49 EST 2008 i686 i686 i386 GNU/Linux

Finally, in the VI client, I noticed "VMware Tools: out of date" for qm-centos5-02. With mrz's supervision, I went into the VI client, right-clicked the hostname, picked "Install/Upgrade VMware tools", clicked OK, and waited a couple of minutes for it to complete. Once completed, I rebooted the VM and confirmed that it now shows "VMware Tools: OK". 
I didn't think the kernel version in the yum repo included the fix we needed for this?
According to vmware, it was backported to centos/rhel 5.1, which is the kernel rev you are pulling, so you should be set.
Sorry for the bug noise. We're tracking these kernel upgrades in a few bugs now, so it's a bit confusing. Seems to have fixed it for now on qm-centos5-02, though it's still failing for some other reason.
Whiteboard: needs scheduled downtime
After all the back-and-forth, along with updating kernels, we're now restarting this. 

For any centos4 VM, we're holding for dependent bug#432933.

For any centos5 VM, we'll see if it's running kernel 2.6.18-53.1.14.el5 (the newest available right now), which has the read-only fix. If so, great. Otherwise, if the VM is running an older kernel, we'll update it to 2.6.18-53.1.14.el5, using the instructions in comment#64.
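
Per VM, that check-then-update pass is roughly the following (a sketch of the plan above; the version check was done by hand, not scripted):

# uname -r
...if that already shows 2.6.18-53.1.14.el5, skip the box; otherwise:
# yum update kernel kernel-headers kernel-devel
# shutdown -r now

(Note: an unpinned "yum update" pulls whatever is newest in the repo at the time; comment #70 and comment #78 below show ways to pin a specific version.)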
staging-master.build.mozilla.org was on kernel 2.6.18-53.1.6.el5, and is now updated to kernel 2.6.18-53.1.19.el5. VM rebooted, and VMware tools updated. 
qm-centos5-01 was 2.6.18-8.el5
qm-centos5-02 was 2.6.18-53.1.14.el5
qm-centos5-03 was 2.6.18-53.1.14.el5

All three now 2.6.18-53.1.19.el5 - also updated vmware tools on qm-centos5-02.

I used
 yum update {kernel,kernel-devel,kernel-headers}-2.6.18-53.1.19.el5
to force the package version (so that we get the same version everywhere and aren't vulnerable to other kernel updates being released while completing our rollout).

qm-moz2-unittest01 was 2.6.18-8.el5

now 2.6.18-53.1.19.el5

as per method above. updated vmware tools as well.
... not true. As a cloned copy of qm-centos5-moz2-01, qm-moz2-unittest01 is at 2.6.18-53.1.14.el5.
staging-1.8-master went read-only on /, /builds, and /var this morning. When I was looking for a new kernel (before rebooting) I got a bus error:
[root@staging-build-console /]# yum update kernel kernel-smp
Bus error

When I rebooted it, I had to manually fsck /builds before it would boot up.

This is a CentOS 4.4 machine, so we only got up to 2.6.9-67.0.15.EL (CentOS 4.4). Nick tells me we need CentOS 4.5 to fix the problem here.
I don't think we (at least IT) have verified which kernel rev the vmware fix is in on the 4.x train.  5.x has been well documented - I'll see if I can't get an answer out of vmware today given aravind is out.
I was poking around on a Centos mirror and found 
  http://www.centos.org/modules/smartfaq/faq.php?faqid=34

So it looks like updating the kernel to the "latest" will get us the latest-for-4.x rather than 4.4. They seem to be up to Centos 4.6 now, so we should be beyond the Centos 4 Update 5 requirement in the VMWare doc.
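
As an aside, which point release a given 4.x box has ended up on can be confirmed from the standard release file:

# cat /etc/redhat-release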
sounds good - looks like we are good to go with the kernel from 4.6.  if possible, please post the kernel rev you settle on so we have docs on what all the machines should be at.
Tried to update staging-1.9-master, following steps in comment#70. However, as a new kernel (.1.21) is now available, these instructions don't work anymore. 

Updated staging-1.9-master with instructions that justdave came up with; these worked just great:

$ su -
# yum install yum-utils
# yumdownloader kernel-2.6.18-53.1.19.el5
# yumdownloader kernel-devel-2.6.18-53.1.19.el5
# yumdownloader kernel-headers-2.6.18-53.1.19.el5
# rpm -ivh kernel-2.6.18-53.1.19.el5.i686.rpm
# rpm -ivh kernel-devel-2.6.18-53.1.19.el5.i686.rpm
# rpm -ivh kernel-headers-2.6.18-53.1.19.el5.i386.rpm
...and to confirm success, did:
# yum list installed | grep kernel

reboot, install new VMtools on VI client, and then restart the buildbot master & slave running on this machine.
(ps: by copying the 3 rpm files around, we can avoid installing yum-utils and running yumdownloader on each machine.)
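
That copy-the-rpms shortcut would look something like this (the target host here is a placeholder, shown only to illustrate the point):

$ scp kernel-2.6.18-53.1.19.el5.i686.rpm \
      kernel-devel-2.6.18-53.1.19.el5.i686.rpm \
      kernel-headers-2.6.18-53.1.19.el5.i386.rpm \
      root@some-target-vm:/tmp/
# rpm -ivh /tmp/kernel-*2.6.18-53.1.19.el5*.rpm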
Updated l10n-linux-tbox using the steps in comment #78.
qm-rhel02 had /build go read-only today, it appears to be on netapp-d-fcal1. Rebooted to fix it, had to manually fsck /build.
Updated kernel to 2.6.9-67.0.15.EL
Restarted the masters, hoping the slaves will bring themselves back up.
Added init scripts for auto-restart of masters.
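
For the record, whether it's a proper /etc/init.d script or an rc.local line, the core of such a hook is just starting the master at boot. A minimal sketch, with the user and basedir as placeholders rather than the actual values used:

su - buildbot -c 'buildbot start /builds/buildmaster'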
Updated CentOS-5.0-ref-tools-vm to kernel-2.6.18-53.1.19.el5 using comment #78, except "rpm -Uvh" was needed for the kernel-headers.
See my comments on https://bugzilla.mozilla.org/show_bug.cgi?id=435134
I didn't see this ticket before I posted that comment, but you are having all of the symptoms we had.

This is not a problem at the VM OS level, it has to do with the back-end storage between ESX and your NetApp.  When the VM's OS issues a disk read/write out to the [virtual] disk controller, eventually that operation is handed off to the ESX kernel.  If the ESX kernel fails to handle that request in time or sends back garbage data for a long enough period of time, the read/write operation at the virtual OS level fails and generally gets logged as a 'hardware' error in the VM's logs.  In your case, it looks like this is showing up as a journal write failure on your VM's file system.  Your VMs are re-mounting the volume as RO to reduce the risk of corrupting the disk as a result of journal write failures.

If you can identify which of your ESX servers was running the VM at the time of the read-only mount, you can match up the VM's logs against the ESX server's logs.  Both will report an IO failure at the same time ~90% of the time (see vmkernel logs on the ESX servers).  I don't have the iSCSI version of the error message handy, but here is what it looks like with NFS: 

On Aug 24 at 05:17:02, we had a Disk and Symmpi error in the Windows Event log on one of our VM's.  

In /var/log/vmkernel:
Aug 24 05:17:35 xyz-vmsrv2 vmkernel: 13:13:12:38.762 cpu3:1123)NFSLock: 514: Stop accessing fd 0x621a4fc  4	xyz-vmsrv2
Aug 24 05:17:35 xyz-vmsrv2 vmkernel: 13:13:12:38.762 cpu3:1123)NFSLock: 514: Stop accessing fd 0x621c32c  4	xyz-vmsrv2

There is a hack pseudo-workaround for this at the VM-level (on Linux and on Windows), which is to increase the timeout before the OS concludes that the read or write operation has failed.  It will reduce the incidence of the error messages in the logs and prevent the read-only re-mount in most cases, but you're really just covering up the symptom and accepting incredibly poor disk performance.  Make sure that this isn't the fix that you're applying.  If it is, be aware that the term 'fix' should be used loosely.
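
For reference, on a Linux guest that timeout bump is usually made per-device through sysfs; 180 seconds is the value commonly suggested for VMware guests. Shown here only to illustrate the workaround being described (and it does not persist across reboots unless repeated from an init script or udev rule):

# echo 180 > /sys/block/sda/device/timeout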

See my other comment for more info on troubleshooting/possible solutions.
Updated tb-linux-tbox from 2.6.18-53.1.4.el5 to 2.6.18-53.1.19.el5.
Updated fxdbug-linux-tbox to 2.6.18-53.1.21.el5 after it hit a build error ('encountered NUL character...') today.
l10n-linux-tbox went r/o in bug 437176, despite being updated to the 2.6.18-53.1.19 kernel (comment #78). 

First glitch was at 
Jun  2 19:32:40 l10n-linux-tbox kernel: mptscsih: ioc0: attempting task abort! (sc=e1a43800)
according to the system log.
xr-linux-tbox pulled a similar trick; it's using 2.6.18-53.1.14.el5.

Jun  2 19:32:32 xr-linux-tbox kernel: mptscsih: ioc0: attempting task abort! (sc=c2f200c0)
Jun  2 19:32:32 xr-linux-tbox kernel: sd 0:0:0:0: 
Jun  2 19:32:32 xr-linux-tbox kernel:         command: Write(10): 2a 00 00 20 00 4f 00 00 18 00
Jun  2 19:32:32 xr-linux-tbox kernel: mptscsih: ioc0: task abort: SUCCESS (sc=c2f200c0)

Repeated several times, then later
Jun  3 14:54:10 xr-linux-tbox kernel: EXT3-fs error (device sdb1): htree_dirblock_to_tree: bad entry in directory #314816: rec_len is smaller than minimal - offset=0, inode=2628, rec_len=0, name_len=0
Jun  3 14:54:10 xr-linux-tbox kernel: Aborting journal on device sdb1.
Jun  3 14:54:11 xr-linux-tbox kernel: ext3_abort called.
Jun  3 14:54:11 xr-linux-tbox kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Jun  3 14:54:11 xr-linux-tbox kernel: Remounting filesystem read-only
moz2-linux-slave02 had /builds go read-only today.

upgraded kernel from 2.6.18-53.1.14.el5 to 2.6.18-53.1.21.el5 and rebooted.
Do we want to take the latest kernel at install time, or pick a specific version for consistency across the Farm?
I installed the latest.
qm-rhel02 had its drive go r/o this morning, and it already has kernel:  2.6.9-67.0.15.ELsmp. 

Justin suspects this is not a kernel problem, and instead is fallout from the ongoing netapp-c woes in bug#435134.
qm-rhel02 went read only sometime around 11:55, 2008-06-09. Restarted qm-rhel02 and started buildbot masters.
qm-rhel02 is r/o again - appears to have happened not long after the last reboot/master restart.
I believe we have a fix implemented.  Have had no errors on the netapp or esx hosts for the last 1.5 hours and all storage migrations are working.  Would like to monitor overnight to ensure we truly have a fix.  Let's bring things back up and watch closely for issues.

If this is a fix, we may have to do some minor cleanup, but nothing urgent and can be scheduled.
Noted that staging slave "moz2-linux-slave04" has a newer kernel than all production slaves "moz2-linux-slave*". Upgraded moz2-linux-slave06 to see if that explains the http proxy problem we are hitting in bug#430200.

$ su -
# yum install yum-utils
# yumdownloader kernel-2.6.18-92.1.10.el5
# yumdownloader kernel-devel-2.6.18-92.1.10.el5
# yumdownloader kernel-headers-2.6.18-92.1.10.el5
# rpm -ivh kernel-2.6.18-92.1.10.el5.i686.rpm 
# rpm -ivh kernel-devel-2.6.18-92.1.10.el5.i686.rpm 
# rpm -ivh kernel-headers-2.6.18-92.1.10.el5.i386.rpm 
...and to confirm success, did:
# yum list installed | grep kernel
kernel.i686                              2.6.18-92.1.10.el5     installed       
kernel.i686                              2.6.18-53.1.19.el5     installed       
kernel-devel.i686                        2.6.18-92.1.10.el5     installed       
kernel-devel.i686                        2.6.18-53.1.19.el5     installed       
kernel-headers.i386                      2.6.18-53.1.19.el5     installed  

reboot VM, install new VMtools on VI client, and then restart the buildbot slave running on this machine.
Whiteboard: needs scheduled downtime
Updating summary, as we haven't hit r/o drive problems since the ESX host problems in bug#435134 were fixed.
Severity: major → normal
Depends on: 435134
Priority: P2 → P3
Summary: update linux VM kernel to fix intermittent "drive mounted ro" problem → update linux VM kernels to same new stable version
I found this bug while looking for another one... it looks like this is all done. All of the moz2-* and try-* linux VMs are on 2.6.18-53.1.19.el5. Going to close this as FIXED...
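
For anyone double-checking that later, a loop over the slave list is enough (hosts.txt here is a hypothetical file of hostnames, not something tracked in this bug):

$ for h in $(cat hosts.txt); do echo -n "$h: "; ssh -n "$h" uname -r; done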
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering