Closed Bug 779487 Opened 12 years ago Closed 12 years ago

possible hardware issues with bld-centos6-hp-*.build.scl1.mozilla.com

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: arich)

Details

(Whiteboard: HP 4642125689 [reit-hp])

Not sure if this should go to dcops or the sres, so putting in the general queue for proper reassignment.

bld-centos6-hp-003 is refusing to boot from its disk saying that the drives were imaged on a newer firmware revision (uh, whut?).  Something sounds distinctly wonky with the hardware.  Just in case, I tried reinstalling with kickstart.  That appeared to work, but when the machine rebooted, it tried to boot off the network again giving the same error about the drive not being ready and imaged with the wrong version of the firmware.
Component: Server Operations → Server Operations: DCOps
QA Contact: jdow → dmoore
colo-trip: --- → scl1
bld-centos6-hp-001 now has the same problem.
Summary: possible hardware issues with bld-centos6-hp-003.build.scl1.mozilla.com → possible hardware issues with bld-centos6-hp-001.build.scl1.mozilla.com and bld-centos6-hp-003.build.scl1.mozilla.com
catlee is current user of both of these slaves (per slavealloc), keeping him in loop.
Whiteboard: [reit]
Whiteboard: [reit] → [reit-hp]
Depends on: b-linux64-hp-0023
Summary: possible hardware issues with bld-centos6-hp-001.build.scl1.mozilla.com and bld-centos6-hp-003.build.scl1.mozilla.com → possible hardware issues with bld-centos6-hp-{001..004}.build.scl1.mozilla.com
This has affected staging slaves only so far. We've been doing a lot of work on these in preparation for moving some jobs over to the production slaves. The failures on these staging slaves seem correlated with the increased load we've put on them during this development process.
If there are HP tools that should be installed here, I can do so in bug 741249.  I'm not seeing much that's helpful, though.
the same issue with bld-centos6-hp-013
Summary: possible hardware issues with bld-centos6-hp-{001..004}.build.scl1.mozilla.com → possible hardware issues with bld-centos6-hp-*.build.scl1.mozilla.com
dcops: this looks to be affecting every machine releng uses for staging.  The staging runs are prep to do the same thing in production.  So, when you're back in town, this should be a pretty high priority, especially since we may have to RMA drives or something like that.
In the iLO web interface's Information > Integrated Management Log, there is 
  POST Error: 1785-Drive Array not Configured
which appears to be a reasonable way to confirm this issue (assuming the timestamp is fresh).

bld-centos6-hp-007 is also in that state.
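A minimal sketch of the check described above, for anyone who wants to script it. It assumes the Integrated Management Log has been exported to text with one entry per line in the form "MM/DD/YYYY HH:MM<TAB>description"; that layout, the function name, and the one-day freshness window are assumptions for illustration, not the actual iLO export format.

```python
from datetime import datetime, timedelta

# Assumed (hypothetical) export layout: "MM/DD/YYYY HH:MM<TAB>description".
# The real iLO log layout may differ; adjust the parsing to match.
MARKER = "POST Error: 1785-Drive Array not Configured"


def has_fresh_1785(iml_lines, now, max_age=timedelta(days=1)):
    """Return True if an entry no older than max_age contains the 1785 POST error."""
    for line in iml_lines:
        stamp, _, description = line.partition("\t")
        if MARKER not in description:
            continue
        try:
            when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M")
        except ValueError:
            continue  # skip lines that do not match the assumed layout
        if now - when <= max_age:
            return True
    return False
```

Passing `now` explicitly keeps the "fresh timestamp" condition testable; in practice it would be `datetime.now()`.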
bld-centos6-hp-012 just ran into this issue as well
I think it's safe to say that any HP we put load on will fail within a few days.  Unless that's production load, there's probably not much sense in continuing to take out systems until we get to the bottom of this.
These hosts are showing no raid configuration during POST. Checked the raid bios and it shows that the logical drive is missing. We're recreating the logical raid 0 drive. This appears to be some kind of firmware issue.
Please let us know if the contents of the drive have been lost in this process.
After recreating the logical raid drives, all 6 hosts have booted up to the log in screen. DCOps does not have the credentials to log in to verify the contents.
Hal, drive contents were not lost.

Can you (releng) put load back on these six hosts?  If they stay up, we should perform the same maintenance on the remaining hosts.
I just installed hpacucli on bld-centos6-hp-001, and it doesn't find a controller.  Yet hitting F8 after the BIOS startup shows a HP Smart Array B110i SATA RAID Cont in slot 0.

I looked on bld-centos6-hp-015, which wasn't repaired earlier in this bug (and hasn't failed yet).  Its RAID config looks the same - a single logical drive, 232.9GB.  No license keys are installed.  Option ROM Config version 8.20.60.00.

So, I can't square the obvious presence of a RAID controller, which actually recommends using HPACU online, with the tool not finding any controllers:

----
[root@bld-centos6-hp-001 ~]# hpacucli 
HP Array Configuration Utility CLI 8.60-8.0
Detecting Controllers...Done.
Type "help" for a list of supported commands.
Type "exit" to close the console.

=> controller all show

Error: No controllers detected.

=> controller slot=0 show

Error: The controller identified by "slot=0" was not detected.

----
(same on bld-centos6-hp-015)

So, a bit of a mystery.  Let's get these loaded back up and see what happens.
According to Van in bug 782640:

"I've ran across this issue before while working at Yahoo. It turned out to be a
firmware issue with the raid controller and drives when there was a lot of
load. The hardware we used, although not the same are related (HP raid
controller, HP branded drives). We had to manually update the firmware on both
drives and controller."

I went looking through HP's firmware for the DL120, and I didn't see anything obvious about this.  Van, do you happen to remember the HP bug id, working firmware page, anything?
for those following along at home, these are DL120G7's (E3-1220, 8GB RAM)
I'm going to move this over to the SRE queue since they interface with HP over hardware issues.

SRE folks, to sum up: a number of CentOS 6.2 machines that had been running fine for months have suddenly lost their array configurations and will no longer boot off of disk.  This has happened to about 10 machines already.  Going into the BIOS/RAID config tool and recreating the RAID (it's only one disk, so RAID 0) allows the machine to boot back up again without issue.  We don't have a way to reproduce this issue on demand, but we do have one machine that's down because of it right now (bld-centos6-hp-015.build.scl1.mozilla.com).

We checked out HP's site for obvious firmware patches, but didn't see anything that looks like it talks about this specific case.  There were some OS level packages for linux that talked about RAID corruption, but that doesn't seem to fit the bill here.

I've left bld-centos6-hp-015.build.scl1.mozilla.com booted up in the RAID configuration screen for further troubleshooting.

Thanks in advance for the help.
Component: Server Operations: DCOps → Server Operations
QA Contact: dmoore → jdow
At first glance, I didn't see anything obvious to cause this.
My suspicion is that there is buggy BIOS or some firmware.
I will run the Smart Update Firmware DVD on the node that's down.

The Smart Update Firmware DVD delivers a collection of firmware for your ProLiant servers and options. Update your ProLiant firmware using one of the following methods: HP Smart Update Manager, ROMPaq (iLO only), or Online ROM flash components.
Assignee: server-ops → dgherman
bld-centos6-hp-015.build.scl1.mozilla.com had only two upgrades: iLO and BIOS.
Fixing the array now and please let me know if the problems persist.
Dumitru: how invasive is it to install this on the other machines?  Does it require a reboot, downtime, etc?  I'm not sure it's going to be a fix, but I'd like a sample size of a few machines, at least.
Yeah, it does require a reboot. You need to mount the DVD .iso from your local computer (maybe there's a better way to do it for you remoties, but from MV office this is the fastest way for me). Then let it automatically scan the system and apply all the updates.
The download link for the DVD is:

http://tinyurl.com/8joxqyr

Make sure to get the latest version. As of today, this is 10.10 (4 Jun 2012).
To clear up some misconception here, we do not know if the patches fixed the issue. I haven't seen the issue on *any* more machines since we patched this one.  Are all of the HPs under load right now, and has load been added back to bld-centos6-hp-015?
To summarize:
* These systems have been running for months with no discernible issues.

* The presentation of this issue was that the system in question lost its RAID configuration between reboots.  

* As far as I know, we are unable to reproduce this error at will nor do we understand what causes it.

* There was some speculation that this might be a firmware issue, but there is nothing that we've found via HP that mentions this error mode.

* The following machines lost their raid config and had it rebuilt by relops or dcops:

bld-centos6-hp-001 
bld-centos6-hp-002 
bld-centos6-hp-003 
bld-centos6-hp-004 
bld-centos6-hp-007
bld-centos6-hp-012
bld-centos6-hp-013 

* This machine was patched with the latest general firmware/ilo patches and had its raid rebuilt:

bld-centos6-hp-015

* The other HP machines have not presented with issues so far.

* None of the HP machines have presented with issues since bld-centos6-hp-015 was patched.

So at this point, we have three classes of HP machines running: those that have never had a problem, seven that have had an issue and had the RAID config rebuilt, and one that has been patched with the general iLO and firmware patches and had its RAID config rebuilt.

* We are not sure if these patches will solve the problem.
* We also want to make sure that these patches did not impact the building process (they shouldn't have, but we want to be sure).
* The machine that we patched has not been put back into service yet (bhearsum is working on that now)

We are more than happy to add the patch to more servers but can not offer any proof that this is going to solve the problem.
Well, 015 has been offline for I don't know how long, but I've got it back in the production pool now and it should be relatively loaded from here on in. I'm not 100% sure if that helps us figure out what's next here. Please let me know if there's anything else I can do to help out.
bld-centos6-hp-016 is also showing this issue now.  I'm going to fix the raid config, boot it back up, then patch the ilo and firmware.  I'll re-enable it after I'm done.
I've patched bld-centos6-hp-016 by using the virtual media option and booting off of the firmware ISO on the kickstart server: http://10.12.75.25/FW1010.2012_0530.49.iso

I've now put the slave back into rotation.
Plan of record: releng is fine with applying these as they occur (we don't yet know if they are a fix)
I've patched bld-centos6-hp-018 by using the virtual media option and booting off of the firmware ISO on the kickstart server: http://10.12.75.25/FW1010.2012_0530.49.iso

The slave is back in rotation, and I've updated slavealloc with a comment about the firmware being patched.
Assignee: dgherman → arich
Component: Server Operations → Server Operations: RelEng
QA Contact: jdow → arich
colo-trip: scl1 → ---
I preemptively told the first 10 machines to do their next boot off of the firmware CS and patch.  007 and 008 are still listed at ilo 1.26, despite trying to get them to patch several times.

dumitru, could you take a look at those two?  I've left them disabled, so feel free to do whatever is necessary to beat them into submission.

Thanks!
Manually resetting the ilo on 7 and 8 seems to have "fixed" the firmware update failure.
At this point, all of the bld-centos6-hp machines have had their firmware upgraded except bld-centos6-hp-034 which is waiting on an ilo reset.

The following machines still need to have their firmware patched (and may, unlike the bld machines, require downtimes):

talos-w8-hp-001.releng.ad.mozilla.com
talos-w8-hp-002.releng.ad.mozilla.com
talos-w8-hp-003.releng.ad.mozilla.com

foopy25.build.scl1.mozilla.com
foopy26.build.mtv1.mozilla.com
foopy27.build.mtv1.mozilla.com
foopy28.build.mtv1.mozilla.com
foopy29.build.mtv1.mozilla.com
foopy30.build.mtv1.mozilla.com
foopy31.build.mtv1.mozilla.com
foopy32.build.mtv1.mozilla.com
Per my conversation with Armen about the state of the w8 hps, all of the HP machines have had their firmware patched now except those that were originally repurposed as foopies:

foopy25.build.scl1.mozilla.com
foopy26.build.mtv1.mozilla.com
foopy27.build.mtv1.mozilla.com
foopy28.build.mtv1.mozilla.com
foopy29.build.mtv1.mozilla.com
foopy30.build.mtv1.mozilla.com
foopy31.build.mtv1.mozilla.com
foopy32.build.mtv1.mozilla.com

Hal/Callek, do we need a downtime to patch those, or are they not yet in production?
(In reply to Amy Rich [:arich] [:arr] from comment #34)
> Per my conversation with Armen about the state of the w8 hps, all of the HP
> machines have had their firmware patched now except those that were
> originally repurposed as foopies:
> 
> foopy25.build.scl1.mozilla.com
> foopy26.build.mtv1.mozilla.com
> foopy27.build.mtv1.mozilla.com
> foopy28.build.mtv1.mozilla.com
> foopy29.build.mtv1.mozilla.com
> foopy30.build.mtv1.mozilla.com
> foopy31.build.mtv1.mozilla.com
> foopy32.build.mtv1.mozilla.com
> 
> Hal/Callek, do we need a downtime to patch those, or are they not yet in
> production?

They are in production. They do not need a tree-closing downtime, but releng does need at least 24-48 hours to coordinate our end of the "downtime" for this.
callek: okay, when do you want to schedule these?  We can easily do them one at a time.
Since you're both waiting for the other to make a move:

> foopy25.build.scl1.mozilla.com
> foopy26.build.mtv1.mozilla.com
> foopy27.build.mtv1.mozilla.com
Friday at noon pacific

> foopy28.build.mtv1.mozilla.com
> foopy29.build.mtv1.mozilla.com
> foopy30.build.mtv1.mozilla.com
Monday at noon pacific

> foopy31.build.mtv1.mozilla.com
> foopy32.build.mtv1.mozilla.com
Tuesday at noon pacific
update per IRC: let's do them all Friday at noon pacific.
update per IRC:

(note out of order)
> foopy27.build.mtv1.mozilla.com
> foopy28.build.mtv1.mozilla.com
> foopy29.build.mtv1.mozilla.com
> foopy30.build.mtv1.mozilla.com
> foopy31.build.mtv1.mozilla.com
> foopy32.build.mtv1.mozilla.com
Friday at noon pacific

> foopy25.build.scl1.mozilla.com
> foopy26.build.mtv1.mozilla.com
Monday at noon pacific
(In reply to Dustin J. Mitchell [:dustin] from comment #39)
> > foopy25.build.scl1.mozilla.com
> > foopy26.build.mtv1.mozilla.com
> Monday at noon pacific

Joel/Clint, FYI we'll temporarily lose these foopies at ~ that time, do you want/need me to handle prepping for this on monday, or will you?
Well, so much for this patch fixing the issue.  bld-centos6-hp-004 got both the ROM and ilo patch, and still lost its RAID configuration yesterday.  Going to pass this back to dumitru for further investigation directly with HP.  Dumitru, I've pulled this machine out of production, so do with it what you will.
Assignee: arich → dgherman
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → jdow
bld-centos6-hp-013 is also now in the same state.  Patched, still lost its RAID config.
Severity: normal → major
The following have been patched:

foopy27.build.mtv1.mozilla.com
foopy28.build.mtv1.mozilla.com
foopy29.build.mtv1.mozilla.com
foopy30.build.mtv1.mozilla.com
foopy31.build.mtv1.mozilla.com
foopy32.build.mtv1.mozilla.com

That leaves these for monday:

foopy25.build.scl1.mozilla.com
foopy26.build.mtv1.mozilla.com
(In reply to Amy Rich [:arich] [:arr] from comment #41)
bld-centos6-hp-001 appears to have gotten back into this state too.
Dumitru: bumping this up to critical since it's taking out production machines and we don't have a fix.
Severity: major → critical
Whiteboard: [reit-hp] → HP 4642125689
Whiteboard: HP 4642125689 → HP 4642125689 [reit-hp]
Nagios is saying bld-centos6-hp-017 and 018 are down. Is that from work here, or have they fallen over? 018 is at a BIOS screen; 017 isn't showing me any video in the remote console.
Hello Dumitru,

 My name is Gill from the ProLiant L2 team, and I have taken ownership of your DL120 G7 issue with losing the disc drive configuration.

 I was looking at the reports you sent us and I would like to share a few things with you.

 At first glance, I looked at the firmware versions for the system, the disc controller, and the disc drive.

 I am not sure which OS you are using on this system. From the report you sent us it seems to be CentOS 6, but I would need confirmation.

 I looked at the disc drive firmware and see HPG0. I don't know if all your DL120 G7 discs are the same, but if they are, it might explain why you are getting this issue.
 
 [ Physical Drive 1I:1:1 ]
 Physical Drive Status
 SCSI Bus                   0  (0x00)
 SCSI ID                    0  (0x00)
 Block Size                 512 Bytes Per Block  (0x0200)
 Total Blocks               250 GB  (0x1d1c5970)
 Reserved Blocks            0x00010000
 Drive Model                ATA VB0250EAVER    <<<=======
 Drive Serial Number        Z2A3TVS4
 Drive Firmware Revision    HPG0               <<<=======
 SCSI Inquiry Bits          0x00

 The latest Firmware for those discs is HPG7, as you can see in the link below:
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=329290&prodSeriesId=3690351&swItem=MTX-703620254a70407c96d79c5faa&prodNameId=4134173&swEnvOID=4103&swLang=13&taskId=135&mode=5

You can download it from:
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=4134173&prodTypeId=329290&prodSeriesId=3690351&swLang=13&taskId=135&swEnvOID=54

 It contains the following fixes:

Version: HPG7 (B)  (4 Sep 2012) 
      Fixes 
  Upgrade Requirement:
 
Online firmware flashing of drives attached to an HP Smart Array controller running in Zero Memory (ZM) mode or an HP ProLiant host bus adapter (HBA) is NOT supported. Only offline firmware flashing of drives is supported for these configurations. 
--------------------------------------------------------------------------------
Problems Fixed

Resolved an issue where the drive was not recognized when the power was turned on.  <<===
Improved drive performance when booting at cold temperatures. 
Properly set the drive write cache to off as the default power-up setting. 
 
 So, I would suggest updating the disc firmware to HPG7 to see if it helps.

 Please let me know if you have any questions.

Thanks,
Gilles Lucier
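For checking many hosts, the comparison in the email above (drive at HPG0, latest HPG7) could be scripted roughly as follows. This is only a sketch: it assumes each host's report is available as plain text with "Drive Model" and "Drive Firmware Revision" lines like the excerpt quoted in the email, and the function names are hypothetical.

```python
def drive_firmware(report_text):
    """Collect (model, firmware_revision) pairs from a drive report.

    Assumes "Drive Model ..." and "Drive Firmware Revision ..." lines as in
    the excerpt quoted in the email; other report layouts are not handled.
    """
    drives = []
    model = None
    for line in report_text.splitlines():
        text = line.split("<<")[0].strip()  # drop any "<<<====" emphasis arrows
        if text.startswith("Drive Model"):
            model = text[len("Drive Model"):].strip()
        elif text.startswith("Drive Firmware Revision"):
            revision = text[len("Drive Firmware Revision"):].strip()
            drives.append((model, revision))
            model = None
    return drives


def needs_update(report_text, latest="HPG7"):
    """Return the drives whose firmware revision is not the latest one named by HP."""
    return [(m, r) for m, r in drive_firmware(report_text) if r != latest]
```

The default `latest="HPG7"` comes from the email; it would need to change if HP publishes a newer revision.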
I spent most of the day figuring out an automated way to perform those firmware upgrades on the drives, but looks like the only choice we have is to use the USB bootable image.
We cannot use other methods because:
a) the RAID controller is one with zero memory, thus online upgrades are not possible (running the upgrade software from the OS)
b) offline upgrades using HP Service Pack for Proliant DVD are not possible for this controller. Even if the latest SPP contains this upgrade, when selected it says to use the USB key method.

When I left the office, I left the HP USB Key Creator Utility running to finish creating this bootable image on a USB drive.
I'll test it tomorrow with DCOps help.
Any luck with the patching?
The first USB I created didn't work at all, so I made a new one.
This one worked, and the system booted with it, but unfortunately the upgrade is not seen by the HP SUM. I exhausted all my Google-fu and troubleshooting, and emailed back our HP level 2 engineer for support. Waiting on his advice.
Hi Dumitru,

 I do see what you mean. I will need to research on this and probably will need to try to reproduce it in our lab.

Thanks,
Gilles Lucier
CPRQ GCC ISS/SW Engineer
Hello Dumitru,

 I want to give you an update on this. I was able to reproduce the issue in our lab but I do not have a solution for you, yet.
 I will work on this and will get back to you when I'll have new info. 

Thanks,
Gilles Lucier
CPRQ GCC ISS/SW Engineer
Hello Dumitru,

 I thought I had a solution but, in the end, it turned out that it doesn't work. I am thinking about something else; I will need to verify whether it could work and I will let you know. 
 I was out of the office for a few days last week and I apologize for the delay, but I should be able to get back to you on this by the end of this week. 

Regards,
Gilles Lucier
I upgraded the HDD firmware on bld-centos6-hp-001 and bld-centos6-hp-015 and the method[1] works.
Unfortunately we can't automate this, thus please send me a list of hosts that can be taken offline for 10 minutes for this upgrade.

We can also do this via IRC, just ping me.

[1]https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=29329401
Unfortunately, we'll need to coordinate down times. The HPs deployed as foopies need approx 2 hours to bring down cleanly, and we can only have a small number per batch, due to impact on test pool.

The HPs deployed as mock builders can likely be batched, but need about the same time to be taken offline cleanly.

Add all that together, and we're likely going to do 2 batches a day max. We'll work out rest of the details on IRC, and log progress here.
Alright, let me know on IRC when you guys are ready to do these, thanks!
Status: NEW → ASSIGNED
bld-centos6-hp-013 seems to be down.  Now disabled in slavealloc as well.  If you can get to it, fire away.
(In reply to Aki Sasaki [:aki] from comment #57)
> bld-centos6-hp-013 seems to be down.  Now disabled in slavealloc as well. 
> If you can get to it, fire away.

Fixed this one, too.
bld-centos6-hp-002 fixed
(requested on IRC by jhopkins and hwine)
bld-centos6-hp-012 is down, which makes it a candidate, but bld-centos6-hp-013 is back down again post-fix.
I may know why bld-centos6-hp-013 is sad, so I re-flashed it.
bld-centos6-hp-012 flashed with the new firmware.
013 and 001 are still experiencing the same issue even with the new HDD firmware.
I updated HP.
New logs sent to HP, waiting on their input now...
Hi Dumitru,

 I was looking at this and I found an issue with the drive not being recognized when the serial port is enabled in BIOS. This was seen on a different server, but one in the same family and with the same B110i array controller. I cannot tell if this is related or not, but if you reboot your server, you can check the BIOS setting for the serial port configuration to see if it is enabled or disabled. This is a quick test that can be done. 

 Also, there is a new BIOS that should come out this weekend for the DL120G7. It talks about the serial console but, again, I do not know if this could help or not since I do not see any mention of the B110i controller in this new BIOS. 

 However, looking at this other issue reported about the serial port, it could make me think there might be a relation between both.

 You can try to look at the serial port config and toggle it, if it doesn't affect your operations, and we will see if it changes something.

Thanks,
Gilles Lucier
CPRQ GCC ISS/SW Engineer 

--

I disabled serial ports on bld-centos6-hp-013.
This alert should not page oncall. 

<nagios-releng> Thu 15:50:08 PDT [459] bld-centos6-hp-018.build.scl1.mozilla.com:disk - / is CRITICAL: Timeout while attempting connection
disregard comment 65 - wrong bug
I've repaired bld-centos6-hp-016, 018 and 019 to see how fast they fall over again. If it's not too quick I think it would be worth picking up machines as they fall, while HP are working on a permafix.
Hi Dumitru,

 Thanks for the reports and answers to my questions. One other thing I forgot to ask you because I found something concerning the B110i array controller. It seems that those are using the driver to load the "firmware", as you can see in the following advisory:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02862729&lang=en&cc=us&taskId=101&prodSeriesId=5075933&prodTypeId=15351 

You will see: "This occurs because the B110i firmware runs from the operating system driver instead of the Option ROM. " 
  This is completely different from the other array controllers.

 The reports you sent me do not contain any driver version info, and we do not "officially" support driver installation on CentOS. Since CentOS is quite similar to RHEL, the only recommendation I can give you is to make sure you have the latest hpahcisr driver version for the RHEL release equivalent to your CentOS, which is probably RHEL 6. Just pay attention to the Update#.

 Maybe you can verify the version of hpahcisr driver you have and see if it is the latest available, as described on the following site:
 http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=15351&prodSeriesId=5075933&swItem=MTX-02e812787510460e8422d5bb65&prodNameId=5075937&swEnvOID=4103&swLang=8&taskId=135&mode=5
 Make sure you read the Installation instructions tab to use the appropriate version of the driver.

 Please let me know what HPAHCISR driver version you are using.

Since you are on the latest disc drive firmware, the controller FW needs to be examined to make sure we are on the latest also. BTW, did you try to install the latest BIOS on one of the DL120s? It is 2012.08.10, which came out last Friday. We may want to try it if we already have the latest array controller driver installed.
 
 Do we have an equivalent to the RHEL "sosreport", on CentOS? If so, can you send it to me, please?

Thanks,
Gilles Lucier
Hi Gilles,
Looks like CentOS has sosreport integrated, attached the file.
I found something interesting:

[root@bld-centos6-hp-001 ~]# lspci | grep -i raid
00:1f.2 RAID bus controller: Intel Corporation 6 Series/C200 Series Chipset Family SATA RAID Controller (rev 05)

Apparently, the OS thinks this is an Intel controller!
I have found several blog posts online about CentOS and not seeing this controller properly, sometimes even seeing the drives directly (people reporting that they have configured RAID 1 with two drives, but CentOS didn't see the logical drive, but two disks instead).

I am now imaging one of the servers with RHEL 6 to verify how the controller appears, and then I will have more to think about.

Thanks, let me know what your thoughts are.
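To compare how the controller is reported across hosts, the `lspci` line above can be split into its fields. A minimal sketch, assuming the default `lspci` text format of "slot class: description" as shown in the transcript; the function name is hypothetical.

```python
def storage_controllers(lspci_output):
    """Return (slot, device_class, description) for RAID/SATA controller lines.

    Assumes the default lspci layout "slot class: description", as in the
    transcript above.
    """
    found = []
    for line in lspci_output.splitlines():
        slot, _, rest = line.strip().partition(" ")
        device_class, sep, description = rest.partition(": ")
        if not sep:
            continue  # not a "class: description" line
        lowered = device_class.lower()
        if "raid" in lowered or "sata" in lowered:
            found.append((slot, device_class, description))
    return found
```

For the line in the transcript, the description field starts with "Intel Corporation", which is the mismatch noted above.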
Hello Dumitru,

 Like you said, very interesting. I looked at the sosreport you sent and it doesn't contain all the details about the driver, but there is enough to say the driver is the correct one. When I say correct, I mean this is the AHCI driver, but not the one from HP made specially for this B110i controller. This means it doesn't contain all the necessary code to fix the controller issues, which might lead to strange controller behaviours.

 It would be nice if you could install RHEL on a DL120G7 and see how it detects the controller. However, for the best controller performance, I would recommend installing the HP version of the AHCI driver (hpahcisr), because it contains the necessary code to optimize your B110i controller.

 Usually the HP-supplied driver for the equivalent RHEL version, from the HP download site, works pretty well with CentOS, even though, as I said earlier, we do not "officially" support it. So, if it does work correctly on your RHEL system, you may want to try it on one of your CentOS systems that is not critical, for test purposes.

Thanks,
Gilles Lucier
Gilles,
Here are some reports from a host running RHEL 6.

[root@bld-centos6-hp-009 ~]# lsscsi
[0:0:0:0]    disk    ATA      VB0250EAVER      HPG0  /dev/sda
[root@bld-centos6-hp-009 ~]# lspci | grep -i raid
00:1f.2 RAID bus controller: Intel Corporation 6 Series/C200 Series Chipset Family SATA RAID Controller (rev 05)
[root@bld-centos6-hp-009 ~]# cat /etc/issue
Red Hat Enterprise Linux Server release 6.3 (Santiago)
Kernel \r on an \m

Very interesting, right? Instead of seeing the logical volume, it detects the drive itself. This puzzles me.
I am attaching a sosreport from this system, running RHEL 6.

Funny enough:

[root@bld-centos6-hp-009 ~]# rpm -Uvh kmod-hpahcisr-1.2.6-14.rhel6u3.x86_64.rpm
Preparing...                ########################################### [100%]
     #######################################################
     # Hpahcisr is currently not controlling any storage.  #
     # Loading this driver could displace the current      #
     # storage driver causing filesystem corruption.       #
     # Exiting!                                            #
     #######################################################
error: %pre(kmod-hpahcisr-1.2.6-14.rhel6u3.x86_64) scriptlet failed, exit status 1
error:   install: %pre scriptlet failed (2), skipping kmod-hpahcisr-1.2.6-14.rhel6u3

[root@bld-centos6-hp-009 ~]# cat /etc/issue
Red Hat Enterprise Linux Server release 6.3 (Santiago)
Kernel \r on an \m
These machines have been broken for 3mo now.  Who can we escalate to to get this resolved?
Remember that you are using an officially unsupported OS on them.

HP engineer sent me this article to try:
  http://linuximagination.blogspot.com/2011/04/centos-installer-wasnt-detecting-sata.html

His last email that he sent me yesterday was:
"I did some researches but wasn't able to find any answer yet. I will try to look at it tomorow but I might need to reproduce it if I do not find anything documented, which may take more time. 

 It would have been interesting to know if the procedure described in this post would work ... 

Let me check on my side what I can find and I will keep you posted."


I just don't have the time to do it, working on my other quarterly goals currently.
If you can find the human resources to do what that article says, that'd be great.
I've gone a different route since we don't seem to be able to get a fix for this bug.  At the moment, I'm switching from RAID to AHCI support to see if that solves the issue.  I'm tracking the machines that I update in slavealloc.

So far, that's:

bld-centos6-hp-007
bld-centos6-hp-008
bld-centos6-hp-013
bld-centos6-hp-015
bld-centos6-hp-016
bld-centos6-hp-017
bld-centos6-hp-018

Since fixing 7 and 8 yesterday, I haven't seen them go back down yet, so I am cautiously optimistic.

I will also re-kickstart bld-centos6-hp-009 and apply the same change.
(In reply to Amy Rich [:arich] [:arr] from comment #75)
> I've gone a different route since we don't seem to be able to get a fix for
> this bug.  At the moment, I'm switching from RAID to AHCI support to see if
> that solves the issue.  I'm tracking the machines that I update in
> slavealloc.

Just for the record, can you confirm we were running RAID 0 previously? I.e. we were never depending on raid for any disc corruption/recovery, so moving away from raid does not change anything about using, monitoring, or managing this box. (I hope I'm correct in that statement.)
I had the same idea yesterday, and after discussing with various people, they advised to try it.
I enabled it on 009 last night to see if the change breaks anything, and indeed it didn't.
I put it back to RAID because I installed the HP driver on RHEL, but that made the system kernel panic. So 009 can be all yours, since we want to try this on it too.

In the meantime I also emailed Rich and he said he will try and get us more HP resources on the case.
QA Contact: jdow → shyam
(In reply to Hal Wine [:hwine] from comment #76)
> (In reply to Amy Rich [:arich] [:arr] from comment #75)
> > I've gone a different route since we don't seem to be able to get a fix for
> > this bug.  At the moment, I'm switching from RAID to AHCI support to see if
> > that solves the issue.  I'm tracking the machines that I update in
> > slavealloc.
> 
> Just for the record, can you confirm we were running RAID 0 previously? I.e.
> we were never depending on raid for any disc corruption/recovery, so moving
> away from raid does not change anything about using, monitoring, or managing
> this box. (I hope I'm correct in that statement.)

Correct, it was RAID 0. Even with the SATA controller in RAID mode, though, the OS detected it incorrectly and it shows up using the "ahci" driver, go figure.
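One way to confirm which kernel driver is bound to the controller is to read the `driver` symlink that Linux exposes under sysfs for each PCI device. The sketch below assumes the controller sits at PCI address 0000:00:1f.2 (the 00:1f.2 slot from the lspci output earlier in this bug, with PCI domain 0000 assumed); the function name and the `sysfs_root` parameter are hypothetical and exist only to make the check easy to test.

```python
import os


def bound_driver(pci_address="0000:00:1f.2", sysfs_root="/sys"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    Reads the <sysfs_root>/bus/pci/devices/<address>/driver symlink; the
    default address is an assumption taken from the lspci output in this bug.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_address, "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        return None  # device missing, or no driver bound
```

On a host in the state described above, this would be expected to return "ahci" rather than "hpahcisr".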
QA Contact: shyam → jdow
QA Contact: jdow → shyam
At this point, I think I've disabled RAID on all of the slaves and switched them to AHCI.  Interestingly, I found a few of them were set to Legacy SATA instead of RAID or AHCI SATA.

Assuming that this does prove to be the fix, we still need to do this to the handful of machines that have been repurposed as things other than mock slaves.

Here's the list of machines that are done (and marked as such in slavealloc):

bld-centos6-hp-001
bld-centos6-hp-002
bld-centos6-hp-003
bld-centos6-hp-004
bld-centos6-hp-005
bld-centos6-hp-006
bld-centos6-hp-007
bld-centos6-hp-008
bld-centos6-hp-009

bld-centos6-hp-012
bld-centos6-hp-013

bld-centos6-hp-015
bld-centos6-hp-016
bld-centos6-hp-017
bld-centos6-hp-018
bld-centos6-hp-019

bld-centos6-hp-024
bld-centos6-hp-025
bld-centos6-hp-026
bld-centos6-hp-027
bld-centos6-hp-028
bld-centos6-hp-029
bld-centos6-hp-030
bld-centos6-hp-031
bld-centos6-hp-032

bld-centos6-hp-035
Assignee: dgherman → arich
Component: Server Operations → Server Operations: RelEng
QA Contact: shyam → arich
No RAID lossage since the fix last week.  Callek, when do you want to do the ones that were retasked as foopies, and how many do you want to do at once?

foopy25 - foopy37 are the ones that need a quick reboot.
Flags: needinfo?(bugspam.Callek)
per callek on irc yesterday, we're going to shoot for fixing the foopies on Nov 20th.  Specific times forthcoming.
(In reply to Amy Rich [:arich] [:arr] from comment #82)
> per callek on irc yesterday, we're going to shoot for fixing the foopies on
> Nov 20th.  Specific times forthcoming.

Sorry, I meant to comment here on the 15th/16th. Let's shoot for the 20th, with a window for IT from 9am PT to 11am PT.

IT work can begin right at 9, just ping me in IRC, and should complete no later than 11am PT (given my understanding and chat with :arr, that's much more than enough time). I'll be on point for bringing the software on the systems back up properly when IT is done.
Flags: needinfo?(bugspam.Callek)
foopy25 - foopy37 have been modified as well.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations