Closed Bug 874414 Opened 11 years ago Closed 11 years ago

Security Assurance ESX server seems to have hardware failure

Categories

(Infrastructure & Operations :: Virtualization, task)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: decoder, Assigned: dparsons)

References

Details

(Whiteboard: HP Case ID 4644950245)

The ESX server we got in bug 765174 seems to have a disk problem. I first noticed that the second VM on that host had I/O errors and a read-only filesystem, and then it hung completely. The first one was still working, but it's gone now too. I logged into the ESX server via SSH and saw this in dmesg, repeated constantly:


2013-05-21T12:14:28.421Z cpu6:4102)ScsiDeviceIO: 2316: Cmd(0x412400803b00) 0x28, CmdSN 0x955ecd from world 4116 to dev "naa.600508b1001030364238313730300d00" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1.
2013-05-21T12:14:30.317Z cpu2:4098)<4>hpsa 0000:0c:00.0: Device:C2:B0:T0:L0 Command:0x28 CC:04/3e/01 has hardware error.


Can you check what's going on there? :) Thanks!
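
For reference, those sense bytes point at the device itself rather than the transport: D:0x2 is the SCSI status CHECK CONDITION, sense key 0x4 is HARDWARE ERROR, and ASC/ASCQ 0x3e/0x01 is LOGICAL UNIT FAILURE, which lines up with the hpsa "has hardware error" line. Below is a minimal sketch of decoding a vmkernel line like the one above; the lookup tables only cover the codes seen here.

import re

# Tiny lookup tables -- only the codes that appear in the log above.
SENSE_KEYS = {0x4: "HARDWARE ERROR"}
ASC_ASCQ = {(0x3E, 0x01): "LOGICAL UNIT FAILURE"}

LINE = ('ScsiDeviceIO: 2316: Cmd(0x412400803b00) 0x28, CmdSN 0x955ecd from world 4116 '
        'to dev "naa.600508b1001030364238313730300d00" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x4 0x3e 0x1.')

def decode(line):
    """Pull host/device/plugin status and the sense triplet out of a vmkernel line."""
    m = re.search(r'failed H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+) '
                  r'Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)',
                  line)
    if not m:
        return "no SCSI failure found"
    host, device, plugin, key, asc, ascq = (int(x, 16) for x in m.groups())
    # D:0x2 is the SCSI status byte CHECK CONDITION, i.e. valid sense data follows.
    return (f"H:0x{host:x} D:0x{device:x} P:0x{plugin:x}; "
            f"sense key {SENSE_KEYS.get(key, hex(key))}, "
            f"ASC/ASCQ {ASC_ASCQ.get((asc, ascq), (hex(asc), hex(ascq)))}")

print(decode(LINE))
# -> H:0x0 D:0x2 P:0x0; sense key HARDWARE ERROR, ASC/ASCQ LOGICAL UNIT FAILURE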
Looks like a hardware problem to me too. Passing over to dcops for eval / drive swap.

:decoder, do you have backups of any important data on the drives?
Assignee: server-ops-virtualization → server-ops-dcops
Component: Server Operations: Virtualization → Server Operations: DCOps
QA Contact: dparsons → dmoore
(In reply to Dan Parsons [:lerxst] from comment #1)

> :decoder, do you have backups of any important data on the drives?

There is no critical data on any of the drives. It would of course be good if the content could be preserved, but they can be set up again from scratch with some effort :)
:dumitru/SRE, any insight on this? Both drives are showing an amber light, which indicates there's an issue, but it's uncommon for both drives to fail at once. I'm also not seeing any issues reported in iLO's IML.

:lerxst, I can replace both drives if it's a hardware issue. Should I kick it back to you when done?
09:50 <van> any insight on this 874414
09:51 <van> you're the expert on drives
09:51 <van> and firmware
09:52 <dumitru> so, if that's running ESX, we don't have any online tools to verify the drives
09:52 <dumitru> you need to boot off the Smartstart DVD and run offline diagnosis
09:53 <van> ugh
09:53 <van> :(
09:53 <van> ok
These are HP-branded SAS drives, so this isn't the typical firmware issue we've had with the SSD drives. We happen to have a lot of spare 300GB drives, so I've swapped out the failed drives and will run tests on them using a spare blade. Punting back to Dan's team.
Assignee: server-ops-dcops → server-ops-virtualization
Component: Server Operations: DCOps → Server Operations: Virtualization
QA Contact: dmoore → dparsons
:van, before I go through reinstalling ESX, can you tell me what errors were reported by the RAID controller? I'd hate to reinstall and find out it's the controller that's bad and not the drives...
Assignee: server-ops-virtualization → dparsons
:lerxst, no errors were reported on the RAID controller through iLO's IML. I pinged :gcox to try to log in and we tried several passwords without any luck.
:van, you tried logging into what, ESX? This server is managed by Security Assurance, presumably they changed the password.
I tried, with no expectation of success, earlier today:

<van> is the root pw for vsphere on the sysadmin gpg file/
<van> or better yet, can you log into secfuzzesx.sec.scl3.mozilla.com
<van> and let me know if you're seeing anything?
<gcox> So, that box was never one we administered; lerxst set it up and handed it off blind.  So, any passwords are probably long gone, or theirs instead of ours.  I'll try a couple, but the better people to ask are the box owners. :)
<van> ok thanks, i tried pinging decoder in irc
<gcox> Yeah, I tried, but I'm useless here, sorry.
:decoder, was your group monitoring the hardware health on this system? Is it possible one drive died and no one noticed because RAID kept everything working, and then people only noticed when the second drive died? I'm trying to prevent this from happening again.
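
One way to catch the first failed drive before the second one takes the array down is a periodic check against the Smart Array controller. A minimal sketch follows; it assumes the HP utilities VIB (hpssacli) is installed on the host at the path shown and that key-based SSH access is set up, so treat the command path and the alerting as placeholders to adapt.

import subprocess
import sys

# The host name is the one from this bug; the hpssacli path and the alert
# mechanism (non-zero exit) are assumptions -- adapt to your environment.
HOST = "secfuzzesx.sec.scl3.mozilla.com"
CHECK = "/opt/hp/hpssacli/bin/hpssacli ctrl all show status"

def raid_status_ok(host):
    """Return True if every 'Status:' line reported by the controller ends in OK."""
    result = subprocess.run(["ssh", f"root@{host}", CHECK],
                            capture_output=True, text=True, timeout=60)
    if result.returncode != 0:
        print(f"check failed on {host}: {result.stderr.strip()}", file=sys.stderr)
        return False
    bad = [line.strip() for line in result.stdout.splitlines()
           if "Status:" in line and not line.strip().endswith("OK")]
    for line in bad:
        print(f"{host}: {line}", file=sys.stderr)
    return not bad

if __name__ == "__main__":
    sys.exit(0 if raid_status_ok(HOST) else 1)

Run from cron on a monitoring host and alert on a non-zero exit; that would have flagged a degraded array while the second drive was still healthy.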
I booted up the blade and found signs of RAID controller failure:

1726-Slot 0 Drive Array - Array Accelerator Memory Size Change Detected

Seems like possibly bad RAM in the RAID controller. Can dcops please fix/replace as necessary?
Component: Server Operations: Virtualization → Server Operations: DCOps
QA Contact: dparsons → dmoore
(In reply to Dan Parsons [:lerxst] from comment #10)
> :decoder, was your group monitoring the hardware health on the system?

I am not aware that we did, no. I also gave van the root password, in case you guys need it.
I have them as well if it comes up again.
:lerxst, thanks. That error did not come up the first time I rebooted the host. It makes a lot more sense that both drives failed at once due to a bad RAID controller. 

The drive hardware surface scan also finally finished and one of the drives failed the "Scattered Read Test". I have contacted HP to send a tech on site to replace the bad RAID controller, cache memory and hard drive.  

[Tuesday, May 21, 2013 5:38 PM] -- Abdul Kader S says:
As of now, I am processing the case for onsite replacement of the board, cache memory and hard drive. The case # is 4644950245

I'll kick this bug back to you once the DCOPs portion is done.
Whiteboard: HP Case ID 4644950245
Assignee: dparsons → vle
colo-trip: --- → scl3
HP came on site to swap out the bad hardware. However, after some tests the tech noticed that one of the capacitors on the system board might be bad. He has ordered additional parts and will swing back tomorrow around 10 AM for repairs.
Depends on: 875117
HP came on site to replace the remaining bad hardware. I've opened bug 875117 to update the RAID firmware/BIOS, as we're a few revisions behind. Punting ticket back to Dan's team.
Assignee: vle → server-ops-virtualization
Component: Server Operations: DCOps → Server Operations: Virtualization
QA Contact: dmoore → dparsons
Assignee: server-ops-virtualization → dparsons
Well, this is weird. I just logged into iLO to reinstall ESX on this system and miraculously, it is already installed. I guess maybe the RAID came back? Anyway, can someone log in and make sure things look good to you?
No response for 3 days, so resetting the severity to 'normal'.

Can anyone test the box to make sure it's working for you?
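
If it helps, the basics can be checked from the outside over the vSphere API. A minimal sketch using pyVmomi; the host name is the one from this bug, the credentials are placeholders, and it assumes a standalone host (a single datacenter with VMs directly under vmFolder).

import ssl
from pyVim.connect import SmartConnect, Disconnect

# Host name is from this bug; user and password are placeholders.
HOST, USER, PASSWORD = "secfuzzesx.sec.scl3.mozilla.com", "root", "********"

# The host has a self-signed certificate, so skip verification for this check.
context = ssl._create_unverified_context()
si = SmartConnect(host=HOST, user=USER, pwd=PASSWORD, sslContext=context)
try:
    content = si.RetrieveContent()
    for dc in content.rootFolder.childEntity:
        for ds in dc.datastore:
            s = ds.summary
            print(f"datastore {s.name}: accessible={s.accessible}, "
                  f"free={s.freeSpace // 2**30} GiB")
        for vm in dc.vmFolder.childEntity:
            print(f"vm {vm.name}: power={vm.runtime.powerState}")
finally:
    Disconnect(si)

If the datastores show up as accessible and the VMs power on, the storage side is probably back.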
Severity: critical → normal
Sorry, I am on PTO. I will be testing on Monday :)
Sorry. Decoder had the week off, and since he reported the initial problem and is using the box, I wanted him to be the one to say whether it's OK now.
Al tried to start our virtual machines but ESX says the license expired. Can you please check that and put the proper license in there? Thanks.
Specifically, it won't let us start any of the virtual machines because it says our evaluation license has expired and we need a license.
Two weeks later. Can we get a license please?
Flags: needinfo?(dparsons)
Apologies for the delay on this. The deal with this server, from the beginning, is that it is not managed by my team, which means you're on your own for licensing. If you go to vmware.com and make a free account, you can get a license key for the free edition of ESXi and put that in. Look over the differences between the free and paid versions first.
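
For what it's worth, once the free key has been generated it can be applied through the vSphere Client, or over the API. A minimal sketch using pyVmomi's LicenseManager against the standalone host; the key, user, and password below are placeholders.

import ssl
from pyVim.connect import SmartConnect, Disconnect

# Host name is from this bug; LICENSE_KEY stands in for the free ESXi key
# obtained from vmware.com, and the credentials are placeholders.
HOST, USER, PASSWORD = "secfuzzesx.sec.scl3.mozilla.com", "root", "********"
LICENSE_KEY = "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX"

si = SmartConnect(host=HOST, user=USER, pwd=PASSWORD,
                  sslContext=ssl._create_unverified_context())
try:
    # On a standalone host, UpdateLicense installs and applies the key.
    info = si.RetrieveContent().licenseManager.UpdateLicense(licenseKey=LICENSE_KEY)
    print(f"installed: {info.name} ({info.licenseKey})")
finally:
    Disconnect(si)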
Flags: needinfo?(dparsons)
All right. It's just kind of a surprise that it worked and then suddenly stopped. I don't think anyone ever told us that we'd need to go out and figure out licensing on our own.
License key installed. VMs booted. All yours, Decoder.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations