874414 - Security Assurance ESX server seems to have hardware failure

Reporter

Description

•

12 years ago

The ESX server we got in bug 765174 seems to have a disk problem. I first noticed that the second VM on that host had I/O errors and a read-only filesystem, then it hung completely. The first one was still working but is gone now too. I logged into the ESX server via SSH and saw this in dmesg (tons of these):

2013-05-21T12:14:28.421Z cpu6:4102)ScsiDeviceIO: 2316: Cmd(0x412400803b00) 0x28, CmdSN 0x955ecd from world 4116 to dev "naa.600508b1001030364238313730300d00" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1.

2013-05-21T12:14:30.317Z cpu2:4098)<4>hpsa 0000:0c:00.0: Device:C2:B0:T0:L0 Command:0x28 CC:04/3e/01 has hardware error.

Can you check what's going on there? :) Thanks!
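For anyone reading the dmesg output above, the following is a minimal sketch of how the ScsiDeviceIO line could be decoded. The function name `decode_scsi_failure` and the lookup tables are illustrative assumptions, not part of any ESX tooling, and the tables cover only a handful of standard SCSI values, including the ones seen in this log.

```python
import re

# Matches the opcode, device name, status triple and sense bytes of an
# ESX "ScsiDeviceIO ... failed" line. Written against the line quoted in
# this bug; other ESX versions may format the line differently.
LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\)\s+(?P<op>0x[0-9a-f]+),.*?'
    r'dev "(?P<dev>[^"]+)" failed\s+'
    r'H:(?P<host>0x[0-9a-f]+)\s+D:(?P<device>0x[0-9a-f]+)\s+'
    r'P:(?P<plugin>0x[0-9a-f]+)\s+'
    r'Valid sense data:\s+(?P<key>0x[0-9a-f]+)\s+'
    r'(?P<asc>0x[0-9a-f]+)\s+(?P<ascq>0x[0-9a-f]+)',
    re.IGNORECASE,
)

# Deliberately partial tables of standard SCSI values.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}
ASC_ASCQ = {(0x3E, 0x01): "LOGICAL UNIT FAILURE"}


def decode_scsi_failure(line):
    """Return a dict describing a failed SCSI command, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None

    def as_int(name):
        return int(m.group(name), 16)

    return {
        "device": m.group("dev"),
        "command": OPCODES.get(as_int("op"), "unknown"),
        "device_status": DEVICE_STATUS.get(as_int("device"), "unknown"),
        "sense_key": SENSE_KEYS.get(as_int("key"), "unknown"),
        "additional_sense": ASC_ASCQ.get(
            (as_int("asc"), as_int("ascq")), "unknown"
        ),
    }


SAMPLE = (
    '2013-05-21T12:14:28.421Z cpu6:4102)ScsiDeviceIO: 2316: '
    'Cmd(0x412400803b00) 0x28, CmdSN 0x955ecd from world 4116 to dev '
    '"naa.600508b1001030364238313730300d00" failed H:0x0 D:0x2 P:0x0 '
    'Valid sense data: 0x4 0x3e 0x1.'
)

if __name__ == "__main__":
    print(decode_scsi_failure(SAMPLE))
```

Read this way, the line says a READ(10) to the logical drive returned CHECK CONDITION with sense key HARDWARE ERROR, which matches the `CC:04/3e/01 has hardware error` message from the hpsa driver.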

Dan Parsons [:lerxst]

Assignee

Comment 1

•

12 years ago

Looks like a hardware problem to me too. Passing over to dcops for eval / drive swap. :decoder, do you have backups of any important data on the drives?

Assignee: server-ops-virtualization → server-ops-dcops

Component: Server Operations: Virtualization → Server Operations: DCOps

QA Contact: dparsons → dmoore

Christian Holler (:decoder)

Reporter

Comment 2

•

12 years ago

(In reply to Dan Parsons [:lerxst] from comment #1)

> :decoder, do you have backups of any important data on the drives?

There is no critical data on any of the drives. It would of course be good if the content could be preserved, but they can be set up again from scratch with some effort :)

Van Le [:van]

Comment 3

•

12 years ago

:dumitru/SRE, any insight on this? Both drives are showing an amber light which indicates there's an issue but it's uncommon for both drives to fail at once. I'm also not seeing any issues reported by iLO IML either. :lerxst, I can replace both drives if it's a hardware issue. Should I kick it back to you when done?

Dumitru Gherman [:dumitru]

Comment 4

•

12 years ago

09:50 <van> any insight on this 874414
09:51 <van> you're the expert on drives
09:51 <van> and firmware
09:52 <dumitru> so, if that's running ESX, we don't have any online tools to verify the drives
09:52 <dumitru> you need to boot off the Smartstart DVD and run offline diagnosis
09:53 <van> ugh
09:53 <van> :(
09:53 <van> ok

Van Le [:van]

Comment 5

•

12 years ago

These are HP-branded SAS drives, so this isn't the typical firmware issue we've had with the SSD drives. We happen to have a lot of spare 300GB drives, so I've swapped out the failed drives and I'll run tests on them using a spare blade. Punting back to Dan's team.

Van Le [:van]

Updated

•

12 years ago

Assignee: server-ops-dcops → server-ops-virtualization

Component: Server Operations: DCOps → Server Operations: Virtualization

QA Contact: dmoore → dparsons

Dan Parsons [:lerxst]

Assignee

Comment 6

•

12 years ago

:van, before I go through reinstalling ESX, can you tell me what errors were reported by the raid controller? I'd hate to reinstall and find out it's the controller that's bad and not the drives...

Assignee: server-ops-virtualization → dparsons

Van Le [:van]

Comment 7

•

12 years ago

:lerxst, no errors were reported on the RAID controller through iLO's IML. I pinged :gcox to try to log in and we tried several passwords without any luck.

Dan Parsons [:lerxst]

Assignee

Comment 8

•

12 years ago

:van, you tried logging into what, ESX? This server is managed by Security Assurance, presumably they changed the password.

Greg Cox [:gcox]

Comment 9

•

12 years ago

I tried, with no expectation of success, earlier today:

<van> is the root pw for vsphere on the sysadmin gpg file?
<van> or better yet, can you log into secfuzzesx.sec.scl3.mozilla.com
<van> and let me know if you're seeing anything?
<gcox> So, that box was never one we administered; lerxst set it up and handed it off blind. So, any passwords are probably long gone, or theirs instead of ours. I'll try a couple, but the better people to ask are the box owners. :)
<van> ok thanks, i tried pinging decoder in irc
<gcox> Yeah, I tried, but I'm useless here, sorry.

Dan Parsons [:lerxst]

Assignee

Comment 10

•

12 years ago

:decoder, was your group monitoring the hardware health on the system? Is it possible one drive died and no one noticed because RAID kept everything working, and then when the second drive died, finally people noticed? I'm trying to prevent this from happening again.
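As an illustration of the kind of check that could catch a first failure early, here is a minimal sketch that counts hpsa "has hardware error" lines per device in saved dmesg text. The function names and the threshold are hypothetical, the pattern is written only against the hpsa line quoted in the description, and it is no substitute for proper controller health monitoring.

```python
import re
from collections import Counter

# Matches lines like:
#   hpsa 0000:0c:00.0: Device:C2:B0:T0:L0 Command:0x28 CC:04/3e/01 has hardware error.
HPSA_RE = re.compile(
    r'hpsa \S+ Device:(?P<dev>\S+) Command:\S+ CC:\S+ has hardware error'
)


def hardware_error_counts(log_text):
    """Count hpsa hardware-error lines per device in the given log text."""
    return Counter(m.group("dev") for m in HPSA_RE.finditer(log_text))


def devices_over_threshold(log_text, threshold=1):
    """Return a sorted list of devices with at least `threshold` errors."""
    counts = hardware_error_counts(log_text)
    return sorted(dev for dev, n in counts.items() if n >= threshold)


SAMPLE_LOG = (
    "2013-05-21T12:14:30.317Z cpu2:4098)<4>hpsa 0000:0c:00.0: "
    "Device:C2:B0:T0:L0 Command:0x28 CC:04/3e/01 has hardware error.\n"
    "2013-05-21T12:14:31.000Z cpu2:4098)<4>hpsa 0000:0c:00.0: "
    "Device:C2:B0:T0:L0 Command:0x28 CC:04/3e/01 has hardware error.\n"
    "an unrelated log line\n"
)

if __name__ == "__main__":
    print(devices_over_threshold(SAMPLE_LOG))
```

The second line of `SAMPLE_LOG` is a made-up repeat of the first, included only so the counter has more than one hit.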

Dan Parsons [:lerxst]

Assignee

Comment 11

•

12 years ago

I booted up the blade and found signs of RAID controller failure:

1726-Slot 0 Drive Array - Array Accelerator Memory Size Change Detected

Seems like possibly bad RAM in the RAID controller. Can dcops please fix/replace as necessary?

Component: Server Operations: Virtualization → Server Operations: DCOps

QA Contact: dparsons → dmoore

Christian Holler (:decoder)

Reporter

Comment 12

•

12 years ago

(In reply to Dan Parsons [:lerxst] from comment #10)

> :decoder, was your group monitoring the hardware health on the system?

I am not aware that we did, no. I also gave van the root password, in case you guys need it.

Al Billings [:abillings - ex-MoCo]

Comment 13

•

12 years ago

I have them as well if it comes up again.

Van Le [:van]

Comment 14

•

12 years ago

:lerxst, thanks. That error did not come up the first time I rebooted the host. It makes a lot more sense that both drives failed at once due to a bad RAID controller. The drive hardware surface scan also finally finished and one of the drives failed the "Scattered Read Test". I have contacted HP to send a tech on site to replace the bad RAID controller, cache memory and hard drive.

[Tuesday, May 21, 2013 5:38 PM] -- Abdul Kader S says: As of now, I am processing the case for onsite replacement of the board, cache memory and hard drive. The case # is 4644950245.

I'll kick this bug back to you once the DCOps portion is done.

Whiteboard: HP Case ID 4644950245

Van Le [:van]

Updated

•

12 years ago

Assignee: dparsons → vle

Van Le [:van]

Updated

•

12 years ago

colo-trip: --- → scl3

Van Le [:van]

Comment 15

•

12 years ago

HP came on site to swap out the bad hardware. However, after some tests the tech noticed that one of the capacitors on the system board might be bad. He has ordered additional parts and will swing back tomorrow around 10AM for repairs.

Van Le [:van]

Updated

•

12 years ago

Depends on: 875117

Van Le [:van]

Comment 16

•

12 years ago

HP came on site to replace remaining bad hardware. I've opened 875117 to update the RAID firmware/BIOS as we're a few revisions behind. Punting ticket back to Dan's team.

Assignee: vle → server-ops-virtualization

Component: Server Operations: DCOps → Server Operations: Virtualization

QA Contact: dmoore → dparsons

Dan Parsons [:lerxst]

Assignee

Updated

•

12 years ago

Assignee: server-ops-virtualization → dparsons

Dan Parsons [:lerxst]

Assignee

Comment 17

•

12 years ago

Well, this is weird. I just logged into iLO to reinstall ESX on this system and miraculously, it is already installed. I guess maybe the RAID came back? Anyway, can someone log in and make sure things look good to you?

Dan Parsons [:lerxst]

Assignee

Comment 18

•

12 years ago

No response for 3 days, so resetting to 'normal'. Can anyone test the box to make sure it's working for you?

Severity: critical → normal

Christian Holler (:decoder)

Reporter

Comment 19

•

12 years ago

Sorry, I am on PTO. I will be testing on Monday :)

Al Billings [:abillings - ex-MoCo]

Comment 20

•

12 years ago

Sorry. Decoder had the week off and since he reported the initial problem and is using the box, I wanted him to be the one to say so if it was ok now.

Christian Holler (:decoder)

Reporter

Comment 21

•

12 years ago

Al tried to start our virtual machines but ESX says the license expired. Can you please check that and put the proper license in there? Thanks.

Al Billings [:abillings - ex-MoCo]

Comment 22

•

12 years ago

Specifically, it won't let us start any of the virtual machines because it says our evaluation license has expired and we need a license.

Al Billings [:abillings - ex-MoCo]

Comment 23

•

12 years ago

Two weeks later. Can we get a license please?

Flags: needinfo?(dparsons)

Dan Parsons [:lerxst]

Assignee

Comment 24

•

12 years ago

Apologies for the delay on this. The deal with this server, from the beginning, is that it is not managed by my team. Which means you're on your own for licensing. If you go to vmware.com and make a free account, you can get a license for the free edition of ESXi and put that in. Look over the differences between free and pay-for versions first.

Flags: needinfo?(dparsons)

Al Billings [:abillings - ex-MoCo]

Comment 25

•

12 years ago

All right. It is just kind of a surprise that it works and then suddenly it doesn't work. I don't think anyone ever told us that we'd need to go out on our own and figure out our own licensing.

Al Billings [:abillings - ex-MoCo]

Comment 26

•

12 years ago

License key installed. VMs booted. All yours, Decoder.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Infrastructure & Operations