Closed
Bug 874414
Opened 11 years ago
Closed 11 years ago
Security Assurance ESX server seems to have hardware failure
Categories
(Infrastructure & Operations :: Virtualization, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: decoder, Assigned: dparsons)
References
Details
(Whiteboard: HP Case ID 4644950245)
The ESX server we got in bug 765174 seems to have a disk problem. I first noticed that the second VM on that host had I/O errors and a read-only filesystem, then it hung up completely. The first one was still working but is gone now too. I logged into the ESX server via SSH and saw this in dmesg (tons):

2013-05-21T12:14:28.421Z cpu6:4102)ScsiDeviceIO: 2316: Cmd(0x412400803b00) 0x28, CmdSN 0x955ecd from world 4116 to dev "naa.600508b1001030364238313730300d00" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1.
2013-05-21T12:14:30.317Z cpu2:4098)<4>hpsa 0000:0c:00.0: Device:C2:B0:T0:L0 Command:0x28 CC:04/3e/01 has hardware error.

Can you check what's going on there? :) Thanks!
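For anyone reading the ScsiDeviceIO line in the description, here is a minimal Python sketch that pulls the status fields out of such a failure line and names them. The lookup tables are an assumption of this sketch: they cover only the codes seen in this bug (opcode 0x28, device status 0x2, sense key 0x4, ASC/ASCQ 0x3e/0x1), with names taken from the SCSI standards. `decode_scsi_failure` is a hypothetical helper, not part of any VMware tool.

```python
import re

# Partial lookup tables: only the codes that appear in this bug's log.
# A real decoder would need the full tables from the SCSI standards.
SCSI_OPCODES = {0x28: "READ(10)"}
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION"}
SENSE_KEYS = {0x4: "HARDWARE ERROR"}
ASC_ASCQ = {(0x3E, 0x01): "LOGICAL UNIT FAILURE"}

# Matches the "failed H:.. D:.. P:.. Valid sense data: .." form of the line.
LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?'
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-f]+) D:(?P<device>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+) '
    r'Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)'
)


def _name(table, code):
    """Look up a code, falling back to the raw hex value if it is not in the table."""
    return table.get(code, "unknown (%s)" % (code,))


def decode_scsi_failure(line):
    """Return a dict describing a ScsiDeviceIO failure line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    op = int(m["op"], 16)
    asc = int(m["asc"], 16)
    ascq = int(m["ascq"], 16)
    return {
        "device": m["dev"],
        "command": _name(SCSI_OPCODES, op),
        "host_status": int(m["host"], 16),      # left as a raw number
        "plugin_status": int(m["plugin"], 16),  # left as a raw number
        "device_status": _name(DEVICE_STATUS, int(m["device"], 16)),
        "sense_key": _name(SENSE_KEYS, int(m["key"], 16)),
        "additional_sense": _name(ASC_ASCQ, (asc, ascq)),
    }


SAMPLE = (
    '2013-05-21T12:14:28.421Z cpu6:4102)ScsiDeviceIO: 2316: Cmd(0x412400803b00) 0x28, '
    'CmdSN 0x955ecd from world 4116 to dev "naa.600508b1001030364238313730300d00" '
    'failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1.'
)

if __name__ == "__main__":
    for field, value in decode_scsi_failure(SAMPLE).items():
        print(field, "=", value)
```

On the sample line this reports a READ(10) that ended in CHECK CONDITION with sense key HARDWARE ERROR, which agrees with the hpsa driver's "has hardware error" message for the same command.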
Assignee
Comment 1•11 years ago
Looks like a hardware problem to me too. Passing over to dcops for eval / drive swap. :decoder, do you have backups of any important data on the drives?
Assignee: server-ops-virtualization → server-ops-dcops
Component: Server Operations: Virtualization → Server Operations: DCOps
QA Contact: dparsons → dmoore
Reporter
Comment 2•11 years ago
(In reply to Dan Parsons [:lerxst] from comment #1)
> :decoder, do you have backups of any important data on the drives?

There is no critical data on any of the drives. It would of course be good if content could be preserved, but they can be set up again from scratch with some effort :)
Comment 3•11 years ago
:dumitru/SRE, any insight on this? Both drives are showing an amber light, which indicates there's an issue, but it's uncommon for both drives to fail at once. I'm also not seeing any issues reported in the iLO IML. :lerxst, I can replace both drives if it's a hardware issue. Should I kick it back to you when done?
Comment 4•11 years ago
09:50 <van> any insight on this 874414
09:51 <van> you're the expert on drives
09:51 <van> and firmware
09:52 <dumitru> so, if that's running ESX, we don't have any online tools to verify the drives
09:52 <dumitru> you need to boot off the Smartstart DVD and run offline diagnosis
09:53 <van> ugh
09:53 <van> :(
09:53 <van> ok
Comment 5•11 years ago
These are HP-branded SAS drives, so this isn't the typical firmware issue we've had with the SSDs. We happen to have a lot of spare 300GB drives, so I've swapped out the failed drives and I'll test the old ones using a spare blade. Punting back to Dan's team.
Updated•11 years ago
Assignee: server-ops-dcops → server-ops-virtualization
Component: Server Operations: DCOps → Server Operations: Virtualization
QA Contact: dmoore → dparsons
Assignee
Comment 6•11 years ago
:van, before I go through reinstalling ESX, can you tell me what errors were reported by the RAID controller? I'd hate to reinstall and find out it's the controller that's bad and not the drives...
Assignee: server-ops-virtualization → dparsons
Comment 7•11 years ago
:lerxst, no errors were reported on the RAID controller through iLO's IML. I pinged :gcox to try to log in and we tried several passwords without any luck.
Assignee
Comment 8•11 years ago
:van, you tried logging into what, ESX? This server is managed by Security Assurance, presumably they changed the password.
Comment 9•11 years ago
I tried, with no expectation of success, earlier today:

<van> is the root pw for vsphere on the sysadmin gpg file/
<van> or better yet, can you log into secfuzzesx.sec.scl3.mozilla.com
<van> and let me know if you're seeing anything?
<gcox> So, that box was never one we administered; lerxst set it up and handed it off blind. So, any passwords are probably long gone, or theirs instead of ours. I'll try a couple, but the better people to ask are the box owners. :)
<van> ok thanks, i tried pinging decoder in irc
<gcox> Yeah, I tried, but I'm useless here, sorry.
Assignee
Comment 10•11 years ago
:decoder, was your group monitoring the hardware health on the system? Is it possible one drive died and no one noticed because RAID kept everything working, and then when the second drive died, finally people noticed? I'm trying to prevent this from happening again.
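One way to catch the scenario described in this comment is to alert as soon as the array is degraded rather than when it is gone. A minimal sketch in Python, assuming a two-drive mirror (the bug does not state the RAID level) and assuming per-drive health flags are already available from some monitoring source. `ArrayState`, `classify_mirror` and `should_alert` are hypothetical names used only for illustration.

```python
from enum import Enum


class ArrayState(Enum):
    OK = "ok"              # every drive healthy
    DEGRADED = "degraded"  # still serving I/O, but redundancy is lost
    FAILED = "failed"      # no healthy drive left


def classify_mirror(drive_ok):
    """Classify a mirrored array from a list of per-drive health flags.

    drive_ok is a list of booleans, one per member drive. How these flags
    are collected is outside this sketch and is an assumption.
    """
    if not drive_ok:
        raise ValueError("array has no member drives")
    healthy = sum(1 for ok in drive_ok if ok)
    if healthy == len(drive_ok):
        return ArrayState.OK
    if healthy >= 1:
        return ArrayState.DEGRADED
    return ArrayState.FAILED


def should_alert(state):
    """Alert on anything other than OK, so a single dead drive is noticed."""
    return state is not ArrayState.OK


if __name__ == "__main__":
    for flags in ([True, True], [True, False], [False, False]):
        state = classify_mirror(flags)
        print(flags, state.value, "alert" if should_alert(state) else "quiet")
```

The point of the design is that DEGRADED triggers an alert even though the VMs keep running, since that is the state in which a failure goes unnoticed.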
Assignee
Comment 11•11 years ago
I booted up the blade and found signs of RAID controller failure:

1726-Slot 0 Drive Array - Array Accelerator Memory Size Change Detected

Seems like possibly bad RAM in the RAID controller. Can dcops please fix/replace as necessary?
Component: Server Operations: Virtualization → Server Operations: DCOps
QA Contact: dparsons → dmoore
Reporter
Comment 12•11 years ago
(In reply to Dan Parsons [:lerxst] from comment #10)
> :decoder, was your group monitoring the hardware health on the system?

I am not aware that we did, no. I also gave van the root password, in case you guys need it.
Comment 13•11 years ago
I have them as well if it comes up again.
Comment 14•11 years ago
:lerxst, thanks. That error did not come up the first time I rebooted the host. It makes a lot more sense that both drives failed at once due to a bad RAID controller. The drive hardware surface scan also finally finished and one of the drives failed the "Scattered Read Test". I have contacted HP to send a tech on site to replace the bad RAID controller, cache memory and hard drive.

[Tuesday, May 21, 2013 5:38 PM] -- Abdul Kader S says: As of now, I am processing the case for onsite replacement of the board, cache memory and hard drive. The case # is 4644950245

I'll kick this bug back to you once the DCOps portion is done.
Whiteboard: HP Case ID 4644950245
Updated•11 years ago
Assignee: dparsons → vle
Updated•11 years ago
colo-trip: --- → scl3
Comment 15•11 years ago
HP came on site to swap out the bad hardware. However, after some tests the tech noticed that one of the capacitors on the system board might be bad. He has ordered additional parts and will swing back tomorrow around 10AM for repairs.
Comment 16•11 years ago
HP came on site to replace the remaining bad hardware. I've opened bug 875117 to update the RAID firmware/BIOS as we're a few revisions behind. Punting ticket back to Dan's team.
Assignee: vle → server-ops-virtualization
Component: Server Operations: DCOps → Server Operations: Virtualization
QA Contact: dmoore → dparsons
Assignee
Updated•11 years ago
Assignee: server-ops-virtualization → dparsons
Assignee
Comment 17•11 years ago
Well, this is weird. I just logged into iLO to reinstall ESX on this system and miraculously, it is already installed. I guess maybe the RAID came back? Anyway, can someone log in and make sure things look good to you?
Assignee
Comment 18•11 years ago
No response for 3 days, so resetting to 'normal'. Can anyone test the box to make sure it's working for you?
Severity: critical → normal
Reporter
Comment 19•11 years ago
Sorry, I am on PTO. I will be testing on Monday :)
Comment 20•11 years ago
Sorry. Decoder had the week off and since he reported the initial problem and is using the box, I wanted him to be the one to say so if it was ok now.
Reporter
Comment 21•11 years ago
Al tried to start our virtual machines but ESX says the license expired. Can you please check that and put the proper license in there? Thanks.
Comment 22•11 years ago
Specifically, it won't let us start any of the virtual machines because it says our evaluation license has expired and we need a license.
Assignee
Comment 24•11 years ago
Apologies for the delay on this. The deal with this server, from the beginning, has been that it is not managed by my team, which means you're on your own for licensing. If you go to vmware.com and make a free account, you can get a license for the free edition of ESXi and put that in. Look over the differences between the free and paid versions first.
Flags: needinfo?(dparsons)
Comment 25•11 years ago
All right. It is just kind of a surprise that it works and then suddenly it doesn't work. I don't think anyone ever told us that we'd need to go out on our own and figure out our own licensing.
Comment 26•11 years ago
License key installed. VMs booted. All yours, Decoder.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → Infrastructure & Operations