Server bugzilla3.community.scl3.mozilla.com disk seems to be failing again

RESOLVED WONTFIX

Status

Product: mozilla.org Graveyard
Component: Server Operations: MOC
Status: RESOLVED WONTFIX
Opened: 4 years ago
Last modified: 3 years ago

People

(Reporter: wicked, Unassigned)

(Reporter)

Description

4 years ago
Server bugzilla3.community.scl3.mozilla.com seems to be complaining about its disk. Its log is full of:

--!--
Oct  8 18:39:20 bugzilla3 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct  8 18:39:20 bugzilla3 kernel: ata1.00: BMDMA stat 0x24
Oct  8 18:39:20 bugzilla3 kernel: ata1.00: failed command: READ DMA
Oct  8 18:39:20 bugzilla3 kernel: ata1.00: cmd c8/00:20:a0:41:74/00:00:00:00:00/e0 tag 30 dma 16384 in
Oct  8 18:39:20 bugzilla3 kernel:         res 51/40:00:bd:41:74/00:00:00:00:00/00 Emask 0x9 (media error)
Oct  8 18:39:20 bugzilla3 kernel: ata1.00: status: { DRDY ERR }
Oct  8 18:39:20 bugzilla3 kernel: ata1.00: error: { UNC }
Oct  8 18:39:20 bugzilla3 kernel: ata1.00: configured for UDMA/133
Oct  8 18:39:20 bugzilla3 kernel: sd 0:0:0:0: [sda] Unhandled sense code
Oct  8 18:39:20 bugzilla3 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct  8 18:39:20 bugzilla3 kernel: sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Oct  8 18:39:20 bugzilla3 kernel: Descriptor sense data with sense descriptors (in hex):
Oct  8 18:39:20 bugzilla3 kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Oct  8 18:39:20 bugzilla3 kernel:        00 74 41 bd
Oct  8 18:39:20 bugzilla3 kernel: sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Oct  8 18:39:20 bugzilla3 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 74 41 a0 00 00 20 00
Oct  8 18:39:20 bugzilla3 kernel: ata1: EH complete
--!--
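
The excerpt above already carries enough detail to locate the failing sector. As a rough sketch (not part of the original report), the Read(10) CDB and the descriptor-format sense data could be decoded along these lines. The helper names `decode_read10` and `sense_information` are hypothetical, the byte offsets assume the standard SCSI Read(10) layout and a single leading information descriptor, and the 512-byte sector size is an assumption that happens to agree with the `dma 16384` figure for a 32-block read.

```python
from typing import Tuple

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_read10(cdb_hex: str) -> Tuple[int, int]:
    """Return (starting LBA, block count) from a SCSI Read(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


def sense_information(sense_hex: str) -> int:
    """Return the information field (here, the failing LBA) from
    descriptor-format sense data whose first descriptor is type 0x00."""
    sense = bytes.fromhex(sense_hex)
    if len(sense) < 20 or sense[0] & 0x7F != 0x72:
        raise ValueError("not descriptor-format sense data")
    if sense[8] != 0x00 or not sense[10] & 0x80:
        raise ValueError("no valid information descriptor at offset 8")
    return int.from_bytes(sense[12:20], "big")  # 8-byte information field


# Hex strings copied from the kernel log above.
start, count = decode_read10("28 00 00 74 41 a0 00 00 20 00")
bad = sense_information(
    "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 00 74 41 bd"
)

print(f"request: {count} blocks from LBA {start:#x} ({count * SECTOR_SIZE} bytes)")
print(f"unreadable sector: LBA {bad:#x} (offset {bad - start} into the request)")
```

With the values from this log, the request starts at LBA 0x7441a0 and the sense data points at LBA 0x7441bd, the same address that appears in the `res` line of the libata output.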

This does seem to indicate that the disk is going to fail again. The last time this happened (see the cloned bug 895755) was about a year ago, so it's a pity these disks don't seem to last more than a year. :(

Can you verify if the disk or something else has failed or is about to fail and replace any needed parts? Same procedure as last time is fine for me. You can also ask justdave if you have any questions (especially any that require "internal MoCO" knowledge).

I already have a recent backup of the server and in fact, luckily, I haven't yet brought it fully back to production after the last incident. :) Therefore, it's fine to bring it down for maintenance at any time. Thank you!
Yes, once you're getting errors like that the disk is usually dying. I can't log into the machine (port 22 just drops connections from scl3 admin hosts, etc) to look any further.

Looks like the warranty expired on 2013-4-12, according to inventory. I can hand this over to dcops to get them to open an RMA for that drive, but it may cost money to replace.

dcops: can you find out from iX if they'll replace the drive or if it needs paying for?

Updated

4 years ago
Assignee: server-ops → server-ops-dcops
Component: Server Operations → Server Operations: DCOps
QA Contact: shyam → dmoore

Comment 2

4 years ago
Hard disk was no longer detected in BIOS. I've replaced it with a new drive.

:pir - Can you kickstart it?

Updated

4 years ago
Assignee: server-ops-dcops → nobody
Component: Server Operations: DCOps → Server Operations: MOC
I set it to network boot but it receives no DHCP offers. I'm assuming we can't do that on the community network. I don't know the processes for dealing with community IT.

:justdave, you've dealt with this machine before, do you know?
Flags: needinfo?(justdave)
Last time we did this we had to have netops move the switch port to the ops vlan and kickstart it there, then move it back after it was kickstarted. Hold off on bothering with it for now, though: we were discussing this on IRC, and since the machines are all flaky and out of warranty, we think we're just going to move them all to VMs, so it's probably not worth trying to load an OS on it again.
Flags: needinfo?(justdave)
VMs seem like a much better idea. Closing this bug, reopen if anything is needed on the host in question.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX
Updated

3 years ago
Product: mozilla.org → mozilla.org Graveyard