Closed
Bug 624207
Opened 13 years ago
Closed 13 years ago
linux-ix-slave42 dead - needs to be returned for repairs
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 596366
People
(Reporter: dustin, Unassigned)
Details
(Whiteboard: [buildslaves][hardware][buildduty][subject to embargo])
Zandr, I believe when this had trouble yesterday or Thursday you put a wager on the slave being a dud. I think you were right - I just rebooted it manually via IPMI after Nagios started sending no-PING alerts for it. You can dup this to the appropriate bug. The machine appears to be back in the game right now, but who knows what will happen at its next reboot.
Reporter
Comment 1•13 years ago
Ah, finally the Nagios interface loaded the logs - it's been doing this for a few days now, at least. It's hard to tell whether it's eventually coming back up on its own, or whether releng hands have touched it before each recovery (this lack of permanent record is something border collie should remedy).

I'm going to take this slave out of the build pool for the moment:

[root@linux-ix-slave42 slave]# mv buildbot.tac buildbot.tac.bug624207
[root@linux-ix-slave42 slave]# touch DO_NOT_START

It's added to the buildduty slave-tracking Google doc. I'll also mark it for a 2-day downtime - that can be extended as necessary.
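For anyone repeating this on another slave, the two commands above can be wrapped in a small helper. This is only a sketch: the function name `disable_slave` is hypothetical, and it assumes (as the comment implies, but does not state) that whatever starts buildbot on the slave honors a `DO_NOT_START` marker file in the slave directory.

```python
from pathlib import Path


def disable_slave(slave_dir: str, bug_id: int) -> Path:
    """Park buildbot.tac under a bug-tagged name and drop a DO_NOT_START marker.

    Mirrors the manual steps in the comment above:
        mv buildbot.tac buildbot.tac.bug<bug_id>
        touch DO_NOT_START
    """
    base = Path(slave_dir)
    tac = base / "buildbot.tac"
    parked = base / f"buildbot.tac.bug{bug_id}"

    # Refuse to guess if the slave is already parked or half-configured.
    if not tac.exists():
        raise FileNotFoundError(f"{tac} not found; slave may already be disabled")
    if parked.exists():
        raise FileExistsError(f"{parked} already exists; not overwriting it")

    tac.rename(parked)                  # equivalent of the mv
    (base / "DO_NOT_START").touch()     # equivalent of the touch
    return parked
```

Tagging the parked file with the bug number keeps a pointer back to the reason the slave was pulled, which helps when someone else later finds the slave idle.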
Reporter
Updated•13 years ago
Whiteboard: [buildslaves][hardware][buildduty]
Reporter
Updated•13 years ago
Assignee: zandr → dustin
Comment 2•13 years ago
I'm going to try a reimage and see if that helps. The disks don't obviously show any sign of trouble.
Comment 3•13 years ago
Usual drive data for iX:

[root@linux-ix-slave42 ~]# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
        Model Number:       ST3250318AS
        Serial Number:      9VY97RF7
        Firmware Revision:  CC38
        Transport:          Serial
Standards:
        Supported: 8 7 6 5
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  488397168
        device size with M = 1024*1024:      238475 MBytes
        device size with M = 1000*1000:      250059 MBytes (250 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = ?
        Recommended acoustic management value: 254, current value: 0
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
           *    Automatic Acoustic Management feature set
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    WRITE_{DMA|MULTIPLE}_FUA_EXT
           *    64-bit World wide name
                Write-Read-Verify feature set
           *    WRITE_UNCORRECTABLE command
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
           *    SATA-I signaling speed (1.5Gb/s)
           *    SATA-II signaling speed (3.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Phy event counters
                Device-initiated interface power management
           *    Software settings preservation
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
                supported: enhanced erase
        40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct

[root@linux-ix-slave42 ~]# hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   29332 MB in  1.99 seconds = 14735.98 MB/sec
 Timing buffered disk reads:  340 MB in  3.00 seconds = 113.23 MB/sec
[root@linux-ix-slave42 ~]#
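To compare these numbers across slaves without eyeballing them, the figures can be pulled out of the hdparm output with a short script. This is a sketch, not part of any existing releng tooling: the names `parse_timings` and `size_mbytes` are hypothetical, and the size check assumes 512-byte logical sectors, which the output above does not state explicitly.

```python
import re

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

# Matches the two "Timing ..." lines printed by `hdparm -tT`.
TIMING_RE = re.compile(
    r"Timing (cached reads|buffered disk reads):\s+(\d+) MB in\s+([\d.]+) seconds"
    r"\s+=\s+([\d.]+) MB/sec"
)


def parse_timings(text: str) -> dict:
    """Return {test name: reported MB/sec} for each timing line found."""
    return {m.group(1): float(m.group(4)) for m in TIMING_RE.finditer(text)}


def size_mbytes(lba48_sectors: int) -> tuple:
    """Return (binary MBytes, decimal MBytes) for a given LBA48 sector count."""
    total_bytes = lba48_sectors * SECTOR_BYTES
    return total_bytes // (1024 * 1024), total_bytes // (1000 * 1000)


sample = (
    "/dev/sda:\n"
    " Timing cached reads:   29332 MB in  1.99 seconds = 14735.98 MB/sec\n"
    " Timing buffered disk reads:  340 MB in  3.00 seconds = 113.23 MB/sec\n"
)

print(parse_timings(sample))
# {'cached reads': 14735.98, 'buffered disk reads': 113.23}

print(size_mbytes(488397168))
# (238475, 250059)
```

The second result agrees with the two "device size" lines reported by `hdparm -I`, so the drive's self-reported geometry is at least internally consistent. That says nothing about intermittent faults, which is in line with the disks not being the obvious culprit here.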
Comment 4•13 years ago
Did buildbot stop because the wrong keys were attached to it?
Reporter
Comment 5•13 years ago
I put the correct keys on the slave and restarted it. Let's see if/how long it lasts. I think zandr has some money riding on that question.
Reporter
Comment 6•13 years ago
This box is unpingable again. I'll add a link here from the reboots bug. Maybe it's just a regular fail-to-reboot, but maybe it's worse.
Assignee: dustin → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Comment 7•13 years ago
It's not showing any overt signs of disk problems. I'll reimage before we write it off/send it back, but we're going that way.
Reporter
Comment 8•13 years ago
You'll be shocked to hear that it's unpingable again. It was running fine in staging for a few days before that.
Comment 9•13 years ago
Powered off via IPMI, and I'll send it out for repair.
Updated•13 years ago
Whiteboard: [buildslaves][hardware][buildduty] → [buildslaves][hardware][buildduty][subject to embargo]
Updated•13 years ago
Summary: linux-ix-slave42 rebooted via IPMI - bum disks? → linux-ix-slave42 dead - needs to be returned for repairs
Updated•13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations