Closed Bug 624207 Opened 13 years ago Closed 13 years ago

linux-ix-slave42 dead - needs to be returned for repairs

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 596366

People

(Reporter: dustin, Unassigned)

Details

(Whiteboard: [buildslaves][hardware][buildduty][subject to embargo])

Zandr, I believe that when this slave had trouble yesterday or Thursday, you wagered it was a dud. I think you were right: I just rebooted it manually via IPMI after Nagios started sending no-PING alerts for it. You can dup this to the appropriate bug.

The machine appears to be back in the game right now, but who knows what will happen at its next reboot.
Ah, the Nagios interface finally loaded the logs: the slave has been dropping off like this for at least a few days now. It's hard to tell whether it eventually comes back up on its own, or whether releng hands have touched it before each recovery (this lack of a permanent record is something border collie should remedy).
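One way to get a permanent record of when the slave drops off and comes back is to log each change in reachability with a timestamp. The following is only a sketch: `record_state` is a hypothetical helper, the log path is whatever the caller chooses, and the probe command is supplied by the caller rather than taken from Nagios.

```shell
# Sketch: append a line to a log only when a host's reachability changes.
# record_state is a hypothetical helper; the caller passes the probe command.
record_state() {
    host=$1
    log=$2
    shift 2
    # Run the probe; its exit status decides the state.
    if "$@" >/dev/null 2>&1; then
        state=up
    else
        state=down
    fi
    # The third field of the last log line is the previously recorded state.
    last=
    if [ -f "$log" ]; then
        last=$(tail -n 1 "$log" | awk '{print $3}')
    fi
    # Only write when the state differs from the last recorded one.
    if [ "$state" != "$last" ]; then
        printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$host" "$state" >> "$log"
    fi
}
```

Run periodically (from cron, say) as `record_state linux-ix-slave42 /var/tmp/slave42-state.log ping -c 1 -W 2 linux-ix-slave42`, this records when each outage began and ended. It does not record who touched the machine, so manual reboots would still need a note alongside.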

I'm going to take this slave out of the build pool for the moment:
[root@linux-ix-slave42 slave]# mv buildbot.tac buildbot.tac.bug624207
[root@linux-ix-slave42 slave]# touch DO_NOT_START
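
The two commands above could be wrapped so the bug number is always recorded in the renamed file. This is a minimal sketch, not an existing releng tool: `disable_slave` and its arguments are hypothetical, and only the file names `buildbot.tac` and `DO_NOT_START` come from the transcript.

```shell
# Sketch: take a buildbot slave out of the pool, tagged with a bug number.
# disable_slave is a hypothetical helper; the file names match the transcript.
disable_slave() {
    slave_dir=$1
    bug=$2
    tac="$slave_dir/buildbot.tac"
    if [ ! -f "$tac" ]; then
        echo "no buildbot.tac in $slave_dir (already disabled?)" >&2
        return 1
    fi
    # Rename the .tac file so buildbot cannot find it under its usual name;
    # the suffix records which bug explains why the slave was pulled.
    mv "$tac" "$tac.bug$bug" || return 1
    # Drop the marker file used in the transcript.
    touch "$slave_dir/DO_NOT_START"
}
```

Called as `disable_slave /path/to/slave 624207`, it refuses to run a second time because `buildbot.tac` is already gone, which avoids overwriting the earlier rename.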

I've added it to the buildduty slave-tracking Google doc. I'll also mark it for a 2-day downtime, which can be extended as necessary.
Whiteboard: [buildslaves][hardware][buildduty]
Assignee: zandr → dustin
I'm going to try a reimage and see if that helps. The disks don't obviously show any sign of trouble.
Usual drive data for iX:
[root@linux-ix-slave42 ~]# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
	Model Number:       ST3250318AS                             
	Serial Number:      9VY97RF7
	Firmware Revision:  CC38    
Transport: Serial
Standards:
	Supported: 8 7 6 5 
	Likely used: 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  488397168
	device size with M = 1024*1024:      238475 MBytes
	device size with M = 1000*1000:      250059 MBytes (250 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = ?
	Recommended acoustic management value: 254, current value: 0
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	    	SET_MAX security extension
	   *	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	    	Write-Read-Verify feature set
	   *	WRITE_UNCORRECTABLE command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	SATA-I signaling speed (1.5Gb/s)
	   *	SATA-II signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	    	Device-initiated interface power management
	   *	Software settings preservation
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@linux-ix-slave42 ~]# hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   29332 MB in  1.99 seconds = 14735.98 MB/sec
 Timing buffered disk reads:  340 MB in  3.00 seconds = 113.23 MB/sec
[root@linux-ix-slave42 ~]#
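
As a sanity check, the sizes hdparm reports follow from its LBA48 sector count. The sketch below assumes 512-byte logical sectors, which the output does not state but which is consistent with its figures. The LBA28 count is simply the ceiling of 28-bit addressing (2^28 - 1), so it says nothing about the real capacity of the drive.

```shell
# Values copied from the hdparm -I output above.
lba48_sectors=488397168
sector_bytes=512   # assumption: 512-byte logical sectors

bytes=$(( lba48_sectors * sector_bytes ))
mb_binary=$(( bytes / (1024 * 1024) ))    # "M = 1024*1024" line
mb_decimal=$(( bytes / (1000 * 1000) ))   # "M = 1000*1000" line
lba28_limit=$(( (1 << 28) - 1 ))          # largest 28-bit sector count

echo "total bytes:    $bytes"
echo "M = 1024*1024:  $mb_binary MBytes"
echo "M = 1000*1000:  $mb_decimal MBytes"
echo "LBA28 ceiling:  $lba28_limit sectors"
```

The two MByte figures come out to 238475 and 250059, matching the hdparm lines, so the drive is at least reporting a self-consistent geometry.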
Did buildbot stop because the slave had the wrong keys on it?
I put the correct keys on the slave and restarted it.  Let's see if/how long it lasts.  I think zandr has some money riding on that question.
This box is unpingable again.  I'll add a link here from the reboots bug.  Maybe it's just a regular fail-to-reboot, but maybe it's worse.
Assignee: dustin → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
It's not showing any overt signs of disk problems. I'll reimage it before we write it off and send it back, but that's the way this is heading.
You'll be shocked to hear that it's unpingable again.  It was running fine in staging for a few days before that.
Powered off via IPMI, and I'll send it out for repair.
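For reference, a power-off or reboot over IPMI can be scripted with `ipmitool`. The thread does not say which tool or interface was actually used, so everything here is an assumption: the `ipmi_power` wrapper, the `-mgmt` hostname suffix for the BMC, the `lanplus` interface, and the `IPMI_USER` variable are all hypothetical. Only the `ipmitool` options and `chassis power` subcommands are real.

```shell
# Hypothetical wrapper around ipmitool's chassis power commands.
# Assumptions: the BMC answers at "<slave>-mgmt", IPMI_USER holds the login,
# and the password is in IPMI_PASSWORD (which ipmitool's -E flag reads).
ipmi_power() {
    slave=$1
    action=$2
    case "$action" in
        status|on|off|cycle|reset) ;;
        *) echo "usage: ipmi_power <slave> status|on|off|cycle|reset" >&2
           return 2 ;;
    esac
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} ipmitool -I lanplus -H "${slave}-mgmt" \
        -U "${IPMI_USER:-admin}" -E chassis power "$action"
}
```

Running `DRY_RUN=1 ipmi_power linux-ix-slave42 off` prints the command without touching the machine, which is a cheap way to double-check the target before powering off a slave.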
Whiteboard: [buildslaves][hardware][buildduty] → [buildslaves][hardware][buildduty][subject to embargo]
Summary: linux-ix-slave42 rebooted via IPMI - bum disks? → linux-ix-slave42 dead - needs to be returned for repairs
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations