attach linux-ix-slave06 to buildbot-master3.b.m.o when it's back

RESOLVED FIXED

Status

P3
normal
RESOLVED FIXED
8 years ago
6 years ago

People

(Reporter: dustin, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [buildslaves][hardware][slaveduty])

The 'rm_configs' step of builder "Android R7 tryserver build" has been failing repeatedly on slave linux-ix-slave06.

rm -rf configs
 in dir /builds/slave/try-mb-br-andrd-r7-bld/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['rm', '-rf', 'configs']
 environment:
...
 closing stdin
 using PTY: True

command timed out: 1200 seconds without output, killing pid 3638
process killed by signal 9
program finished with exit code -1
elapsedTime=1200.682441
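
For readers unfamiliar with this failure mode: the log shows the step being killed after 1200 seconds without output, not after failing outright. Since rm -rf prints nothing, the idle timeout is in effect a limit on the step's total run time. Below is a minimal sketch of such an idle-output watchdog. It is not Buildbot's actual implementation; the function name run_with_idle_timeout is hypothetical, and it assumes a POSIX system, where pipes can be used with selectors.

```python
import os
import selectors
import subprocess
import sys


def run_with_idle_timeout(argv, idle_timeout):
    """Run argv and SIGKILL it if it writes nothing for idle_timeout seconds.

    Returns (returncode, timed_out, output). On POSIX the return code of a
    process killed by SIGKILL is -9.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ)
    chunks = []
    timed_out = False
    try:
        while True:
            # select() returns an empty list when the window passes silently.
            if not sel.select(timeout=idle_timeout):
                proc.kill()
                timed_out = True
                break
            chunk = os.read(proc.stdout.fileno(), 4096)
            if not chunk:  # EOF: the child closed its output
                break
            chunks.append(chunk)  # any output restarts the idle window
    finally:
        sel.close()
        proc.stdout.close()
    return proc.wait(), timed_out, b"".join(chunks)
```

With this sketch, a silent command such as run_with_idle_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 0.5) is killed once the idle window passes, however healthy the process itself may be.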

I checked earlier, passing builds, and this step was taking steadily longer with each build (newest first):

<several failures here>
Sat Jan 8 06:11:05 2011: 19m 13s
<failed here>
Sat Jan 8 00:41:16 2011: 18m 23s
Fri Jan 7 21:53:04 2011: 17m 24s
Fri Jan 7 17:36:04 2011: 13m 36s
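
To put numbers on that trend, the durations above can be converted to seconds and compared with the 1200-second timeout from the step log. This is only a small arithmetic sketch; the helper name to_seconds is made up for illustration.

```python
import re

# Durations of the rm_configs step as listed above, oldest first.
DURATIONS = ["13m 36s", "17m 24s", "18m 23s", "19m 13s"]
TIMEOUT_SECS = 1200  # from the step log: "timeout 1200 secs"


def to_seconds(text):
    """Convert a string like '19m 13s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


seconds = [to_seconds(d) for d in DURATIONS]
headroom = [TIMEOUT_SECS - s for s in seconds]

for label, secs, left in zip(DURATIONS, seconds, headroom):
    print(f"{label}: {secs} s, {left} s below the timeout")
```

The last passing build finished the step only 47 seconds under the timeout, so the subsequent failures look like the same slowdown crossing the 1200-second line rather than a new problem.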

I don't know exactly how this directory is checked out, from my quick look through the preceding steps, but it's a checkout of build/buildbot-configs, so it's not huge by any means.  The steadily increasing time to rm is an interesting data point!

I've disabled buildbot on this slave and added it to the slave tracking spreadsheet.
[cltbld@linux-ix-slave06 slave]$ mv buildbot.tac buildbot.tac.bug624210
[cltbld@linux-ix-slave06 slave]$ touch DO_NOT_START
I stopped the build during the hg_update step.  I don't know if configs would have been filled with more cruft before being removed, but it certainly didn't take long to remove by hand:

[cltbld@linux-ix-slave06 build]$ time rm -rf configs/
real    0m0.040s
user    0m0.002s
sys     0m0.010s
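
For anyone wanting to repeat this comparison on another slave, here is a rough sketch that builds a scratch tree of about the size in the bug summary (954 files, 4.0M) and times its removal. The function names and the directory layout are assumptions, not the real layout of build/buildbot-configs. Freshly written files sit in the page cache, so this measures the warm case only and would not by itself reproduce a cold-disk slowdown.

```python
import os
import shutil
import tempfile
import time


def make_tree(root, n_files=954, total_bytes=4 * 1024 * 1024):
    """Create n_files small files under root, roughly total_bytes in all."""
    payload = b"x" * (total_bytes // n_files)
    for i in range(n_files):
        # Spread the files over 20 subdirectories (an arbitrary choice).
        subdir = os.path.join(root, f"dir{i % 20:02d}")
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, f"file{i:04d}"), "wb") as fh:
            fh.write(payload)


def time_removal(n_files=954):
    """Build a scratch tree, remove it, and return (elapsed_secs, removed)."""
    parent = tempfile.mkdtemp(prefix="rm-timing-")
    target = os.path.join(parent, "configs")
    os.mkdir(target)
    try:
        make_tree(target, n_files)
        start = time.perf_counter()
        shutil.rmtree(target)
        elapsed = time.perf_counter() - start
        removed = not os.path.exists(target)
    finally:
        # Clean up the scratch parent even if something above failed.
        shutil.rmtree(parent, ignore_errors=True)
    return elapsed, removed
```

Calling time_removal() returns the elapsed seconds and a flag confirming the tree is gone; a healthy disk should be in the same ballpark as the manual rm above.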

dmesg ends with the usual
 eth0: no IPv6 routers present
so I don't see anything indicating hardware failures there.

A mystery for the ages?
Assignee: dustin → nobody
I also scheduled 4 days of downtime for this host in nagios.
Duplicate of this bug: 624206
Summary: linux-ix-slave06 taking ~20M to remove 954 files (4.0M) → linux-ix-slave06 taking ~20mins to remove 954 files (4.0M)
Sounds like another ix with slow/sad disk. 

zandr, you're already investigating a batch of those in other bugs. Is this machine already on your list (in which case we'll close as DUP), or is this a new report (in which case I guess we push it to you)?
Sad, but not very sad. Here's the usual test data for iX:

[root@linux-ix-slave06 ~]# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
	Model Number:       ST3250318AS                             
	Serial Number:      9VY95DR1
	Firmware Revision:  CC38    
Transport: Serial
Standards:
	Supported: 8 7 6 5 
	Likely used: 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  488397168
	device size with M = 1024*1024:      238475 MBytes
	device size with M = 1000*1000:      250059 MBytes (250 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = ?
	Recommended acoustic management value: 254, current value: 0
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 *udma3 udma4 udma5 udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	    	SET_MAX security extension
	   *	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	    	Write-Read-Verify feature set
	   *	WRITE_UNCORRECTABLE command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	SATA-I signaling speed (1.5Gb/s)
	   *	SATA-II signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	    	Device-initiated interface power management
	   *	Software settings preservation
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	42min for SECURITY ERASE UNIT. 42min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@linux-ix-slave06 ~]# hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   29336 MB in  1.99 seconds = 14738.11 MB/sec
 Timing buffered disk reads:  280 MB in  3.01 seconds =  93.00 MB/sec
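
As a rough sanity check on "not very sad", the reported rates can be pulled out of the hdparm -tT output and compared with the size of the directory in the bug summary (4.0M, treated here as about 4 MB). This sketch only parses the two lines shown above; the regular expression and names are mine. Note that rm -rf is dominated by metadata operations rather than sequential reads, so this is only a loose bound.

```python
import re

HDPARM_OUTPUT = """\
 Timing cached reads:   29336 MB in  1.99 seconds = 14738.11 MB/sec
 Timing buffered disk reads:  280 MB in  3.01 seconds =  93.00 MB/sec
"""

# Capture the test label and the rate that hdparm itself reports.
LINE_RE = re.compile(r"Timing (.+?):.*=\s+([\d.]+) MB/sec")


def parse_timings(text):
    """Map each hdparm timing label to its reported MB/sec."""
    return {label: float(rate) for label, rate in LINE_RE.findall(text)}


rates = parse_timings(HDPARM_OUTPUT)
TREE_MB = 4.0  # approximate size of the configs checkout, per the bug summary
sequential_secs = TREE_MB / rates["buffered disk reads"]

print(f"buffered read rate: {rates['buffered disk reads']:.2f} MB/sec")
print(f"time to read {TREE_MB} MB sequentially: {sequential_secs:.3f} s")
```

At the measured buffered rate, reading the whole tree takes a few hundredths of a second, so raw sequential throughput on this disk does not account for a 20-minute removal.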
Blocks: 596366
See Also: → bug 606716
Are we still following up on this? What's the current state?
Priority: -- → P3
When this slave is ready to come online again, it should point to the new try master in MV as per bug 617321 (either test-master02.b.m.o or buildbot-master3.b.m.o, depending on whether bug 627803 has been fixed yet).
(In reply to comment #6)
> Sad, but not very sad. Here's the usual test data for iX:
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 596366
reopening and morphing for comment 8
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Summary: linux-ix-slave06 taking ~20mins to remove 954 files (4.0M) → attach linux-ix-slave06 to buildbot-master3.b.m.o when it's back

Whiteboard: [buildslaves][hardware][buildduty] → [buildslaves][hardware][slaveduty]
Shouldn't this bug depend on bug 596366, rather than the other way around?
No longer blocks: 596366
Depends on: 596366
The slave is in scl1, and should point to one of the scl1 try masters, not the master in mtv1.  That can happen with slavealloc.
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering