
attach linux-ix-slave06 to buildbot-master3.b.m.o when it's back

RESOLVED FIXED

Status

Release Engineering
General
P3
normal
RESOLVED FIXED
7 years ago
4 years ago

People

(Reporter: dustin, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [buildslaves][hardware][slaveduty])

(Reporter)

Description

7 years ago
The 'rm_configs' step of builder "Android R7 tryserver build" has been failing repeatedly on slave linux-ix-slave06.

rm -rf configs
 in dir /builds/slave/try-mb-br-andrd-r7-bld/build (timeout 1200 secs)
 watching logfiles {}
 argv: ['rm', '-rf', 'configs']
 environment:
...
 closing stdin
 using PTY: True

command timed out: 1200 seconds without output, killing pid 3638
process killed by signal 9
program finished with exit code -1
elapsedTime=1200.682441

I checked earlier, passing builds, and this was taking steadily longer for each build:

<several failures here>
Sat Jan 8 06:11:05 2011: 19m 13s
<failed here>
Sat Jan 8 00:41:16 2011: 18m 23s
Fri Jan 7 21:53:04 2011: 17m 24s
Fri Jan 7 17:36:04 2011: 13m 36s

From my quick look through the preceding steps, I don't know exactly how this directory is checked out, but it's a checkout of build/buildbot-configs, so it's not huge by any means. The steadily increasing time to rm is an interesting data point!
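A quick way to put numbers on the trend is to convert the passing durations above to seconds and compare them with the 1200-second step timeout from the log. The sketch below does only that arithmetic. The per-file figure assumes each run removed the 954 files mentioned in the bug summary, which is an assumption rather than something the logs show.

```python
import re

# Durations of the passing 'rm_configs' steps quoted above, oldest first.
PASSING = ["13m 36s", "17m 24s", "18m 23s", "19m 13s"]
TIMEOUT_SECS = 1200  # step timeout from the log ("timeout 1200 secs")
FILES_REMOVED = 954  # assumed: file count taken from the bug summary


def to_seconds(duration: str) -> int:
    """Convert a 'NNm NNs' duration string into seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


secs = [to_seconds(d) for d in PASSING]
# Growth between consecutive passing builds.
increments = [b - a for a, b in zip(secs, secs[1:])]
# How close the last passing build came to being killed.
headroom = TIMEOUT_SECS - secs[-1]
# Rough average cost per file, under the 954-file assumption.
per_file = secs[-1] / FILES_REMOVED

print("durations (s):", secs)
print("increase between builds (s):", increments)
print("headroom before timeout (s):", headroom)
print(f"average time per file (s): {per_file:.2f}")
```

The last passing run (19m 13s, i.e. 1153 s) finished only 47 seconds inside the timeout, so the failures that followed are consistent with the trend simply continuing past 1200 s.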

I've disabled buildbot on this slave and added it to the slave tracking spreadsheet.
(Reporter)

Comment 1

7 years ago
[cltbld@linux-ix-slave06 slave]$ mv buildbot.tac buildbot.tac.bug624210
[cltbld@linux-ix-slave06 slave]$ touch DO_NOT_START
(Reporter)

Comment 2

7 years ago
I stopped the build during the hg_update step.  I don't know if configs would have been filled with more cruft before being removed, but it certainly didn't take long to remove by hand:

[cltbld@linux-ix-slave06 build]$ time rm -rf configs/
real    0m0.040s
user    0m0.002s
sys     0m0.010s
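To repeat this kind of measurement on another slave without a real checkout, one option is to build a throwaway tree of similar size and time its removal. This is a minimal sketch: it uses Python's `shutil.rmtree` rather than `rm -rf` itself, and the directory layout and 4 KiB file size are invented to roughly match the "954 files (4.0M)" in the bug summary.

```python
import os
import shutil
import tempfile
import time


def make_tree(root: str, n_files: int, size: int = 4096) -> None:
    """Populate root with n_files small files spread over ten subdirectories."""
    for i in range(n_files):
        subdir = os.path.join(root, f"dir{i % 10}")
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, f"file{i}"), "wb") as fh:
            fh.write(b"\0" * size)


def timed_rmtree(path: str) -> float:
    """Remove path recursively and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    shutil.rmtree(path)
    return time.perf_counter() - start


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    target = os.path.join(base, "configs")
    os.makedirs(target)
    make_tree(target, 954)  # file count from the bug summary; layout is made up
    elapsed = timed_rmtree(target)
    print(f"removed 954 files in {elapsed:.3f}s")
    os.rmdir(base)
```

On a healthy disk this should finish in well under a second, in line with the 0.040s hand removal above; a result in the minutes would point at the same slowdown the build saw.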

dmesg ends with the usual
 eth0: no IPv6 routers present
so I don't see anything indicating hardware failures there.

A mystery for the ages?
(Reporter)

Updated

7 years ago
Assignee: dustin → nobody
(Reporter)

Comment 3

7 years ago
I also scheduled 4 days of downtime for this host in nagios.
Duplicate of this bug: 624206
Summary: linux-ix-slave06 taking ~20M to remove 954 files (4.0M) → linux-ix-slave06 taking ~20mins to remove 954 files (4.0M)
Sounds like another iX machine with a slow/sad disk.

zandr, you're already investigating a batch of those in other bugs. Is this machine already on your list (in which case we'll close this as a DUP), or is this a new report (in which case, I guess we push it to you)?
Sad, but not very sad. Here's the usual test data for iX:

[root@linux-ix-slave06 ~]# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
	Model Number:       ST3250318AS                             
	Serial Number:      9VY95DR1
	Firmware Revision:  CC38    
Transport: Serial
Standards:
	Supported: 8 7 6 5 
	Likely used: 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  488397168
	device size with M = 1024*1024:      238475 MBytes
	device size with M = 1000*1000:      250059 MBytes (250 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = ?
	Recommended acoustic management value: 254, current value: 0
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 *udma3 udma4 udma5 udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	    	SET_MAX security extension
	   *	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	    	Write-Read-Verify feature set
	   *	WRITE_UNCORRECTABLE command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	SATA-I signaling speed (1.5Gb/s)
	   *	SATA-II signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	    	Device-initiated interface power management
	   *	Software settings preservation
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	42min for SECURITY ERASE UNIT. 42min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@linux-ix-slave06 ~]# hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   29336 MB in  1.99 seconds = 14738.11 MB/sec
 Timing buffered disk reads:  280 MB in  3.01 seconds =  93.00 MB/sec
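Since this hdparm output is described as the usual test data for iX machines, comparing it across slaves is easier if the two timing lines are pulled out programmatically. A minimal sketch, assuming the output format shown above; the 50 MB/sec threshold is hypothetical and not taken from this bug.

```python
import re

# Matches hdparm -tT summary lines such as
#  "Timing buffered disk reads:  280 MB in  3.01 seconds =  93.00 MB/sec"
TIMING_RE = re.compile(
    r"Timing (?P<kind>cached|buffered disk) reads:\s+(?P<mb>\d+) MB in\s+"
    r"(?P<secs>[\d.]+) seconds =\s+(?P<rate>[\d.]+) MB/sec"
)


def parse_hdparm_timings(output: str) -> dict:
    """Return {'cached': MB/s, 'buffered disk': MB/s} parsed from hdparm -tT text."""
    results = {}
    for match in TIMING_RE.finditer(output):
        results[match.group("kind")] = float(match.group("rate"))
    return results


SAMPLE = """/dev/sda:
 Timing cached reads:   29336 MB in  1.99 seconds = 14738.11 MB/sec
 Timing buffered disk reads:  280 MB in  3.01 seconds =  93.00 MB/sec
"""

MIN_BUFFERED_MB_S = 50.0  # hypothetical threshold, not from the bug

timings = parse_hdparm_timings(SAMPLE)
print(timings)
if timings["buffered disk"] < MIN_BUFFERED_MB_S:
    print("disk looks slow")
else:
    print("disk throughput OK")
```

For this slave the buffered read rate of 93 MB/sec clears such a threshold, which fits the "sad, but not very sad" assessment.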
(Reporter)

Updated

7 years ago
Blocks: 596366
See Also: → bug 606716

Comment 7

7 years ago
Are we still following up on this? What's the current state?

Updated

7 years ago
Priority: -- → P3
When this slave is ready to come online again, it should point to the new try master in MV as per bug 617321 (either to test-master02.b.m.o or buildbot-master3.b.m.o depending on whether bug 627803 has been fixed yet)
(In reply to comment #6)
> Sad, but not very sad. Here's the usual test data for iX:
> 
> [root@linux-ix-slave06 ~]# hdparm -I /dev/sda
> 
> /dev/sda:
> 
> ATA device, with non-removable media
>     Model Number:       ST3250318AS                             
>     Serial Number:      9VY95DR1
>     Firmware Revision:  CC38    
> Transport: Serial
> Standards:
>     Supported: 8 7 6 5 
>     Likely used: 8
> Configuration:
>     Logical        max    current
>     cylinders    16383    16383
>     heads        16    16
>     sectors/track    63    63
>     --
>     CHS current addressable sectors:   16514064
>     LBA    user addressable sectors:  268435455
>     LBA48  user addressable sectors:  488397168
>     device size with M = 1024*1024:      238475 MBytes
>     device size with M = 1000*1000:      250059 MBytes (250 GB)
> Capabilities:
>     LBA, IORDY(can be disabled)
>     Queue depth: 32
>     Standby timer values: spec'd by Standard, no device specific minimum
>     R/W multiple sector transfer: Max = 16    Current = ?
>     Recommended acoustic management value: 254, current value: 0
>     DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 *udma3 udma4 udma5 udma6 
>          Cycle time: min=120ns recommended=120ns
>     PIO: pio0 pio1 pio2 pio3 pio4 
>          Cycle time: no flow control=120ns  IORDY flow control=120ns
> Commands/features:
>     Enabled    Supported:
>        *    SMART feature set
>             Security Mode feature set
>        *    Power Management feature set
>        *    Write cache
>        *    Look-ahead
>        *    Host Protected Area feature set
>        *    WRITE_BUFFER command
>        *    READ_BUFFER command
>        *    DOWNLOAD_MICROCODE
>             SET_MAX security extension
>        *    Automatic Acoustic Management feature set
>        *    48-bit Address feature set
>        *    Device Configuration Overlay feature set
>        *    Mandatory FLUSH_CACHE
>        *    FLUSH_CACHE_EXT
>        *    SMART error logging
>        *    SMART self-test
>        *    General Purpose Logging feature set
>        *    WRITE_{DMA|MULTIPLE}_FUA_EXT
>        *    64-bit World wide name
>             Write-Read-Verify feature set
>        *    WRITE_UNCORRECTABLE command
>        *    {READ,WRITE}_DMA_EXT_GPL commands
>        *    Segmented DOWNLOAD_MICROCODE
>        *    SATA-I signaling speed (1.5Gb/s)
>        *    SATA-II signaling speed (3.0Gb/s)
>        *    Native Command Queueing (NCQ)
>        *    Phy event counters
>             Device-initiated interface power management
>        *    Software settings preservation
> Security: 
>     Master password revision code = 65534
>         supported
>     not    enabled
>     not    locked
>     not    frozen
>     not    expired: security count
>         supported: enhanced erase
>     42min for SECURITY ERASE UNIT. 42min for ENHANCED SECURITY ERASE UNIT.
> Checksum: correct
> [root@linux-ix-slave06 ~]# hdparm -tT /dev/sda
> 
> /dev/sda:
>  Timing cached reads:   29336 MB in  1.99 seconds = 14738.11 MB/sec
>  Timing buffered disk reads:  280 MB in  3.01 seconds =  93.00 MB/sec
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 596366
(Reporter)

Comment 10

7 years ago
reopening and morphing for comment 8
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Summary: linux-ix-slave06 taking ~20mins to remove 954 files (4.0M) → attach linux-ix-slave06 to buildbot-master3.b.m.o when it's back

Updated

7 years ago
Whiteboard: [buildslaves][hardware][buildduty] → [buildslaves][hardware][slaveduty]
Shouldn't this bug depend on bug 596366, rather than the other way around?
No longer blocks: 596366
Depends on: 596366
(Reporter)

Comment 12

7 years ago
The slave is in scl1, and should point to one of the scl1 try masters, not the master in mtv1.  That can happen with slavealloc.
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
(Assignee)

Updated

4 years ago
Product: mozilla.org → Release Engineering