Closed
Bug 596366
(ix-drive-issues)
Opened 14 years ago
Closed 13 years ago
latest batch of ix machines have slow and failing drives
Categories
(Infrastructure & Operations :: RelOps: General, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: zandr)
Details
(Whiteboard: [buildslaves][hardware][duptome][subject to embargo])
Attachments
(5 files)
I noticed during the 3.6.10 release that the latest batch of ix machines seems to have slower disks than the earlier ones. Some timings from hdparm, first from an older machine and then from one in the new batch:
[root@mv-moz2-linux-ix-slave03 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29532 MB in 2.00 seconds = 14801.87 MB/sec
Timing buffered disk reads: 360 MB in 3.00 seconds = 119.85 MB/sec
[root@mv-moz2-linux-ix-slave03 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29524 MB in 2.00 seconds = 14797.74 MB/sec
Timing buffered disk reads: 360 MB in 3.01 seconds = 119.80 MB/sec
[root@mv-moz2-linux-ix-slave03 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29480 MB in 2.00 seconds = 14776.34 MB/sec
Timing buffered disk reads: 356 MB in 3.01 seconds = 118.21 MB/sec
---------------
[root@linux-ix-slave07 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29336 MB in 1.99 seconds = 14738.42 MB/sec
Timing buffered disk reads: 256 MB in 3.02 seconds = 84.76 MB/sec
[root@linux-ix-slave07 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29336 MB in 1.99 seconds = 14738.13 MB/sec
Timing buffered disk reads: 262 MB in 3.01 seconds = 86.98 MB/sec
[root@linux-ix-slave07 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29332 MB in 1.99 seconds = 14738.71 MB/sec
Timing buffered disk reads: 270 MB in 3.03 seconds = 89.09 MB/sec
As far as I can tell, they're set up exactly the same as the other ones, down to the hard drive firmware level. The filesystems are ext3, mounted with noatime. I haven't dug further than this.
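To compare the two sets of runs above at a glance, the buffered-read figures can be averaged per host. This is a minimal Python sketch, not something that was run as part of this bug; the `buffered_rates` helper and its regular expression are illustrative assumptions based on the hdparm output format shown above.

```python
import re
from statistics import mean

# Matches the "Timing buffered disk reads" line printed by `hdparm -t`.
BUFFERED = re.compile(r"Timing buffered disk reads:.*=\s*([\d.]+)\s*MB/sec")


def buffered_rates(output: str) -> list[float]:
    """Return every buffered-read rate (MB/sec) found in hdparm output."""
    return [float(m.group(1)) for m in BUFFERED.finditer(output)]


# Buffered-read lines copied from the runs above.
OLD_BATCH = """
Timing buffered disk reads: 360 MB in 3.00 seconds = 119.85 MB/sec
Timing buffered disk reads: 360 MB in 3.01 seconds = 119.80 MB/sec
Timing buffered disk reads: 356 MB in 3.01 seconds = 118.21 MB/sec
"""
NEW_BATCH = """
Timing buffered disk reads: 256 MB in 3.02 seconds = 84.76 MB/sec
Timing buffered disk reads: 262 MB in 3.01 seconds = 86.98 MB/sec
Timing buffered disk reads: 270 MB in 3.03 seconds = 89.09 MB/sec
"""

old_mean = mean(buffered_rates(OLD_BATCH))
new_mean = mean(buffered_rates(NEW_BATCH))
print(f"mv-moz2-linux-ix-slave03: {old_mean:.2f} MB/sec")
print(f"linux-ix-slave07:         {new_mean:.2f} MB/sec")
print(f"ratio new/old:            {new_mean / old_mean:.2f}")
```

On these figures, linux-ix-slave07 averages roughly 87 MB/sec against roughly 119 MB/sec for mv-moz2-linux-ix-slave03, while the cached-read numbers are essentially identical.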
Updated•14 years ago
Whiteboard: [buildslaves][hardware]
Comment 1•14 years ago
mrz:
I thought these machines were identical to the last batch. What is different about these new ix machines?
Assignee: nobody → mrz
Status: ASSIGNED → NEW
Component: Release Engineering → Server Operations
OS: Mac OS X → All
QA Contact: release → mrz
Comment 2•14 years ago
Nothing.
Assignee: mrz → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Reporter
Comment 3•14 years ago
linux-ix-slave17 is repeatedly getting hg into an uninterruptible sleep when cloning mozilla-1.9.2 for 3.6.11 tagging. This breaks tagging and requires a reboot. We should get IT to run diagnostics on at least one of these machines.
Severity: normal → major
Comment 4•14 years ago
I strongly suspect this bug of wasting most of my day today.
(A timeout after an rm -rf of ~17 MB took more than 20 minutes; 20-minute timeouts in several compiles in a row on linux-ix-slave02 that worked perfectly on mv-moz2-linux-ix-slave01.)
Comment 5•14 years ago
I did a quick set of tests and it looks like this might be a more widespread issue affecting all new IX boxes.
While nothing else was running on the machines, I ran the following two commands on both the linux and win32 machines:
time hg clone http://hg.mozilla.org/mozilla-central freshclone
time hg clone --pull --uncompressed freshclone copy
I found that the new batch of machines is slower than the original batch of ix machines in both cases. On linux-ix-slave02, the first command took roughly 10 times longer than on the old machine (43m27 versus 4m5), and the second took about three times longer (14m56 versus 4m35). The Windows tests showed that the local clone operation took nearly twice as long. The breakdown of real, user and sys times was only available on the Linux machines.
More detailed results below.
Win32
====================================================
on mw32-ix-slave01, hg clone http://.../mozilla-central freshclone took 12m38.
on w32-ix-slave02, hg clone http://.../mozilla-central freshclone took 14m31.
on mw32-ix-slave01, hg clone --pull --uncompressed freshclone copy took 10m50
on w32-ix-slave02, hg clone --pull --uncompressed freshclone copy took 18m52
Linux
====================================================
on mv-moz2-linux-ix-slave04, hg clone http://.../mozilla-central freshclone took
real 4m5
user 2m38
sys 0m11
on linux-ix-slave02, hg clone http://.../mozilla-central freshclone took
real 43m27
user 3m0
sys 0m11
on mv-moz2-linux-ix-slave04, hg clone --pull --uncompressed freshclone copy took
real 4m35
user 3m23
sys 0m12
on linux-ix-slave02, hg clone --pull --uncompressed freshclone copy took
real 14m56
user 3m49
sys 0m12
Summary: latest batch of linux ix machines seem to have slower disks → latest batch of ix machines have slow i/o
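The slowdown factors implied by the clone timings in comment 5 can be checked with a few lines of Python. This is only a sketch: `to_seconds` and `slowdown` are hypothetical helper names, and it assumes that a value such as `4m5` means 4 minutes 5 seconds.

```python
import re


def to_seconds(stamp: str) -> int:
    """Convert a duration written like '43m27' into seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m(\d+)", stamp).groups()
    return int(minutes) * 60 + int(seconds)


def slowdown(old: str, new: str) -> float:
    """Return how many times longer the new-batch run took."""
    return to_seconds(new) / to_seconds(old)


# (operation, old-batch time, new-batch time), copied from comment 5.
RESULTS = [
    ("win32 remote clone", "12m38", "14m31"),
    ("win32 local clone", "10m50", "18m52"),
    ("linux remote clone (real)", "4m5", "43m27"),
    ("linux local clone (real)", "4m35", "14m56"),
]

for name, old, new in RESULTS:
    print(f"{name}: {slowdown(old, new):.1f}x")
```

The user and sys times on Linux barely differ between the two machines, so nearly all of the extra wall-clock time is spent waiting rather than computing.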
Comment 6•14 years ago
jabba/jlazaro: from comment#2, mrz asserts the hardware is identical.
Are there any diagnostics that can be run on these to explain the performance difference? Or is there anything different about how these machines were imaged?
Marking as critical, as it's causing intermittent timeouts/hangs in production.
Assignee: nobody → server-ops
Severity: major → critical
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Updated•14 years ago
Assignee: server-ops → jlazaro
Comment 7•14 years ago
Contacted IX support via email, since this is hardware related.
Comment 8•14 years ago
linux-ix-slave16 was taken out of production by nthomas last night because it was taking 6 hours for a Linux maple leak test build (clobber).
See bug 601623 for the history of linux-ix-slave16 being sick a couple of weeks ago.
See Also: → 601623
Comment 9•14 years ago
Just took mv-moz2-linux-ix-slave02 and linux-ix-slave31 offline to loan out for investigation.
Comment 10•14 years ago
also handed off linux-ix-slave14
Comment 11•14 years ago
Confirming Lukas's comment:
mv-moz2-linux-ix-slave02
linux-ix-slave14
linux-ix-slave31 (scl)
These machines were taken by Chris Williams from IX Systems today to investigate the i/o issues. I will report back when I receive an update from IX.
Updated•14 years ago
Assignee: jlazaro → server-ops
Updated•14 years ago
Assignee: server-ops → jlazaro
Comment 12•14 years ago
From email with matt@ix systems:
They've confirmed performance differences on the few machines they took back from the office to test. More debugging is ongoing.
Comment 14•14 years ago
In order for IX to continue debugging these issues, we'll need to run tests on 37 machines with these serial numbers:
Group A
A1-14132
A1-14134
A1-14136
A1-14138
A1-14139
A1-14141
A1-16051
A1-16063
A1-16094
A1-16098
A1-16105
A1-16188
A1-16189
Group B
A1-14147
A1-14154
A1-14168
A1-14171
A1-14174
A1-14175
A1-16056
A1-16114
A1-16132
A1-16171
A1-16205
A1-16213
Group C
A1-14128
A1-14145
A1-14146
A1-14152
A1-14153
A1-14166
A1-16061
A1-16082
A1-16095
A1-16149
A1-16151
A1-16212
Test:
hdparm -I /dev/sda
hdparm -tT /dev/sda
We're hoping most of the machines from each group are Linux machines, since we don't have a tool for testing i/o on Windows. Would we need to schedule a downtime for this? Is there a concern that we won't get accurate results if these machines are in active production?
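One way to gather the two requested outputs consistently is a small wrapper that runs both commands and writes one log file per host. The sketch below is an illustration only: the `collect` function, the `hdparm-logs` directory name and the injectable `runner` argument are assumptions, not an existing tool. hdparm needs root, and the `-tT` timings are only meaningful on an otherwise idle machine.

```python
import subprocess
from pathlib import Path

# The two commands requested in the test above.
COMMANDS = (["hdparm", "-I"], ["hdparm", "-tT"])


def run_command(argv: list[str]) -> str:
    """Run one command and return its combined stdout and stderr."""
    done = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True, check=False)
    return done.stdout


def collect(hostname: str, device: str = "/dev/sda",
            out_dir: Path = Path("hdparm-logs"), runner=run_command) -> Path:
    """Write the output of both hdparm commands to <out_dir>/<hostname>.log."""
    out_dir.mkdir(parents=True, exist_ok=True)
    log = out_dir / f"{hostname}.log"
    with log.open("w") as fh:
        for base in COMMANDS:
            argv = [*base, device]
            # Record which command produced the output that follows.
            fh.write(f"# {' '.join(argv)}\n")
            fh.write(runner(argv))
            fh.write("\n")
    return log
```

Calling `collect("linux-ix-slave11")` as root on that host would leave `hdparm-logs/linux-ix-slave11.log` behind, which is easier to hand to IX than output pasted into comments.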
Comment 15•14 years ago
hdparm is available for Windows, iirc.
I don't know how to map those serials to hostnames. Do we have a way to do that? I can remove the machines from production so that we get mostly-idle testing done.
Comment 16•14 years ago
Here is hdparm for windows. It requires administrator permissions to run and will require cygwin1.dll to either be in the same directory or in the %PATH% system variable.
I have checked and it looks like cygwin1.dll is the only dependency (other than kernel32.dll).
Comment 17•14 years ago
It looks like joduinn/buildduty will be working to get these tests/results to IX.
Although these machines are in inventory, the "quick search" option does not allow us to search by serial number ( https://bugzilla.mozilla.org/show_bug.cgi?id=607050 )
We might have a spreadsheet with the hostnames and serial numbers to reference. I will forward that to buildduty/joduinn once I find it.
Updated•14 years ago
Assignee: jlazaro → joduinn
Updated•14 years ago
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Comment 18•14 years ago
Do you have the list of slaves with those serial numbers?
Comment 19•14 years ago
Please throw back to Release Engineering when you have the list.
Assignee: joduinn → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Updated•14 years ago
Assignee: server-ops → jlazaro
Comment 20•14 years ago
Group A
A1-14132 mv-moz2-linux-ix-slave12
A1-14134 mw32-ix-slave17
A1-14136 mw32-ix-slave13
A1-14138 mv-moz2-linux-ix-slave15
A1-14139 mv-moz2-linux-ix-slave02
A1-14141 mv-moz2-linux-ix-slave11
A1-16051 w32-ix-slave05
A1-16063 w32-ix-slave17
A1-16094 w32-ix-slave31
A1-16098 w32-ix-slave35
A1-16105 w32-ix-slave42
A1-16188 linux64-ix-slave16
A1-16189 linux64-ix-slave17
Group B
A1-14147 mv-moz2-linux-ix-slave08
A1-14154 mw64-ix-slave01
A1-14168 mw32-ix-slave07
A1-14171 mw32-ix-slave05
A1-14174 mw32-ix-slave10
A1-14175 mw32-ix-slave18
A1-16056 w32-ix-slave10
A1-16114 w64-ix-slave09
A1-16132 w64-ix-slave27
A1-16171 linux-ix-slave41
A1-16205 linux64-ix-slave33
A1-16213 linux64-ix-slave41
Group C
A1-14128 mw32-ix-slave23
A1-14145 mw32-ix-slave11
A1-14146 mw32-ix-slave22
A1-14152 mv-moz2-linux-ix-slave23
A1-14153 mv-moz2-linux-ix-slave19
A1-14166 mv-moz2-linux-ix-slave13
A1-16061 w32-ix-slave15
A1-16082 linux-ix-slave11
A1-16095 w32-ix-slave32
A1-16149 linux-ix-slave19
A1-16151 linux-ix-slave21
A1-16212 linux64-ix-slave40
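With the serial-to-hostname list in this form, looking up a host and counting platforms per group is straightforward to script. The Python sketch below uses only a few rows from Group A; `parse_inventory` and `platform` are hypothetical helpers, and the platform names are inferred from the hostname prefixes rather than taken from inventory.

```python
from collections import Counter

# A few rows copied from Group A above; the full list parses the same way.
GROUP_A_SAMPLE = """
Group A
A1-14132 mv-moz2-linux-ix-slave12
A1-14134 mw32-ix-slave17
A1-16051 w32-ix-slave05
A1-16188 linux64-ix-slave16
"""


def parse_inventory(text: str) -> dict[str, str]:
    """Map each serial number to its hostname, skipping group headings."""
    pairs = (line.split() for line in text.splitlines()
             if line.startswith("A1-"))
    return {serial: host for serial, host in pairs}


def platform(hostname: str) -> str:
    """Guess the platform from the hostname (naming scheme is inferred)."""
    name = hostname.removeprefix("mv-moz2-")
    # "linux64" must be tested before "linux" because it shares the prefix.
    for prefix, label in (("linux64", "linux64"), ("linux", "linux"),
                          ("mw32", "win32"), ("w32", "win32"),
                          ("mw64", "win64"), ("w64", "win64")):
        if name.startswith(prefix):
            return label
    return "unknown"


inventory = parse_inventory(GROUP_A_SAMPLE)
print(inventory["A1-16188"])
print(Counter(platform(host) for host in inventory.values()))
```

Running the same count over each full group shows how many Linux hosts are available for hdparm testing without needing the Windows build.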
Updated•14 years ago
Assignee: jlazaro → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Comment 22•14 years ago
[root@mv-moz2-linux-ix-slave12 ~]# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 5VY0LB7E
Firmware Revision: CC45
Transport: Serial
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 208, current value: 208
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* SATA-I signaling speed (1.5Gb/s)
* SATA-II signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@mv-moz2-linux-ix-slave12 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29384 MB in 1.99 seconds = 14728.98 MB/sec
Timing buffered disk reads: 376 MB in 3.00 seconds = 125.13 MB/sec
Comment 23•14 years ago
A1-14134 mw32-ix-slave17
A1-14136 mw32-ix-slave13
are both unreachable
Comment 24•14 years ago
[root@mv-moz2-linux-ix-slave15 ~]# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 5VY17LN3
Firmware Revision: CC45
Transport: Serial
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 208, current value: 208
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* SATA-I signaling speed (1.5Gb/s)
* SATA-II signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@mv-moz2-linux-ix-slave15 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29336 MB in 2.00 seconds = 14704.02 MB/sec
Timing buffered disk reads: 278 MB in 3.00 seconds = 92.62 MB/sec
Comment 25•14 years ago
A1-14139 mv-moz2-linux-ix-slave02
is unreachable
Comment 26•14 years ago
[root@mv-moz2-linux-ix-slave11 ~]# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 5VY0LAK8
Firmware Revision: CC45
Transport: Serial
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 208, current value: 208
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* SATA-I signaling speed (1.5Gb/s)
* SATA-II signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
42min for SECURITY ERASE UNIT. 42min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@mv-moz2-linux-ix-slave11 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29480 MB in 1.99 seconds = 14776.95 MB/sec
Timing buffered disk reads: 370 MB in 3.01 seconds = 123.04 MB/sec
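Since these dumps are long and nearly identical, it helps to reduce each one to the fields that identify the drive. The following Python sketch pulls the model, serial and firmware lines out of `hdparm -I` output and groups hosts by model and firmware; the `identity` helper is an illustrative assumption, and the sample strings contain only the three identity lines from comments 22, 24 and 26.

```python
import re
from collections import defaultdict

FIELDS = ("Model Number", "Serial Number", "Firmware Revision")


def identity(output: str) -> dict[str, str]:
    """Pull the model, serial and firmware fields out of `hdparm -I` output."""
    found = {}
    for field in FIELDS:
        match = re.search(rf"^\s*{re.escape(field)}:\s*(\S+)",
                          output, re.MULTILINE)
        if match:
            found[field] = match.group(1)
    return found


# Identity lines copied from the three hdparm -I dumps above.
DUMPS = {
    "mv-moz2-linux-ix-slave12":
        "Model Number: ST3250318AS\nSerial Number: 5VY0LB7E\nFirmware Revision: CC45\n",
    "mv-moz2-linux-ix-slave15":
        "Model Number: ST3250318AS\nSerial Number: 5VY17LN3\nFirmware Revision: CC45\n",
    "mv-moz2-linux-ix-slave11":
        "Model Number: ST3250318AS\nSerial Number: 5VY0LAK8\nFirmware Revision: CC45\n",
}

by_firmware = defaultdict(list)
for host, dump in DUMPS.items():
    info = identity(dump)
    by_firmware[(info["Model Number"], info["Firmware Revision"])].append(host)

for key, hosts in by_firmware.items():
    print(key, hosts)
```

All three hosts report the same model and firmware, yet their buffered reads were 125.13, 92.62 and 123.04 MB/sec, so firmware alone does not explain the spread among these three.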
Updated•14 years ago
Whiteboard: [buildslaves][hardware] → [buildslaves][hardware][triagefollowup]
Comment 27•14 years ago
This is pretty important and we need to make progress here.
Can we start taking these machines offline in batches, i.e. gracefully shut down one slave from each platform (linux, linux64, win32, win64), run the diagnostics, add those slaves back to the pool, and then move on to the next batch?
Also, it's probably not ideal to post the results for each slave in the bug. I'd suggest creating a subdir for the output logs on people.mozilla.com and linking to it from the bug.
Not fun, I realize, but required.
Priority: -- → P3
Whiteboard: [buildslaves][hardware][triagefollowup] → [buildslaves][hardware]
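The batching proposed in comment 27 (one slave from each platform at a time) can be sketched in Python as below. This is only an illustration of the scheduling idea: the `batches` function is hypothetical, and the host lists are a small subset of the serial-number list earlier in this bug.

```python
from itertools import zip_longest


def batches(by_platform: dict[str, list[str]]) -> list[list[str]]:
    """Build batches holding at most one host from each platform."""
    rounds = zip_longest(*by_platform.values())
    # zip_longest pads exhausted platforms with None; drop the padding.
    return [[host for host in row if host is not None] for row in rounds]


# A few hosts from the serial-number list earlier in this bug.
HOSTS = {
    "linux": ["linux-ix-slave11", "linux-ix-slave19", "linux-ix-slave21"],
    "linux64": ["linux64-ix-slave16", "linux64-ix-slave17"],
    "win32": ["w32-ix-slave05", "w32-ix-slave17"],
    "win64": ["w64-ix-slave09"],
}

for number, batch in enumerate(batches(HOSTS), start=1):
    print(number, batch)
```

Each batch removes at most one machine per platform from the pool, so no single platform loses much capacity while the diagnostics run.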
Comment 28•14 years ago
I really think we should put some time into this bug. Some examples of wrongness I've seen today:
* linux-ix-slave13 taking 2+ hours to do a 1.9.2 unit test build, holding rs up
* w32-ix-slave16 taking 4hrs 20 mins to compile a try opt build
We need some data so that IX knows what to fix up.
Comment 29•14 years ago
Maybe we could do this during the mega-downtime this coming Friday.
Assignee: jhford → nobody
Comment 30•14 years ago
(In reply to comment #29)
> maybe we could do this during the mega-downtime this comming friday
Sure, but let's be specific here:
* who's going to be around to do this on Friday, given that many of us are traveling?
* will it be IT or RelEng running the tests?
* is it just hdparm output we're looking for, or are there other tests we could/should be running?
Comment 31•14 years ago
I think I posted this somewhere else, but can't find it right now...
Since the IX machines can boot off an image provided via the ipmi interface, we could boot off something like http://www.sysresccd.org/, which provides hdparm.
Comment 32•14 years ago
(In reply to comment #30)
> (In reply to comment #29)
> > maybe we could do this during the mega-downtime this comming friday
>
> Sure, but let's be specific here:
>
> * who's going to be around to do this on Friday, given that many of us are
> traveling?
> * will it be IT or RelEng running the tests?
Debugging this hardware difference is something for IT. Pushing to zandr after talking with him on irc.
> * is it just hdparm output we're looking for, or are there other tests we
> could/should be running?
Assignee: nobody → zandr
Comment 33•14 years ago
zandr: if we need to take (more of) these out of service at some point to get this done, just let me know.
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Assignee
Comment 34•14 years ago
At this point, I'd like to circle back with iX and see what they're thinking. We've had a number of outright failures, and this is a lot of data collection.
Assignee
Comment 35•14 years ago
(In reply to comment #31)
> I think I posted this somewhere else, but can't find it right now...
>
> Since the IX machines can boot off an image provided via the ipmi interface, we
> could boot off something like http://www.sysresccd.org/, which provides hdparm.
Good call. I've put an image at \Users\administrator\public\sysresccd-x86-1.6.4.iso that works well enough. From linux64-ix-slave33:
root@sysresccd /root % hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 6VMGX2N0
Firmware Revision: CC38
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0029)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
cache/buffer size = 8192 KBytes
Nominal Media Rotation Rate: 7200
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Long Sector Access (AC1)
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
46min for SECURITY ERASE UNIT. 46min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50027fec700
NAA : 5
IEEE OUI : 000c50
Unique ID : 027fec700
Checksum: correct
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13764 MB in 2.00 seconds = 6898.63 MB/sec
Timing buffered disk reads: 290 MB in 3.01 seconds = 96.24 MB/sec
root@sysresccd /root %
Assignee
Comment 36•14 years ago
Known failures on:
linux-ix-slave13 bug 619624
linux-ix-slave33 bug 602288
linux-ix-slave35 bug 602288
w32-ix-slave41 bug 615744
Assignee
Comment 37•14 years ago
linux64-ix-slave40:
root@sysresccd /root % hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 9VMKCQF2
Firmware Revision: CC46
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0029)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
cache/buffer size = 8192 KBytes
Nominal Media Rotation Rate: 7200
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Long Sector Access (AC1)
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
44min for SECURITY ERASE UNIT. 44min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50026db6f64
NAA : 5
IEEE OUI : 000c50
Unique ID : 026db6f64
Checksum: correct
root@sysresccd /root % hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 13892 MB in 2.00 seconds = 6962.22 MB/sec
Timing buffered disk reads: 284 MB in 3.01 seconds = 94.27 MB/sec
root@sysresccd /root %
Assignee
Comment 38•14 years ago
Not on the list, but a reported failure from https://bugzilla.mozilla.org/show_bug.cgi?id=620948#c22
[root@linux-ix-slave34 ~]# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 6VY7450E
Firmware Revision: CC38
Transport: Serial
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* SATA-I signaling speed (1.5Gb/s)
* SATA-II signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@linux-ix-slave34 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29340 MB in 1.99 seconds = 14740.90 MB/sec
Timing buffered disk reads: 58 MB in 3.12 seconds = 18.58 MB/sec
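A result like this one stands out more clearly when set against the other buffered-read figures posted so far. The Python sketch below flags hosts that fall far below the median; the `flag_slow` helper and its 0.5 cutoff are arbitrary illustrative choices, not values from IX or from this bug. Note also that the linux64 figures were taken from a sysresccd boot rather than the production OS, so the comparison is rough.

```python
from statistics import median

# Buffered disk read rates (MB/sec) reported earlier in this bug.
RATES = {
    "mv-moz2-linux-ix-slave12": 125.13,
    "mv-moz2-linux-ix-slave15": 92.62,
    "mv-moz2-linux-ix-slave11": 123.04,
    "linux64-ix-slave33": 96.24,
    "linux64-ix-slave40": 94.27,
    "linux-ix-slave34": 18.58,
}


def flag_slow(rates: dict[str, float], fraction: float = 0.5) -> list[str]:
    """Return hosts whose rate is below `fraction` of the median rate.

    The 0.5 default is an arbitrary illustrative cutoff, not a value
    taken from this bug.
    """
    cutoff = fraction * median(rates.values())
    return sorted(host for host, rate in rates.items() if rate < cutoff)


print(flag_slow(RATES))
```

With these six measurements, only linux-ix-slave34 falls below half the median, which fits its separately reported failure.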
Assignee
Comment 39•14 years ago
A1-16213 linux64-ix-slave41
root@sysresccd /root % hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 6VMGSAD1
Firmware Revision: CC38
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0029)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
cache/buffer size = 8192 KBytes
Nominal Media Rotation Rate: 7200
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Long Sector Access (AC1)
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
44min for SECURITY ERASE UNIT. 44min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50027f5e723
NAA : 5
IEEE OUI : 000c50
Unique ID : 027f5e723
Checksum: correct
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13676 MB in 2.00 seconds = 6854.45 MB/sec
Timing buffered disk reads: 278 MB in 3.01 seconds = 92.45 MB/sec
root@sysresccd /root %
Assignee
Comment 40•14 years ago
A1-16188 linux64-ix-slave16
root@sysresccd /root % hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 5VMF041Q
Firmware Revision: CC38
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0029)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
cache/buffer size = 8192 KBytes
Nominal Media Rotation Rate: 7200
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Long Sector Access (AC1)
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50027f25694
NAA : 5
IEEE OUI : 000c50
Unique ID : 027f25694
Checksum: correct
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13468 MB in 2.00 seconds = 6749.66 MB/sec
Timing buffered disk reads: 84 MB in 3.02 seconds = 27.78 MB/sec
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13826 MB in 2.00 seconds = 6929.62 MB/sec
Timing buffered disk reads: 82 MB in 3.02 seconds = 27.19 MB/sec
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13838 MB in 2.00 seconds = 6935.35 MB/sec
Timing buffered disk reads: 84 MB in 3.02 seconds = 27.85 MB/sec
root@sysresccd /root %
Assignee | ||
Comment 41•14 years ago
|
||
A1-16189 linux64-ix-slave17
root@sysresccd /root % hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 5VMEYY27
Firmware Revision: CC38
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0029)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
cache/buffer size = 8192 KBytes
Nominal Media Rotation Rate: 7200
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Long Sector Access (AC1)
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
44min for SECURITY ERASE UNIT. 44min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50027ecb27d
NAA : 5
IEEE OUI : 000c50
Unique ID : 027ecb27d
Checksum: correct
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13828 MB in 2.00 seconds = 6930.66 MB/sec
Timing buffered disk reads: 288 MB in 3.02 seconds = 95.49 MB/sec
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13764 MB in 2.00 seconds = 6897.89 MB/sec
Timing buffered disk reads: 296 MB in 3.02 seconds = 98.16 MB/sec
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13782 MB in 2.00 seconds = 6907.71 MB/sec
Timing buffered disk reads: 290 MB in 3.01 seconds = 96.36 MB/sec
root@sysresccd /root %
Assignee | ||
Comment 42•14 years ago
|
||
A1-16114 w64-ix-slave09
root@sysresccd /root % hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 6VMGJETD
Firmware Revision: CC38
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0029)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
cache/buffer size = 8192 KBytes
Nominal Media Rotation Rate: 7200
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Long Sector Access (AC1)
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
46min for SECURITY ERASE UNIT. 46min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50027e09cf2
NAA : 5
IEEE OUI : 000c50
Unique ID : 027e09cf2
Checksum: correct
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13688 MB in 2.00 seconds = 6859.80 MB/sec
Timing buffered disk reads: 316 MB in 3.01 seconds = 104.99 MB/sec
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13646 MB in 2.00 seconds = 6839.39 MB/sec
Timing buffered disk reads: 316 MB in 3.01 seconds = 104.98 MB/sec
root@sysresccd /root % hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 13706 MB in 2.00 seconds = 6869.15 MB/sec
Timing buffered disk reads: 314 MB in 3.01 seconds = 104.40 MB/sec
root@sysresccd /root %
Comment 43•14 years ago
|
||
In comment #3, bhearsum mentions linux-ix-slave17 as a slave encountering this problem, but I don't see it in the list of slaves in comment #20.
Is the list in comment #20 meant to be exhaustive?
Assignee | ||
Comment 44•14 years ago
|
||
The list in comment #20 is the list of machines iX systems wanted to sample, not the list of problem machines. Many of them do not have performance problems at all. But I haven't tested a lot of them yet.
Spreadsheet showing testing and failures here:
https://spreadsheets.google.com/ccc?key=0AqPtmipKTyewdFNHRzVnZ1RhOFNsWmdmbjV4UmRwLXc&authkey=CMHYj7kD&hl=en#gid=0
Will need to work with buildduty to take those machines out of service for testing, but I'm currently prioritizing that below the nagios problem.
Assignee | ||
Comment 45•14 years ago
|
||
Making this the bug for all the ix-slave drive issues.
Whiteboard: [buildslaves][hardware] → [buildslaves][hardware][duptome]
Assignee | ||
Updated•14 years ago
|
Summary: latest batch of ix machines have slow i/o → latest batch of ix machines have slow and failing drives
Updated•14 years ago
|
Alias: ix-drive-issues
Comment 50•14 years ago
|
||
I just linked to this bug:
* linux-ix-slave01 (bug 624371)
* linux-ix-slave06 (bug 624210)
Comment 51•14 years ago
|
||
I also suspect w32-ix-slave23
Comment 52•14 years ago
|
||
w32-ix-slave07 is also pretty slow. About 6 hours for a try opt build is well over the expected value. Still in prod at the moment.
Comment 53•14 years ago
|
||
zandr: any progress here?
Assignee | ||
Comment 54•14 years ago
|
||
Not really. Going to collate the long list of machines you guys have marked as broken and re-ping iX.
Assignee | ||
Comment 55•14 years ago
|
||
[root@linux-ix-slave16 ~]# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 6VY6FJND
Firmware Revision: CC38
Transport: Serial
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* SATA-I signaling speed (1.5Gb/s)
* SATA-II signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@linux-ix-slave16 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 12560 MB in 2.01 seconds = 6240.93 MB/sec
Timing buffered disk reads: 88 MB in 3.01 seconds = 29.20 MB/sec
Assignee | ||
Comment 56•14 years ago
|
||
[root@linux-ix-slave31 ~]# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 5VMF5BG2
Firmware Revision: CC38
Transport: Serial
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* SATA-I signaling speed (1.5Gb/s)
* SATA-II signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@linux-ix-slave31 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29332 MB in 1.99 seconds = 14737.70 MB/sec
Timing buffered disk reads: 342 MB in 3.00 seconds = 113.82 MB/sec
[root@linux-ix-slave31 ~]#
Assignee | ||
Comment 57•14 years ago
|
||
(In reply to comment #11)
> Confirming Lukas's comment
>
> mv-moz2-linux-slave02
> linux-ix-slave14
>
> linux-ix-slave31 (scl)
>
> These machines were taken by Chris Williams from IX Systems today to
> investigate the i/o issues, will report back when I receive an update from IX
These machines are back, linux-ix-slave14 and 31 are now in scl1. -31 looks fast now. I'll get mv-..slave02 back up today.
Assignee | ||
Comment 58•14 years ago
|
||
Oooh, this is ugly:
[root@linux-ix-slave34 ~]# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ST3250318AS
Serial Number: 6VY7450E
Firmware Revision: CC38
Transport: Serial
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 488397168
device size with M = 1024*1024: 238475 MBytes
device size with M = 1000*1000: 250059 MBytes (250 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* SATA-I signaling speed (1.5Gb/s)
* SATA-II signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
[root@linux-ix-slave34 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29628 MB in 1.99 seconds = 14885.19 MB/sec
Timing buffered disk reads: read(2097152) returned 499712 bytes
[root@linux-ix-slave34 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 29336 MB in 1.99 seconds = 14737.86 MB/sec
Timing buffered disk reads: read(2097152) returned 503808 bytes
[root@linux-ix-slave34 ~]#
Reporter | ||
Comment 60•14 years ago
|
||
From bug 624210:
(In reply to comment #6)
> Sad, but not very sad. Here's the usual test data for iX:
>
> [root@linux-ix-slave06 ~]# hdparm -I /dev/sda
>
> /dev/sda:
>
> ATA device, with non-removable media
> Model Number: ST3250318AS
> Serial Number: 9VY95DR1
> Firmware Revision: CC38
> Transport: Serial
> Standards:
> Supported: 8 7 6 5
> Likely used: 8
> Configuration:
> Logical max current
> cylinders 16383 16383
> heads 16 16
> sectors/track 63 63
> --
> CHS current addressable sectors: 16514064
> LBA user addressable sectors: 268435455
> LBA48 user addressable sectors: 488397168
> device size with M = 1024*1024: 238475 MBytes
> device size with M = 1000*1000: 250059 MBytes (250 GB)
> Capabilities:
> LBA, IORDY(can be disabled)
> Queue depth: 32
> Standby timer values: spec'd by Standard, no device specific minimum
> R/W multiple sector transfer: Max = 16 Current = ?
> Recommended acoustic management value: 254, current value: 0
> DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 *udma3 udma4 udma5 udma6
> Cycle time: min=120ns recommended=120ns
> PIO: pio0 pio1 pio2 pio3 pio4
> Cycle time: no flow control=120ns IORDY flow control=120ns
> Commands/features:
> Enabled Supported:
> * SMART feature set
> Security Mode feature set
> * Power Management feature set
> * Write cache
> * Look-ahead
> * Host Protected Area feature set
> * WRITE_BUFFER command
> * READ_BUFFER command
> * DOWNLOAD_MICROCODE
> SET_MAX security extension
> * Automatic Acoustic Management feature set
> * 48-bit Address feature set
> * Device Configuration Overlay feature set
> * Mandatory FLUSH_CACHE
> * FLUSH_CACHE_EXT
> * SMART error logging
> * SMART self-test
> * General Purpose Logging feature set
> * WRITE_{DMA|MULTIPLE}_FUA_EXT
> * 64-bit World wide name
> Write-Read-Verify feature set
> * WRITE_UNCORRECTABLE command
> * {READ,WRITE}_DMA_EXT_GPL commands
> * Segmented DOWNLOAD_MICROCODE
> * SATA-I signaling speed (1.5Gb/s)
> * SATA-II signaling speed (3.0Gb/s)
> * Native Command Queueing (NCQ)
> * Phy event counters
> Device-initiated interface power management
> * Software settings preservation
> Security:
> Master password revision code = 65534
> supported
> not enabled
> not locked
> not frozen
> not expired: security count
> supported: enhanced erase
> 42min for SECURITY ERASE UNIT. 42min for ENHANCED SECURITY ERASE UNIT.
> Checksum: correct
> [root@linux-ix-slave06 ~]# hdparm -tT /dev/sda
>
> /dev/sda:
> Timing cached reads: 29336 MB in 1.99 seconds = 14738.11 MB/sec
> Timing buffered disk reads: 280 MB in 3.01 seconds = 93.00 MB/sec
Reporter | ||
Comment 61•14 years ago
|
||
w32-ix-slave41 is incapable of cloning large repositories and feels *very* slow overall.
Comment 62•14 years ago
|
||
(In reply to comment #57)
> (In reply to comment #11)
> > Confirming Lukas's comment
> >
> > mv-moz2-linux-slave02
> > linux-ix-slave14
> >
> > linux-ix-slave31 (scl)
> >
> > These machines were taken by Chris Williams from IX Systems today to
> > investigate the i/o issues, will report back when I receive an update from IX
>
> These machines are back, linux-ix-slave14 and 31 are now in scl1. -31 looks
> fast now. I'll get mv-..slave02 back up today.
Not sure what happened exactly, but linux-ix-slave31 was burning at least one build.
~cltbld/.ssh was owned by root and had staging keys. I chowned it to cltbld and did an rsync from linux-ix-slave32:.ssh/ (had to rsync with --rsh="ssh -oBatchMode=no").
Assignee | ||
Comment 63•14 years ago
|
||
I've gone back and forth a couple of times with iX Systems about this. They believe that this is caused by vibration from the fans. They had initially suggested that going to lower speed fans would be a suitable solution, but after further discussion we'd rather not reduce cooling in the machines.
We're going to replace the fans in known bad machines, including testing the w64 and linux64 stacks, and go from there.
The problem is reproducible at iX, and they have adjusted their burnin procedures to catch this.
Assignee | ||
Comment 64•14 years ago
|
||
Assigning to Spencer to collect data on the machines that aren't in service yet.
w64-ix-slave07 through slave41
linux64-ix-slave01 through slave41
On each of these machines:
1. Boot through IPMI into sysresccd as a Virtual CD Image.
2. Run 'hdparm -I /dev/sda' and 'hdparm -tT /dev/sda' and capture the output.
This is easiest if you set a root password once you've booted into sysresccd and then ssh in.
Please run through these machines as you can and put the data in attachments. In particular, we're looking for any machines where hdparm -tT reports buffered disk reads < 90MB/sec.
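A minimal sketch of how the captured output could be screened against that 90MB/sec figure. The function names and the choice to judge a machine by its best run are assumptions of this sketch, not something iX or this bug prescribes; the regular expression only targets the "buffered disk reads" line as it appears in the captures pasted in this bug.

```python
import re

# Matches the success line that hdparm -tT prints for buffered reads.
BUFFERED_RE = re.compile(
    r"Timing buffered disk reads:\s+\d+ MB in\s+[\d.]+ seconds =\s+([\d.]+) MB/sec"
)

THRESHOLD_MB_S = 90.0  # cut-off named in this comment


def buffered_rates(output):
    """Return every buffered-read rate (MB/sec) found in captured hdparm -tT output."""
    return [float(m.group(1)) for m in BUFFERED_RE.finditer(output)]


def classify(output, threshold=THRESHOLD_MB_S):
    """Label a capture 'failed', 'slow', or 'ok'.

    'failed' means no successful buffered-read timing was found, for
    example the short-read error pasted in comment 58.
    """
    rates = buffered_rates(output)
    if not rates:
        return "failed"
    # Assumption: judge by the best run so one noisy sample does not
    # condemn a machine. Swap in min() or a mean if that suits better.
    return "slow" if max(rates) < threshold else "ok"
```

Feeding it the text of one capture file per machine gives a quick first pass; anything labelled "slow" or "failed" still deserves a look at the raw output.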
Assignee: zandr → shui
Component: Server Operations → Server Operations: RelEng
QA Contact: mrz → zandr
Comment 65•14 years ago
|
||
w64-ix-slave07 through slave41 data have been collected
started on linux64-ix-slave01 through slave41
Reporter | ||
Comment 66•14 years ago
|
||
Not sure if this is still the right place to add new things, but w32-ix-slave02 is much slower than the other Windows iX machines; I think it needs a new fan.
Comment 67•14 years ago
|
||
Data collected from Linux 64 1-20
Comment 68•14 years ago
|
||
Data from Linux 64 21-41
Comment 69•14 years ago
|
||
Data from W64 7-22
Comment 70•14 years ago
|
||
Data from w64 23-41
Updated•14 years ago
|
Assignee: shui → server-ops-releng
Comment 72•14 years ago
|
||
(In reply to comment #52)
> w32-ix-slave07 is also pretty slow. About 6 hours for a try opt build is well
> over the expected value. Still in prod at the moment.
Pulled out.
Assignee | ||
Comment 73•14 years ago
|
||
[subject to embargo] because this involves yanking machines out of racks that are full of production boxes. I'm not willing to risk pulling the wrong cable after walking around a row of racks.
Whiteboard: [buildslaves][hardware][duptome] → [buildslaves][hardware][duptome][subject to embargo]
Comment 74•14 years ago
|
||
1) zandr to parse attachments to create list of machines to be pulled and shipped back to IX systems. For everyone else's sanity, a clear list will be posted to this bug! :-)
2) IX systems planning on replacing defective fans on machines.
Comment 75•14 years ago
|
||
(In reply to comment #74)
> 1) zandr to parse attachments to create list of machines to be pulled and
> shipped back to IX systems. For everyone else's sanity, a clear list will be
> posted to this bug! :-)
(clicked too soon: physically pulling the machines is subject to embargo, but creating the list is not.)
> 2) IX systems planning on replacing defective fans on machines.
Assignee | ||
Comment 76•14 years ago
|
||
w32-ix-slave08 has a failed drive "S.M.A.R.T. status BAD" in BIOS
Comment 77•14 years ago
|
||
The list of slaves I have with slow IO is:
linux-ix-slave33
linux-ix-slave34
linux-ix-slave35
mv-moz2-linux-ix-slave12
w32-ix-slave08
w32-ix-slave23
I will shut all of these machines down right now. Zandr, if you can union this with your list and post here, I will make sure that any still-running machines are safely shut down.
Comment 78•14 years ago
|
||
List of the machines that run below 90MB/s in the hdparm -tT result
linux64-ix-slave04 89.98 MB/sec
linux64-ix-slave10 69.91 MB/sec
linux64-ix-slave11 21.27 MB/sec
linux64-ix-slave16 73.21 MB/sec
w64-ix-slave07 85.02 MB/sec
w64-ix-slave11 58.65 MB/sec
w64-ix-slave23 92.39 MB/sec
Assignee | ||
Comment 79•14 years ago
|
||
(In reply to comment #78)
> List of the machines that run below 90MB/s in the hdparm -tT result
> w64-ix-slave23 92.39 MB/sec
92 > 90 ?
Comment 80•14 years ago
|
||
So the combined list, not considering 92 as less than 90 (accuracy, accuracy, accuracy!!) is:
linux-ix-slave33 - shut down
linux-ix-slave34 - shut down
linux-ix-slave35 - shut down
linux64-ix-slave04 - not in production, ok to yank power
linux64-ix-slave10 - not in production, ok to yank power
linux64-ix-slave11 - not in production, ok to yank power
linux64-ix-slave16 - not in production, ok to yank power
mv-moz2-linux-ix-slave12 - shut down
w32-ix-slave08 - shut down
w32-ix-slave23 - shut down
w64-ix-slave07 - not in production, ok to yank power
w64-ix-slave11 - not in production, ok to yank power
w64-ix-slave23 - not in production, ok to yank power
so I'm happy to see this bunch uncabled and shipped off at your collective convenience. Then let's close this bug and re-open slave-specific bugs for any subsequent failures we see. Zandr, is there something in place with IT to track slaves that have been sent to IX so that we don't double-send one?
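For anyone re-deriving the combined list, here is a rough sketch of the union: the production slaves from comment 77 plus the measured machines from comment 78 that fall strictly below 90 MB/sec. The variable names are illustrative only. Note that a strict filter leaves w64-ix-slave23 (92.39 MB/sec) out.

```python
THRESHOLD_MB_S = 90.0

# Slaves reported slow in production (comment 77); no hdparm figure given there.
production_slow = {
    "linux-ix-slave33",
    "linux-ix-slave34",
    "linux-ix-slave35",
    "mv-moz2-linux-ix-slave12",
    "w32-ix-slave08",
    "w32-ix-slave23",
}

# Buffered disk read figures from comment 78, in MB/sec.
measured = {
    "linux64-ix-slave04": 89.98,
    "linux64-ix-slave10": 69.91,
    "linux64-ix-slave11": 21.27,
    "linux64-ix-slave16": 73.21,
    "w64-ix-slave07": 85.02,
    "w64-ix-slave11": 58.65,
    "w64-ix-slave23": 92.39,
}

# Keep only machines strictly below the threshold, then union with the
# production reports and sort for a stable, readable list.
below_threshold = {host for host, rate in measured.items() if rate < THRESHOLD_MB_S}
combined = sorted(production_slow | below_threshold)

for host in combined:
    print(host)
```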
Assignee | ||
Comment 81•14 years ago
|
||
Also found:
linux64-ix-slave12 wasn't tested (the file by that name contains linux64-ix-slave11's results)
linux64-ix-slave13 wasn't tested (the file does not contain hdparm -tT results)
linux64-ix-slave17 wasn't tested (file contains slave16's results)
linux64-ix-slave26 wasn't tested (file contains slave25's results)
w64-ix-slave29 wasn't tested (file contains slave27's results)
w64-ix-slave31 wasn't tested (file contains slave30's results)
w64-ix-slave36 wasn't tested (file contains slave35's results)
w64-ix-slave37 wasn't tested (file contains slave30's results)
Spencer, please retest these 8 machines.
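One way to catch this kind of mix-up automatically, sketched under the assumption that each capture file holds the 'hdparm -I' text for exactly one drive: group the files by the drive serial number and flag any serial that shows up under more than one label. The function name and the file labels in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# hdparm -I prints the drive serial on a "Serial Number:" line.
SERIAL_RE = re.compile(r"Serial Number:\s*(\S+)")


def duplicate_serials(captures):
    """Find drive serials that appear under more than one capture label.

    captures maps a label (e.g. a file name) to the captured hdparm -I text.
    Returns {serial: [labels]} for every serial seen more than once.
    """
    by_serial = defaultdict(list)
    for label, text in captures.items():
        match = SERIAL_RE.search(text)
        if match:
            by_serial[match.group(1)].append(label)
    return {
        serial: sorted(labels)
        for serial, labels in by_serial.items()
        if len(labels) > 1
    }
```

For example, if hypothetical files "capture-a" and "capture-b" both contain serial 5VMF041Q, the function returns that serial with both labels, which means one of the two files was saved under the wrong machine name. It does not catch a file that is simply missing its 'hdparm -tT' section; that needs a separate check.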
Assignee | ||
Comment 82•14 years ago
|
||
My unified list, rolling up all the failures I know about (including comment 80):
linux-ix-slave01: bug 624371
linux-ix-slave06: bug 624210
linux-ix-slave13: bug 619624
linux-ix-slave16: comment 8
linux-ix-slave17: comment 55
linux-ix-slave33: bug 620124
linux-ix-slave34: comment 38, comment 58
linux-ix-slave35: bug 620124
linux-ix-slave42: bug 624207
linux64-ix-slave04: comment 78
linux64-ix-slave10: comment 78
linux64-ix-slave11: comment 78
linux64-ix-slave16: comment 78
mv-moz2-linux-ix-slave02:
w32-ix-slave07:
w32-ix-slave08: bug 635416#c31
w32-ix-slave41: bug 615744
w64-ix-slave02: bug 638814
w64-ix-slave07: comment 78
w64-ix-slave11: comment 78
Dustin- could you verify that the additional machines are all out of service?
Assignee | ||
Updated•14 years ago
|
Assignee: server-ops-releng → shui
Comment 83•14 years ago
|
||
linux64-ix-slave11 37.61 MB/sec
linux64-ix-slave12 29.97 MB/sec
linux64-ix-slave13 29.97 MB/sec
linux64-ix-slave26 bad drive
Assignee | ||
Comment 84•14 years ago
|
||
(In reply to comment #82)
> My unified list, rolling up all the failures I know about (including comment
> 80):
> mv-moz2-linux-ix-slave02:
Typo, that's
mv-moz2-linux-ix-slave12
Will comment with complete list when I get the test data back from spencer.
Assignee | ||
Comment 85•14 years ago
|
||
Final consolidated list:
linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
Comment 86•14 years ago
|
||
linux-ix-slave01 bug 624371 A1-16072 4620 scl1
linux-ix-slave06 bug 624210 A1-16077 4625 scl1
linux-ix-slave13 bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
all in production
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
all shut down and ready to go
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
can't connect, but in staging - yank its power cord
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
all in staging, and yet to be reimaged - yank cord
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
all shut down and ready to go
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
all in staging, and yet to be reimaged - yank cord
shouldn't w64-ix-slave23 be on the list, too?
Comment 87•14 years ago
|
||
Sorry, that should have ended with a question: do you want to send these all back together, in which case I'll shut down the production slaves, or will you be batching them, in which case let's leave them up since we're short on slaves right now?
Assignee | ||
Comment 88•14 years ago
|
||
Let's leave the ones that are in production up, and send the rest of these off first. When we get them back online, we can shut down the production machines.
Comment 89•14 years ago
|
||
w32-ix-slave23 is also missing from comment #85.
Assignee
Comment 90•14 years ago
(In reply to comment #89)
> w32-ix-slave23 is also missing from comment #85.
Per comment 78 and comment 79, it's over the (admittedly arbitrary) 90MB/s threshold, and thus not in the list.
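The threshold check described here can be sketched in a few lines of Python. This is only an illustration: the 90 MB/s cutoff is the figure from this comment, the function names are hypothetical, and taking the best of several runs is an assumption rather than the rule actually applied in this bug.

```python
import re

# Assumed cutoff: the "admittedly arbitrary" 90 MB/s figure from this comment.
THRESHOLD_MB_S = 90.0

# Matches the "Timing buffered disk reads" line printed by `hdparm -tT`.
BUFFERED_RE = re.compile(
    r"Timing buffered disk reads:.*=\s*([0-9.]+)\s*MB/sec"
)

def buffered_read_speeds(hdparm_output):
    """Return every buffered-read speed (MB/s) found in the text."""
    return [float(m) for m in BUFFERED_RE.findall(hdparm_output)]

def is_slow(hdparm_output, threshold=THRESHOLD_MB_S):
    """True when the best observed buffered read is under the threshold."""
    speeds = buffered_read_speeds(hdparm_output)
    if not speeds:
        raise ValueError("no buffered-read timing found")
    return max(speeds) < threshold

# One run from linux-ix-slave07, copied from the top of this bug.
sample = """\
/dev/sda:
Timing cached reads: 29336 MB in 1.99 seconds = 14738.42 MB/sec
Timing buffered disk reads: 256 MB in 3.02 seconds = 84.76 MB/sec
"""
print(is_slow(sample))
```

Pasting several consecutive runs into one string works as well, since every buffered-read line is collected before the comparison.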
Comment 91•14 years ago
Those comments are for w64-ix-slave23. I'm referring to comment #51, where we noticed it was slow in production, and comment #77.
Assignee
Comment 92•14 years ago
(In reply to comment #91)
> Those comments are for w64-ix-slave23. I'm referring to comment #51, where we
> noticed it was slow in production, and comment #77.
Similar names are similar. Sigh.
Thanks, good catch. Added below.
linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave23: comment 51 A1-16069 4617 scl1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
Assignee
Comment 93•14 years ago
This is back on my plate to get iX to come collect them.
Per conversation with Dustin, we'll leave the ones that are in production running and swap them out after we get the others back from iX.
Assignee: shui → zandr
Comment 94•14 years ago
buildbot-master6, nee w64-ix-slave06, has the same 'ata1: spurious interrupt' messages as linux-ix-slave01 and linux-ix-slave13. I think it should also go back to IX.
Comment 95•14 years ago
Should we also consider running a more in-depth test like bonnie++?
Assignee
Comment 96•14 years ago
(In reply to comment #94)
> buildbot-master6, nee w64-ix-slave06, has the same 'ata1: spurious interrupt'
> messages as linux-ix-slave01 and linux-ix-slave13. I think it should also go
> back to IX.
Roger, I'll put that in the list.
(In reply to comment #95)
> Should we also consider running a more in-depth test like bonnie++?
If you find performance problems that don't show as errors, we should look at them, but I don't see any point in going back through this list of machines and doing more testing. iX has improved their burn-in procedures to catch these sorts of problems, and these machines will all go through burn-in before they come back from repair.
Assignee
Comment 98•14 years ago
linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave23: comment 51 A1-16069 4617 scl1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave06: bug 639628#c22 A1-16111 4712 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
Assignee
Comment 101•14 years ago
[root@linux64-ix-slave41 ~]# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 28928 MB in 1.99 seconds = 14532.52 MB/sec
Timing buffered disk reads: 160 MB in 3.07 seconds = 52.16 MB/sec
So, current list is:
linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
linux64-ix-slave41: comment 101 A1-16213 4814 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave23: comment 51 A1-16069 4617 scl1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave06: bug 639628#c22 A1-16111 4712 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
Starting to power down the machines that are not in production per comment 86.
Assignee
Comment 102•14 years ago
Added linux64-ix-slave40
linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
linux64-ix-slave40: comment 102 A1-16212 4813 scl1
linux64-ix-slave41: comment 101 A1-16213 4814 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave23: comment 51 A1-16069 4617 scl1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave06: bug 639628#c22 A1-16111 4712 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
Assignee
Comment 103•14 years ago
All of the scl1 machines have been pulled and are awaiting pickup by iX (should happen today).
Comment 104•14 years ago
This list has been a bit leaky, and we're slowly finding other machines with similar problems. Coming back to this bug later will be a major headache, so we should probably track which machines have seen this treatment in the inventory somehow.
Comment 105•14 years ago
I just noticed that comment 103 suggests we sent the four in-production slaves back as well (linux-ix-slave01, 06, 13, and 16). Is that the case?
Assignee
Comment 106•14 years ago
(In reply to comment #105)
> I just noticed that comment 103 suggests we sent the four in-production slaves
> back as well (linux-ix-slave01, 06, 13, and 16). Is that the case?
We did. It was the opinion of buildduty at the time that the mtv1 slaves were running well enough post-firewall changes that we could afford to lose the production machines.
To your point in comment 104, I'll add the return-from-repair date to the notes field in inventory. As you find other machines, where are you noting them?
Comment 107•14 years ago
(In reply to comment #106)
> We did. It was the opinion of buildduty at the time that the mtv1 slaves were
> running well enough post-firewall changes that we could afford to lose the
> production machines.
Excellent, good to know.
> To your point in comment 104, I'll add the return-from-repair date to the notes
> field in inventory. As you find other machines, where are you noting them?
They're slowly getting bugs that aren't actively being duped here. bug 643397 is the first (aside from w64-ix-slave23, the dropping of which from this bug I still haven't seen explained?).
Assignee
Comment 108•14 years ago
(In reply to comment #107)
> They're slowly getting bugs that aren't actively being duped here. bug 643397
> is the first
That was filed at quarter to six this morning, it's now just after 9. We may have a small gap on the definition of 'active'. :D That said, there's an opportunity here. New bug for the next batch of failures?
> (aside from w64-ix-slave23, the dropping of which from this bug I
> still haven't seen explained?).
comment 79?
Comment 109•14 years ago
(In reply to comment #108)
> (In reply to comment #107)
>
> > They're slowly getting bugs that aren't actively being duped here. bug 643397
> > is the first
>
> That was filed at quarter to six this morning, it's now just after 9. We may
> have a small gap on the definition of 'active'. :D That said, there's an
> opportunity here. New bug for the next batch of failures?
"not .. actively" meant I'm not doing it - sorry for any ambiguity here. If you'd like a new tracker bug for the next batch, that sounds good. Open one up and copy me? We'll probably continue to have separate bugs to dupe into it, and won't dupe until we've verified (a) the machine hasn't already had its fans fixed and (b) the issue has a decent probability of being fan-related.
> > (aside from w64-ix-slave23, the dropping of which from this bug I
> > still haven't seen explained?).
>
> comment 79?
Math is hard. Let's go shopping! /me shuts up about that slave.
Comment 110•14 years ago
It looks like the ETA on the repair run in comment 103 is next Tuesday, March 29. Aki, if they're not back by then, let's put in another call to IX.
Assignee
Comment 111•14 years ago
Update from iX Systems on 3/28:
Just wanted to give you a brief update on the systems we currently have. It appears most of the short-depth servers will be receiving replacement fans, and we did find a few systems with failed components as well. We expect the new parts to be in this week and are anticipating to have the systems returned to you early next week. As for the current state of each system, I have compiled a status list for you below, which also includes your asset IDs as well as our serial numbers.
A1-16077 - 04625 - Marginal Fan
A1-16213 - 04814 - Marginal Fan
A1-16185 - 04786 - Marginal Fan
A1-16087 - 04635 - Marginal Fan
A1-16084 - 04632 - Marginal Fan
A1-16182 - 04783 - Marginal Fan
A1-16069 - 04617 - Marginal Fan
A1-16188 - 04789 - Marginal Fan
A1-16072 - 04620 - Marginal Fan
A1-16164 - 04765 - Marginal Fan
A1-16212 - 04813 - Marginal Fan
A1-16183 - 04784 - Marginal Fan
A1-14132 - 03121 - Marginal Fan
A1-16053 - 04601 - Marginal Fan
A1-16176 - 04777 - Marginal Fan
A1-16107 - 04708 - Marginal Fan
A1-16184 - 04785 - Marginal Fan
A1-15844 - No Mozilla ID? - Failed Drive (WD6000BLHX-01V7BV0)
A1-16104 - 04705 - Marginal Fan & Failed Drive (ST325018AS)
A1-16116 - 04717 - Marginal Fan & Failed Drive (ST325018AS)
A1-16165 - 04766 - Marginal Fan & Failed Drive (ST325018AS)
A1-16163 - 04764 - Marginal Fan & Failed Drive (ST325018AS)
A1-16111 - 04712 - Marginal Fan & Uncorrectable MCE (Motherboard)
A1-16054 - 04602 - Marginal Fan & Uncorrectable MCE (Motherboard)
A1-16172 - 04773 - Marginal Fan & Correctable MCE (Memory)
A1-16088 - 04636 - No Problems Found, Additional Testing in Process.
A1-16112 - 04713 - No Problems Found, Additional Testing in Process.
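A cross-check of this status list against the tracking list kept earlier in the bug is easy to script. The sketch below copies only a handful of entries by hand (the dictionary and function names are illustrative); for three asset tags the serial in the tracking list does not match the serial iX reports, and nothing in this bug establishes which side is correct.

```python
# Asset tag -> serial as recorded in the tracking list earlier in this bug
# (only a handful of entries are copied here for illustration).
tracking = {
    "A1-16072": "4620",
    "A1-16163": "4674",
    "A1-16164": "4675",
    "A1-16165": "4676",
    "A1-16172": "4773",
}

# The same asset tags as reported in the iX status list above.
vendor = {
    "A1-16072": "04620",
    "A1-16163": "04764",
    "A1-16164": "04765",
    "A1-16165": "04766",
    "A1-16172": "04773",
}

def serial_mismatches(ours, theirs):
    """Return {asset: (our_serial, their_serial)} where the two disagree.

    Serials are compared as integers so a leading zero is not a mismatch.
    """
    diffs = {}
    for asset, serial in ours.items():
        other = theirs.get(asset)
        if other is not None and int(serial) != int(other):
            diffs[asset] = (serial, other)
    return diffs

for asset, (a, b) in sorted(serial_mismatches(tracking, vendor).items()):
    print(asset, a, b)
```

Asset tags that appear in only one of the two lists are ignored here; reporting them separately would be a natural extension.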
Comment 112•14 years ago
(In reply to comment #111)
> Update from iX Systems on 3/28:
>
...
> We expect the new
> parts to be in this week and are anticipating to have the systems returned to
> you early next week.
Any news?
Comment 113•14 years ago
These were re-racked and re-imaged last night. I'll track bringing those slaves back in bug 650335.
I have the following slaves marked as being down and awaiting a second repair trip:
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
So let's start piling on to put together a second list and send it off.
Comment 114•14 years ago
At zandr's discretion, we could add to the systems in comment 113:
linux-ix-slave29 (bug 643397)
linux64-ix-slave03 (bug 648528)
buildbot-master1 (bug 644991)
linux64-ix-slave35 (bug 648312)
It would also be good to test the machines in bug 637973 before they get activated in their new home.
Assignee
Comment 115•14 years ago
So, the machines in comment 113 are actually back from repair. w32-ix-slave07 and 08 were installed in SCL1 when they came back, and mv-moz2-linux-ix-slave12 is sitting on the bench in SCL1 waiting for a ride home.
Comment 116•13 years ago
Per meeting with zandr yesterday (this was also discussed last week, but I couldn't find it noted in any other bug, so adding it here):
1) on Tuesday (21 June) iX Systems tried changing fan/cooling in 4 of the ix machines in colo. 3 of 4 machines started to work much better. zandr to reconvene with vendor and decide what next.
2) we've already verified that the ix machines in 650castro are from both batch1 and batch2. This rules out the "batch2 has all bad disks" theory.
3) New theory is about vibration because of difference in floor, rack or chassis between the two locations.
3a) zandr to try moving "working" ix machines from here to scl1 and see if the machine stops working.
3b) zandr to try moving a "broken" ix machine from scl1 to 650castro to see if the machine starts working.
3c) zandr to confirm with the colo vendor how the racks are mounted to/through the raised floor (some side theories about whether the vibration problems for disks are caused by chassis design, how chassis is mounted to racks, or how racks are mounted to floor.)
Assignee
Comment 117•13 years ago
(In reply to comment #116)
> 1) on Tuesday (21 June) ix systems tried changing fan/cooling in 4 of the
> ix machines in colo. 3 of 4 machines started to work much better. zandr to
> reconvene with vendor and decide what next.
10, actually. 4 had been tested as of the conversation Thursday. Worked through the rest this evening. Excerpted from the update I just sent to iX:
We have successfully tested 7 of the 10 machines. Of those 7, 6 are within spec:
4712/A1-16111 119MB/s (WD RE4)
4719/A1-16117 128MB/s (WD RE4)
4731/A1-16130 129MB/s (WD RE4)
4735/A1-16134 127MB/s (Seagate Barracuda 7200.12)
4739/A1-16138 117MB/s (Seagate Barracuda 7200.12)
4743/A1-16142 121MB/s (Seagate Barracuda 7200.12)
One is slow:
4715/A1-16114 47MB/s (WD RE4)
Two have hardware problems:
4708/A1-16107 This one still turns itself off during boot. It was returned for repair with this complaint when you were on site, still has the problem.
4747/A1-16146 Neither the BIOS nor the OS detect a hard disk at all. I don't believe this machine has been back for repair yet, so it could simply be a bad Seagate drive.
4809/A1-16208 I can't reach this machine at all (host or IPMI) I'll verify power and network next time I'm on site. (could be tonight)
So we still have one machine that wasn't fixed by the new heatsink/fan assembly. Not sure what to make of that. The results otherwise are encouraging, even the Seagate Desktop drives seem happy.
> 3) New theory is about vibration because of difference in floor, rack or
> chassis between the two locations.
Theory has always been vibration. Slab vs. raised floor is an interesting factor.
> 3a) zandr to try moving "working" ix machines from here to scl1 and see if
> the machine stops working.
We have certainly moved lots of iX machines from mtv1 to scl1. I'll see if I can find out which ones (wasn't it all of them?) were previously running well in mtv1.
> 3b) zandr to try moving a "broken" ix machine from scl1 to 650castro to see
> if the machine starts working.
An entertaining, if academic, exercise. Given that we're getting good results from the new HSF arrangement, this is a low priority.
mtv1 (not an HA site) and scl2 (with one exception, no releng infra) are our only slab-floor sites. scl3 will be raised floor.
> 3c) zandr to confirm with the colo vendor how the racks are mounted
> to/through the raised floor
As discussed on Thursday, the racks are bolted through the floor tiles to unistrut, and the ends of each row of racks are anchored to the floor, again with unistrut.
Assignee
Comment 118•13 years ago
The testing we've done seems to validate the fix. I suspect that 4715 is being affected by its neighbors. I've sent it back to iX, but I have every reason to expect that it will test fine at iX. 4708 and 4747 are also at iX for repair.
The plan going forward:
iX will order replacement heatsinks for all the machines. Once we have a delivery date, I'll set up a day or two with iX where we'll replace the heatsinks en masse.
iX will set up with a couple of guys on-site in scl1. Mozilla will provide 3-4 folks to manage moving machines in and out of racks.
Dustin will take a set of machines out of service to 'prime the pump'. Once those are ready to come out, we'll start pulling machines out of the rack and handing them off to iX.
iX will install the new HSF and the upgraded memory, and hand the machines back to us.
We'll get them racked back up, and hand them off to arr/Dustin for a quick smoke test and return to service.
As those machines come online, new machines will be finishing builds and ready to come out for upgrade. I expect the downtime for any given machine will be on the order of 15-30 minutes, and in the name of pipelining we might have 5-10 machines off at a time.
We can pull machines in any order as they become idle.
Paul (from iX) and I did 8 machines like this in something like 45min. If we can get two workflows running in parallel, we should be able to get this work done in one or two days.
("We" in this case is Zandr plus one or two folks from ops, staffing TBD)
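The "one or two days" estimate follows from simple arithmetic on the observed pace. A rough sketch, assuming each workflow sustains 8 machines per 45 minutes and that workflows run fully in parallel; the function name is hypothetical and the fleet size in the example is made up, since this comment does not give a machine count:

```python
import math

def estimated_hours(machines, workflows=2, batch_size=8, batch_minutes=45):
    """Rough wall-clock hours to upgrade `machines` hosts.

    Assumes each workflow sustains the observed pace of `batch_size`
    machines per `batch_minutes`, and that workflows run fully in parallel.
    """
    if machines < 0 or workflows < 1:
        raise ValueError("machines must be >= 0 and workflows >= 1")
    per_workflow = math.ceil(machines / workflows)
    return per_workflow * (batch_minutes / batch_size) / 60.0

# 160 is a made-up fleet size, not a number taken from this bug.
print(round(estimated_hours(160), 1))
```

The estimate ignores racking time, smoke tests, and waits for machines to go idle, so it is best read as a lower bound.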
Comment 119•13 years ago
(In reply to comment #117)
> (In reply to comment #116)
> > 1) on Tuesday (21 June) ix systems tried changing fan/cooling in 4 of the
> > ix machines in colo. 3 of 4 machines started to work much better. zandr to
> > reconvene with vendor and decide what next.
>
> 10, actually. 4 had been tested as of the conversation Thursday. Worked
> through the rest this evening. Excerpted from the update I just sent to iX:
>
> We have successfully tested 7 of the 10 machines. Of those 7, 6 are within
> spec:
>
> 4712/A1-16111 119MB/s (WD RE4)
> 4719/A1-16117 128MB/s (WD RE4)
> 4731/A1-16130 129MB/s (WD RE4)
> 4735/A1-16134 127MB/s (Seagate Barracuda 7200.12)
> 4739/A1-16138 117MB/s (Seagate Barracuda 7200.12)
> 4743/A1-16142 121MB/s (Seagate Barracuda 7200.12)
>
> One is slow:
> 4715/A1-16114 47MB/s (WD RE4)
>
> Two have hardware problems:
> 4708/A1-16107 This one still turns itself off during boot. It was returned
> for repair with this complaint when you were on site, still has the problem.
>
> 4747/A1-16146 Neither the BIOS nor the OS detect a hard disk at all. I don't
> believe this machine has been back for repair yet, so it could simply be a
> bad Seagate drive.
>
> 4809/A1-16208 I can't reach this machine at all (host or IPMI) I'll verify
> power and network next time I'm on site. (could be tonight)
>
> So we still have one machine that wasn't fixed by the new heatsink/fan
> assembly. Not sure what to make of that. The results otherwise are
> encouraging, even the Seagate Desktop drives seem happy.
>
If I read this correctly, of the 10 machines with new heatsinks:
6 fixed
1 still slow
1 unable to boot
1 bad drive
1 unreachable
It's unclear if I should be worried by this 40% fail rate.
Is the plan to understand the 40% before we start the work in bug#668395? Or is the plan to do the work in bug#668395 so that at least 60% are ok, and then investigate the remaining 40%?
Assignee
Comment 120•13 years ago
(In reply to comment #119)
> It's unclear if I should be worried by this 40% fail rate.
You should not, because 3 of the 4 failures are entirely unrelated to the heatsink. The only one that would have anything to do with the heatsink is the one that's still slow.
> Is the plan to understand the 40% before we start the work in bug#668395?
Yes, 3 of the 4 were dropped off at iX, as described in the first paragraph of comment #118:
> I suspect that 4715 is
> being affected by its neighbors. I've sent it back to iX, but I have every
> reason to expect that it will test fine at iX. 4708 and 4747 are also at iX
> for repair.
Returning to comment #119:
> Or is the plan to do the work in bug#668395 so that at least 60% are ok, and
> then investigate the remaining 40%?
I don't think extrapolating that failure rate is in any way valid.
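The difference between the two readings of the same ten machines can be made concrete with a small tally. The outcomes below are transcribed from the results in comment 117; the outcome labels and the function name are illustrative only.

```python
from collections import Counter

# Outcome per iX serial, transcribed from the test results in comment 117.
outcomes = {
    "4712": "within spec",
    "4719": "within spec",
    "4731": "within spec",
    "4735": "within spec",
    "4739": "within spec",
    "4743": "within spec",
    "4715": "still slow",
    "4708": "powers off during boot",
    "4747": "no disk detected",
    "4809": "unreachable",
}

# Only these outcomes say anything about whether the heatsink/fan swap
# fixed disk throughput; the rest could not be benchmarked at all.
BENCHMARKED = {"within spec", "still slow"}

def heatsink_fix_rate(results):
    """Return (fixed, benchmarked) counts for machines that could be tested."""
    tally = Counter(results.values())
    benchmarked = sum(tally[o] for o in BENCHMARKED)
    return tally["within spec"], benchmarked

fixed, benchmarked = heatsink_fix_rate(outcomes)
print(f"{fixed} of {benchmarked} benchmarked machines within spec")
print(f"{len(outcomes) - fixed} of {len(outcomes)} not confirmed fixed")
```

Counting all ten machines gives the 40% figure; counting only the seven that could be benchmarked leaves a single machine that the new heatsink did not help.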
Comment 121•13 years ago
I'm quite confident, but in an effort to be overly cautious, I'm doing some extra digging. First, I've noted a lot of spurious-interrupt log messages on the six repaired systems in comment 117, but on investigation also found them on fully-functional systems. I have written some explanation up here:
https://bugzilla.mozilla.org/show_bug.cgi?id=652962
or just see
http://lkml.org/lkml/2006/12/27/174
Second, I finally found a decent IO stress tool:
http://weather.ou.edu/~apw/projects/stress/
These are now running in screen sessions on all of the six fixed hosts from comment 117. I've already checked hdparm times, and they are consistently high. I'll keep an eye on these and report any problems.
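Because the spurious-interrupt messages show up on healthy hosts as well, a per-host count is more useful for comparison than as a failure signal. A minimal sketch of such a count, assuming kernel log text is available as a string; the sample lines are made up, and real kernel messages carry more detail after the quoted phrase.

```python
import re
from collections import Counter

# Matches the message text quoted in this bug, e.g. "ata1: spurious interrupt".
SPURIOUS_RE = re.compile(r"\b(ata\d+): spurious interrupt")

def spurious_interrupts_by_port(log_text):
    """Count 'spurious interrupt' messages per ATA port in kernel log text."""
    return Counter(SPURIOUS_RE.findall(log_text))

# Made-up log excerpt for illustration; real kernel lines carry extra detail.
sample_log = """\
ata1: spurious interrupt
some unrelated line
ata1: spurious interrupt
ata3: spurious interrupt
"""
print(dict(spurious_interrupts_by_port(sample_log)))
```

Running the same count on a repaired host and on a known-good host gives a quick side-by-side view of whether the message rate differs at all.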
Comment 122•13 years ago
(In reply to comment #120)
> (In reply to comment #119)
>
> > It's unclear if I should be worried by this 40% fail rate.
> You should not, because 3 of the 4 failures are entirely unrelated to the
> heatsink. The only one that would have anything to do with the heatsink is
> the one that's still slow.
> > Is the plan to understand the 40% before we start the work in bug#668395?
>
> Yes, 3 of the 4 were dropped off at iX, as described in the first paragraph
> of comment #118:
ok, so 3 of the 10 had unrelated hardware problems.
What about the 1 of 10 which didn't have known hardware problems, yet wasn't fixed by the heatsink change?
(I agree having 6 fixed by the heatsink change is a big improvement over today. I just want to make sure we understand all the details and have the same expectations before the proposed big-bulk-repair project starts.)
Assignee
Comment 123•13 years ago
> What about the 1 of 10 which didn't have known hardware problems, yet wasn't
> fixed by the heatsink change?
Specifically which one are you referring to?
Comment 124•13 years ago
John, it sounds like you're asking about 4715, and that's already in the bug (twice, actually). From comment 118 and comment 120:
> The testing we've done seems to validate the fix. I suspect that 4715
> is being affected by its neighbors. I've sent it back to iX, but I
> have every reason to expect that it will test fine at iX.
To the broader point, yes, we understand all the details and have looked (in rather excruciating depth) at the worst- and best-case scenarios here. There's stress-testing ongoing in bug 668395. Some portion of the known-bad hosts may still be bad after the HSF repair. The experience with this set of 10 informs the confidence intervals - in my mind, worst case is 80% success rate, best case is 95%, but reasonable people support other numbers :)
As for known-good hardware, it's hard to see how this change could have a significant negative effect, but we will be watching for such during the repair work.
If you're interested in more details, I can supply those to you offline, but the important point in the bug is that we have considered them.
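For readers who want a feel for how much a sample this small can say, one standard option is a Wilson score interval on the 6-of-7 result from comment 117. This is not the method behind the 80% and 95% figures above, which are stated as judgment; it is a sketch under the assumption that the benchmarked machines behave like independent trials, which shared-rack vibration may well violate.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2)) / denom
    return center - half, center + half

# 6 of the 7 machines that could be benchmarked were within spec.
low, high = wilson_interval(6, 7)
print(f"{low:.2f} to {high:.2f}")
```

With only seven trials the interval is wide, which is consistent with the point that reasonable people can support different numbers.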
Comment 125•13 years ago
I haven't seen any ill effects from the stress testing, and read speeds per hdparm are still in the low 100s (MB/s), even with the stress tests running.
The test in question is
cd /builds/ && ./stress --io 2 --hdd 2 --hdd-bytes 10GB
by the way.
I'll leave the tests running over the (5-day for me) weekend.
Comment 126•13 years ago
(In reply to comment #123)
> > What about the 1 of 10 which didn't have known hardware problems, yet wasn't
> > fixed by the heatsink change?
>
> Specifically which one are you referring to?
Per meeting with zandr on Thursday:
1) the only machine that didn't improve after fixing was 4715.
2) the state of 4715 is not a concern to zandr because he believes it is being impacted by the vibrations of its neighbors. zandr believes that once all the machines have their fans replaced, this collective vibration will reduce and 4715 will start to work properly. This info was not in previous responses, hence the re-ask.
Comment 127•13 years ago
(In reply to comment #126)
> (In reply to comment #123)
> > > What about the 1 of 10 which didn't have known hardware problems, yet wasn't
> > > fixed by the heatsink change?
> >
> > Specifically which one are you referring to?
>
> Per meeting with zandr on Thursday:
>
> 1) the only machine that didn't improve after fixing was 4715.
>
> 2) the state of 4715 is not a concern to zandr because he believes it is
> being impacted by the vibrations of its neighbors. zandr believes that once
> all the machines have their fans replaced, this collective vibration will
> reduce and 4715 will start to work properly. This info was not in previous
> responses, hence the re-ask.
3) Replacing the fans is expected to solve the machine vibration issue. Hence, the drives would explicitly not be upgraded as had been proposed earlier. (A few replacement drives will be on hand, in case dead machines are discovered during the big fan-heatsink-upgrade, but that's a like-with-like replacement.)
Comment 128•13 years ago
I just peeked in on w64-ix-slave25 and its stress testing seems to be doing just fine.
Comment 129•13 years ago
I've stopped the tests now. All were still running without errors.
Assignee
Comment 130•13 years ago
Updates from iX Systems on the three problem machines:
A1-16114 / 04715 - This system's drive was exhibiting fluctuating numbers, so we ended up replacing the drive. Since the swap, the system has been running steady around 130 MB/s.
A1-16107 / 04708 - It took some time but we managed to get this system to reboot. You mentioned it would shut itself down completely? Since the reboot was occurring during our PXE kick-start, we suspected the memory and swapped all of the DIMMs. The system has been burning fine since and is continuing to exhibit stability.
A1-16146 / 04747 - This system had a troublesome drive, so we replaced it with a new unit. We've also upgraded the memory in this system and it is exhibiting stability as well.
Comment 131•13 years ago
Great! Will IX bring those along when we do bug 668395, or sooner?
Comment 132•13 years ago
The problem machines in comment 130 were returned and racked today, and will get new bugs for setup.
More to the point, bug 668395 was completed (it's still open for questions, but it's done), which means that this bug is complete.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 133•13 years ago
We'll need a new bug for the mtv1 iX hosts.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 134•13 years ago
(In reply to comment #133)
> We'll need a new bug for the mtv1 iX hosts.
Yes, but this bug is not dependent on that one, as there are no slow or failing drives in mtv1.
Resolved/Fixed is correct, the slow and failing drives are fixed. Converting the machines in mtv1 is for completeness/consistency, not to fix any operational issues.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations