Closed Bug 596366 (ix-drive-issues) Opened 14 years ago Closed 13 years ago

latest batch of ix machines have slow and failing drives

Categories

(Infrastructure & Operations :: RelOps: General, task, P3)

x86
All
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: zandr)

References

Details

(Whiteboard: [buildslaves][hardware][duptome][subject to embargo])

Attachments

(5 files)

I noticed during the 3.6.10 release that the latest batch of ix machines seems to have slower disks than the other ones. Some timings from hdparm:

[root@mv-moz2-linux-ix-slave03 ~]# hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads: 29532 MB in 2.00 seconds = 14801.87 MB/sec
 Timing buffered disk reads: 360 MB in 3.00 seconds = 119.85 MB/sec
[root@mv-moz2-linux-ix-slave03 ~]# hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads: 29524 MB in 2.00 seconds = 14797.74 MB/sec
 Timing buffered disk reads: 360 MB in 3.01 seconds = 119.80 MB/sec
[root@mv-moz2-linux-ix-slave03 ~]# hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads: 29480 MB in 2.00 seconds = 14776.34 MB/sec
 Timing buffered disk reads: 356 MB in 3.01 seconds = 118.21 MB/sec
---------------
[root@linux-ix-slave07 ~]# hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads: 29336 MB in 1.99 seconds = 14738.42 MB/sec
 Timing buffered disk reads: 256 MB in 3.02 seconds = 84.76 MB/sec
[root@linux-ix-slave07 ~]# hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads: 29336 MB in 1.99 seconds = 14738.13 MB/sec
 Timing buffered disk reads: 262 MB in 3.01 seconds = 86.98 MB/sec
[root@linux-ix-slave07 ~]# hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads: 29332 MB in 1.99 seconds = 14738.71 MB/sec
 Timing buffered disk reads: 270 MB in 3.03 seconds = 89.09 MB/sec

As far as I can tell, they're set up exactly the same as the other ones, down to the hard drive firmware level. The filesystems are ext3, mounted with noatime. I haven't dug further than this.
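For anyone wanting to compare more hosts the same way, here is a minimal sketch that averages the buffered-read figures from the hdparm runs quoted above. The helper names (`buffered_rates`, `mean_buffered_rate`) are made up for illustration, and the regex assumes the "Timing buffered disk reads: ... = N MB/sec" wording shown in this comment.

```python
import re
from statistics import mean

# Buffered-read lines copied from the hdparm -tT runs quoted in this comment.
SAMPLES = {
    "mv-moz2-linux-ix-slave03": """
        Timing buffered disk reads: 360 MB in 3.00 seconds = 119.85 MB/sec
        Timing buffered disk reads: 360 MB in 3.01 seconds = 119.80 MB/sec
        Timing buffered disk reads: 356 MB in 3.01 seconds = 118.21 MB/sec
    """,
    "linux-ix-slave07": """
        Timing buffered disk reads: 256 MB in 3.02 seconds = 84.76 MB/sec
        Timing buffered disk reads: 262 MB in 3.01 seconds = 86.98 MB/sec
        Timing buffered disk reads: 270 MB in 3.03 seconds = 89.09 MB/sec
    """,
}

# Captures the MB/sec figure that follows "=" on a buffered-read line.
BUFFERED_RE = re.compile(r"Timing buffered disk reads:.*?=\s*([\d.]+)\s*MB/sec")


def buffered_rates(text):
    """Return every buffered-read rate (MB/sec) found in hdparm output."""
    return [float(value) for value in BUFFERED_RE.findall(text)]


def mean_buffered_rate(text):
    """Average the buffered-read rates found in hdparm output."""
    return mean(buffered_rates(text))


if __name__ == "__main__":
    for host, output in SAMPLES.items():
        print(f"{host}: {mean_buffered_rate(output):.2f} MB/sec")
```

Averaging three short runs is only a rough indicator; it says nothing about why one host is slower.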
Whiteboard: [buildslaves][hardware]
mrz: I thought these machines were identical to the last batch. What is different about these new ix machines?
Assignee: nobody → mrz
Status: ASSIGNED → NEW
Component: Release Engineering → Server Operations
OS: Mac OS X → All
QA Contact: release → mrz
Nothing.
Assignee: mrz → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
On linux-ix-slave17, hg repeatedly goes into an uninterruptible sleep when cloning mozilla-1.9.2 for 3.6.11 tagging. This breaks tagging and requires a reboot. We should get IT to run diagnostics on at least one of these machines.
Severity: normal → major
Depends on: 601123
I strongly suspect this bug of wasting most of my day today. (An rm -rf of ~17 MB timed out after taking more than 20 minutes; several compiles in a row hit 20-minute timeouts on linux-ix-slave02 but worked perfectly on mv-moz2-linux-ix-slave01.)
I did a quick set of tests and it looks like this might be a more widespread issue affecting all new IX boxes. While nothing else was running on the machines, I ran the following two commands on both the linux and win32 machines:

time hg clone http://hg.mozilla.org/mozilla-central freshclone
time hg clone --pull --uncompressed freshclone copy

I found that the new batch of machines is slower in both cases than the original batch of ix machines. Going by the figures below, on linux-ix-slave02 the first command took roughly 10 times longer than on the old machine, and the local clone took more than 3 times longer. The windows tests showed that the local clone operation took nearly twice as long. The breakdown of real, user and sys times was only available on the linux machines. More detailed results below.

Win32
====================================================
on mw32-ix-slave01, hg clone http://.../mozilla-central freshclone took 12m38
on w32-ix-slave02, hg clone http://.../mozilla-central freshclone took 14m31
on mw32-ix-slave01, hg clone --pull --uncompressed freshclone copy took 10m50
on w32-ix-slave02, hg clone --pull --uncompressed freshclone copy took 18m52

Linux
====================================================
on mv-moz2-linux-ix-slave04, hg clone http://.../mozilla-central freshclone took
real 4m5
user 2m38
sys 0m11

on linux-ix-slave02, hg clone http://.../mozilla-central freshclone took
real 43m27
user 3m0
sys 0m11

on mv-moz2-linux-ix-slave04, hg clone --pull --uncompressed freshclone copy took
real 4m35
user 3m23
sys 0m12

on linux-ix-slave02, hg clone --pull --uncompressed freshclone copy took
real 14m56
user 3m49
sys 0m12
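To make the old-versus-new comparison easier to read, here is a small sketch that turns the wall-clock times posted above into slowdown ratios. The `to_seconds` and `slowdown` helpers and the row labels are illustrative names, not part of any existing tooling; the durations are copied from the results in this comment.

```python
import re


def to_seconds(stamp):
    """Convert a duration written like '43m27' or '4m5' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)", stamp)
    if match is None:
        raise ValueError(f"unrecognised duration: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def slowdown(old, new):
    """Ratio of the new-batch time to the old-batch time."""
    return to_seconds(new) / to_seconds(old)


# (label, old-batch time, new-batch time), copied from the results above.
TIMINGS = [
    ("win32 clone over http", "12m38", "14m31"),
    ("win32 local clone", "10m50", "18m52"),
    ("linux clone over http (real)", "4m5", "43m27"),
    ("linux local clone (real)", "4m35", "14m56"),
]

if __name__ == "__main__":
    for label, old, new in TIMINGS:
        print(f"{label}: {slowdown(old, new):.2f}x slower on the new batch")
```

The user and sys times are nearly identical between batches on linux, so the ratio is only computed on real time here.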
Summary: latest batch of linux ix machines seem to have slower disks → latest batch of ix machines have slow i/o
jabba/jlazaro: from comment #2, mrz asserts the hardware is identical. Are there any diagnostics that can be run on these to explain the performance difference? Or is there anything different about how these machines were imaged? Marking as critical, as it's causing intermittent timeouts/hangs in production.
Assignee: nobody → server-ops
Severity: major → critical
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Assignee: server-ops → jlazaro
Contacted IX support via email, since this is hardware related
linux-ix-slave16 was taken out of production by nthomas last night because it was taking 6 hours for a Linux maple leak test build (clobber). See bug 601623 for the history of linux-ix-slave16 being sick a couple of weeks ago.
See Also: → 601623
Just took mv-moz2-linux-ix-slave02 and linux-ix-slave31 offline to loan out for investigation.
also handed off linux-ix-slave14
Confirming Lukas's comment:

mv-moz2-linux-ix-slave02
linux-ix-slave14
linux-ix-slave31 (scl)

These machines were taken by Chris Williams from IX Systems today to investigate the i/o issues. I will report back when I receive an update from IX.
Assignee: jlazaro → server-ops
Assignee: server-ops → jlazaro
from email with matt@ix systems: They've confirmed performance differences on the few machines they took back from office to test. More debugging ongoing.
Some production fallout in bug 606716.
Blocks: 606716
In order for IX to continue debugging these issues, we'll need to run tests on 37 machines with these serial numbers:

Group A
A1-14132
A1-14134
A1-14136
A1-14138
A1-14139
A1-14141
A1-16051
A1-16063
A1-16094
A1-16098
A1-16105
A1-16188
A1-16189

Group B
A1-14147
A1-14154
A1-14168
A1-14171
A1-14174
A1-14175
A1-16056
A1-16114
A1-16132
A1-16171
A1-16205
A1-16213

Group C
A1-14128
A1-14145
A1-14146
A1-14152
A1-14153
A1-14166
A1-16061
A1-16082
A1-16095
A1-16149
A1-16151
A1-16212

Test:
hdparm -I /dev/sda
hdparm -tT /dev/sda

We're hoping most of the machines from each group are linux machines since we don't have a tool for testing i/o on Windows. Would we need to schedule a downtime for this? Is there a concern that we won't have accurate results if these machines are in active production?
hdparm is available for Windows, iirc. I don't know how to map those serials to hostnames; do we have a way to do that? I can remove the machines from production so that we get mostly-idle testing done.
Attached file hdparm for windows —
Here is hdparm for windows. It requires administrator permissions to run and will require cygwin1.dll to either be in the same directory or in the %PATH% system variable. I have checked and it looks like cygwin1.dll is the only dependency (other than kernel32.dll).
It looks like joduinn/buildduty will be working to get these tests/results to IX. Although these machines are in inventory, the "quick search" option does not allow us to search by serial number (https://bugzilla.mozilla.org/show_bug.cgi?id=607050). We might have a spreadsheet mapping hostnames to serial numbers; I'll forward that to buildduty/joduinn once I find it.
Assignee: jlazaro → joduinn
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Do you have the list of slaves with those serial numbers?
Please throw back to Release Engineering when you have the list.
Assignee: joduinn → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Assignee: server-ops → jlazaro
Group A
A1-14132 mv-moz2-linux-ix-slave12
A1-14134 mw32-ix-slave17
A1-14136 mw32-ix-slave13
A1-14138 mv-moz2-linux-ix-slave15
A1-14139 mv-moz2-linux-ix-slave02
A1-14141 mv-moz2-linux-ix-slave11
A1-16051 w32-ix-slave05
A1-16063 w32-ix-slave17
A1-16094 w32-ix-slave31
A1-16098 w32-ix-slave35
A1-16105 w32-ix-slave42
A1-16188 linux64-ix-slave16
A1-16189 linux64-ix-slave17

Group B
A1-14147 mv-moz2-linux-ix-slave08
A1-14154 mw64-ix-slave01
A1-14168 mw32-ix-slave07
A1-14171 mw32-ix-slave05
A1-14174 mw32-ix-slave10
A1-14175 mw32-ix-slave18
A1-16056 w32-ix-slave10
A1-16114 w64-ix-slave09
A1-16132 w64-ix-slave27
A1-16171 linux-ix-slave41
A1-16205 linux64-ix-slave33
A1-16213 linux64-ix-slave41

Group C
A1-14128 mw32-ix-slave23
A1-14145 mw32-ix-slave11
A1-14146 mw32-ix-slave22
A1-14152 mv-moz2-linux-ix-slave23
A1-14153 mv-moz2-linux-ix-slave19
A1-14166 mv-moz2-linux-ix-slave13
A1-16061 w32-ix-slave15
A1-16082 linux-ix-slave11
A1-16095 w32-ix-slave32
A1-16149 linux-ix-slave19
A1-16151 linux-ix-slave21
A1-16212 linux64-ix-slave40
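Since the inventory quick search cannot look up serial numbers, a list like the one above can be turned into a lookup table with a few lines of script. This is only a sketch: `parse_mapping` and `linux_hosts` are hypothetical helpers, only Group A is included, and treating any hostname containing "linux" as a Linux slave is an assumption based on the naming seen in this bug.

```python
# Serial-to-hostname pairs copied from Group A in the list above.
GROUP_A = """
A1-14132 mv-moz2-linux-ix-slave12
A1-14134 mw32-ix-slave17
A1-14136 mw32-ix-slave13
A1-14138 mv-moz2-linux-ix-slave15
A1-14139 mv-moz2-linux-ix-slave02
A1-14141 mv-moz2-linux-ix-slave11
A1-16051 w32-ix-slave05
A1-16063 w32-ix-slave17
A1-16094 w32-ix-slave31
A1-16098 w32-ix-slave35
A1-16105 w32-ix-slave42
A1-16188 linux64-ix-slave16
A1-16189 linux64-ix-slave17
"""


def parse_mapping(text):
    """Build a {serial: hostname} dict from 'serial hostname' lines."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank lines and group headers
            serial, hostname = parts
            mapping[serial] = hostname
    return mapping


def linux_hosts(mapping):
    """Hostnames that look like Linux slaves (assumed naming convention)."""
    return sorted(host for host in mapping.values() if "linux" in host)


if __name__ == "__main__":
    mapping = parse_mapping(GROUP_A)
    print("A1-14138 is", mapping["A1-14138"])
    print("Linux hosts to run hdparm on first:")
    for host in linux_hosts(mapping):
        print(" ", host)
```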
Assignee: jlazaro → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
I will start looking at this
Assignee: nobody → jhford
[root@mv-moz2-linux-ix-slave12 ~]# hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 5VY0LB7E Firmware Revision: CC45 Transport: Serial Standards: Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 208, current value: 208 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * SATA-I signaling speed (1.5Gb/s) * SATA-II signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power management * Software settings preservation Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count 
supported: enhanced erase 40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT. Checksum: correct [root@mv-moz2-linux-ix-slave12 ~]# hdparm -tT /dev/sda /dev/sda: Timing cached reads: 29384 MB in 1.99 seconds = 14728.98 MB/sec Timing buffered disk reads: 376 MB in 3.00 seconds = 125.13 MB/sec
A1-14134 mw32-ix-slave17 and A1-14136 mw32-ix-slave13 are both unreachable.
[root@mv-moz2-linux-ix-slave15 ~]# hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 5VY17LN3 Firmware Revision: CC45 Transport: Serial Standards: Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 208, current value: 208 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * SATA-I signaling speed (1.5Gb/s) * SATA-II signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power management * Software settings preservation Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count 
supported: enhanced erase 40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT. Checksum: correct [root@mv-moz2-linux-ix-slave15 ~]# hdparm -tT /dev/sda /dev/sda: Timing cached reads: 29336 MB in 2.00 seconds = 14704.02 MB/sec Timing buffered disk reads: 278 MB in 3.00 seconds = 92.62 MB/sec
A1-14139 mv-moz2-linux-ix-slave02 is unreachable
[root@mv-moz2-linux-ix-slave11 ~]# hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 5VY0LAK8 Firmware Revision: CC45 Transport: Serial Standards: Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 208, current value: 208 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * SATA-I signaling speed (1.5Gb/s) * SATA-II signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power management * Software settings preservation Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count 
supported: enhanced erase 42min for SECURITY ERASE UNIT. 42min for ENHANCED SECURITY ERASE UNIT. Checksum: correct [root@mv-moz2-linux-ix-slave11 ~]# hdparm -tT /dev/sda /dev/sda: Timing cached reads: 29480 MB in 1.99 seconds = 14776.95 MB/sec Timing buffered disk reads: 370 MB in 3.01 seconds = 123.04 MB/sec
Whiteboard: [buildslaves][hardware] → [buildslaves][hardware][triagefollowup]
This is pretty important and we need to make progress here. Can we start taking these machines offline in batches, i.e. gracefully shut down one slave from each platform (linux, linux64, win32, win64), run the diagnostics, add those slaves back to the pool, and then move on to the next batch? Also, it's probably not ideal to post the results for each slave in the bug. I'd suggest creating a subdir for the output logs on people.mozilla.com and linking to it from the bug. Not fun, I realize, but required.
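One way to plan the batches described above is to group the hosts by platform and take one from each platform per round. The sketch below does that for the Group B hostnames from this bug; `platform_of` and `plan_batches` are hypothetical helpers, and inferring the platform from the hostname is an assumption, not something inventory confirms.

```python
from itertools import zip_longest

PLATFORMS = ("linux", "linux64", "win32", "win64")

# Group B hostnames, copied from the serial-number mapping in this bug.
GROUP_B = [
    "mv-moz2-linux-ix-slave08", "mw64-ix-slave01", "mw32-ix-slave07",
    "mw32-ix-slave05", "mw32-ix-slave10", "mw32-ix-slave18",
    "w32-ix-slave10", "w64-ix-slave09", "w64-ix-slave27",
    "linux-ix-slave41", "linux64-ix-slave33", "linux64-ix-slave41",
]


def platform_of(hostname):
    """Guess the platform from the hostname (assumed naming convention)."""
    if "linux64" in hostname:  # must be checked before plain "linux"
        return "linux64"
    if "linux" in hostname:
        return "linux"
    if "w64" in hostname:
        return "win64"
    if "w32" in hostname:
        return "win32"
    raise ValueError(f"cannot tell platform of {hostname!r}")


def plan_batches(hostnames):
    """Split hosts into batches holding at most one host per platform."""
    by_platform = {platform: [] for platform in PLATFORMS}
    for host in hostnames:
        by_platform[platform_of(host)].append(host)
    columns = (by_platform[platform] for platform in PLATFORMS)
    # zip_longest pads the shorter platform lists with None; drop the padding.
    return [[h for h in row if h is not None] for row in zip_longest(*columns)]


if __name__ == "__main__":
    for number, batch in enumerate(plan_batches(GROUP_B), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

With this approach the number of batches equals the size of the largest platform group, so the pool never loses more than one slave per platform at a time.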
Priority: -- → P3
Whiteboard: [buildslaves][hardware][triagefollowup] → [buildslaves][hardware]
I really think we should put some time into this bug. Some examples of wrongness I've seen today:
* linux-ix-slave13 taking 2+ hours to do a 1.9.2 unit test build, holding rs up
* w32-ix-slave16 taking 4hrs 20 mins to compile a try opt build

We need some data so that IX knows what to fix.
Maybe we could do this during the mega-downtime this coming Friday.
Assignee: jhford → nobody
(In reply to comment #29)
> maybe we could do this during the mega-downtime this coming friday

Sure, but let's be specific here:

* who's going to be around to do this on Friday, given that many of us are traveling?
* will it be IT or RelEng running the tests?
* is it just hdparm output we're looking for, or are there other tests we could/should be running?
I think I posted this somewhere else, but can't find it right now... Since the IX machines can boot off an image provided via the ipmi interface, we could boot off something like http://www.sysresccd.org/, which provides hdparm.
(In reply to comment #30)
> (In reply to comment #29)
> > maybe we could do this during the mega-downtime this coming friday
>
> Sure, but let's be specific here:
>
> * who's going to be around to do this on Friday, given that many of us are
> traveling?
> * will it be IT or RelEng running the tests?

Debugging this hardware difference is something for IT. Pushing to zandr after talking with him on irc.

> * is it just hdparm output we're looking for, or are there other tests we
> could/should be running?
Assignee: nobody → zandr
zandr: if we need to take (more of) these out of service at some point to get this done, just let me know.
Component: Release Engineering → Server Operations
QA Contact: release → mrz
At this point, I'd like to circle back with iX and see what they're thinking. We've had a number of outright failures, and this is a lot of data collection.
(In reply to comment #31) > I think I posted this somewhere else, but can't find it right now... > > Since the IX machines can boot off an image provided via the ipmi interface, we > could boot off something like http://www.sysresccd.org/, which provides hdparm. Good call, I've put an image at \Users\administrator\public\sysresccd-x86-1.6.4.iso that works well enough. from linux64-ix-slave33: root@sysresccd /root % hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 6VMGX2N0 Firmware Revision: CC38 Transport: Serial Standards: Used: unknown (minor revision code 0x0029) Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 Logical/Physical Sector size: 512 bytes device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) cache/buffer size = 8192 KBytes Nominal Media Rotation Rate: 7200 Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? 
Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE_EXT command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * Gen1 signaling speed (1.5Gb/s) * Gen2 signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power management * Software settings preservation * SMART Command Transport (SCT) feature set * SCT Long Sector Access (AC1) * SCT LBA Segment Access (AC2) * SCT Error Recovery Control (AC3) * SCT Features Control (AC4) * SCT Data Tables (AC5) unknown 206[12] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count supported: enhanced erase 46min for SECURITY ERASE UNIT. 46min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 5000c50027fec700 NAA : 5 IEEE OUI : 000c50 Unique ID : 027fec700 Checksum: correct root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13764 MB in 2.00 seconds = 6898.63 MB/sec Timing buffered disk reads: 290 MB in 3.01 seconds = 96.24 MB/sec root@sysresccd /root %
Known failures on:
linux-ix-slave13 bug 619624
linux-ix-slave33 bug 602288
linux-ix-slave35 bug 602288
w32-ix-slave41 bug 615744
Depends on: 615744, 602288, 619624
linux64-ix-slave40: root@sysresccd /root % hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 9VMKCQF2 Firmware Revision: CC46 Transport: Serial Standards: Used: unknown (minor revision code 0x0029) Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 Logical/Physical Sector size: 512 bytes device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) cache/buffer size = 8192 KBytes Nominal Media Rotation Rate: 7200 Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE_EXT command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * Gen1 signaling speed (1.5Gb/s) * Gen2 signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power 
management * Software settings preservation * SMART Command Transport (SCT) feature set * SCT Long Sector Access (AC1) * SCT LBA Segment Access (AC2) * SCT Error Recovery Control (AC3) * SCT Features Control (AC4) * SCT Data Tables (AC5) unknown 206[12] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count supported: enhanced erase 44min for SECURITY ERASE UNIT. 44min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 5000c50026db6f64 NAA : 5 IEEE OUI : 000c50 Unique ID : 026db6f64 Checksum: correct root@sysresccd /root % hdparm -Tt /dev/sda /dev/sda: Timing cached reads: 13892 MB in 2.00 seconds = 6962.22 MB/sec Timing buffered disk reads: 284 MB in 3.01 seconds = 94.27 MB/sec root@sysresccd /root %
Not on the list, but a reported failure from https://bugzilla.mozilla.org/show_bug.cgi?id=620948#c22 [root@linux-ix-slave34 ~]# hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 6VY7450E Firmware Revision: CC38 Transport: Serial Standards: Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * SATA-I signaling speed (1.5Gb/s) * SATA-II signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power management * Software settings preservation Security: Master password 
revision code = 65534 supported not enabled not locked not frozen not expired: security count supported: enhanced erase 40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT. Checksum: correct [root@linux-ix-slave34 ~]# hdparm -tT /dev/sda /dev/sda: Timing cached reads: 29340 MB in 1.99 seconds = 14740.90 MB/sec Timing buffered disk reads: 58 MB in 3.12 seconds = 18.58 MB/sec
A1-16213 linux64-ix-slave41 root@sysresccd /root % hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 6VMGSAD1 Firmware Revision: CC38 Transport: Serial Standards: Used: unknown (minor revision code 0x0029) Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 Logical/Physical Sector size: 512 bytes device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) cache/buffer size = 8192 KBytes Nominal Media Rotation Rate: 7200 Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE_EXT command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * Gen1 signaling speed (1.5Gb/s) * Gen2 signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power 
management * Software settings preservation * SMART Command Transport (SCT) feature set * SCT Long Sector Access (AC1) * SCT LBA Segment Access (AC2) * SCT Error Recovery Control (AC3) * SCT Features Control (AC4) * SCT Data Tables (AC5) unknown 206[12] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count supported: enhanced erase 44min for SECURITY ERASE UNIT. 44min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 5000c50027f5e723 NAA : 5 IEEE OUI : 000c50 Unique ID : 027f5e723 Checksum: correct root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13676 MB in 2.00 seconds = 6854.45 MB/sec Timing buffered disk reads: 278 MB in 3.01 seconds = 92.45 MB/sec root@sysresccd /root %
A1-16188 linux64-ix-slave16 root@sysresccd /root % hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 5VMF041Q Firmware Revision: CC38 Transport: Serial Standards: Used: unknown (minor revision code 0x0029) Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 Logical/Physical Sector size: 512 bytes device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) cache/buffer size = 8192 KBytes Nominal Media Rotation Rate: 7200 Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE_EXT command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * Gen1 signaling speed (1.5Gb/s) * Gen2 signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power 
management * Software settings preservation * SMART Command Transport (SCT) feature set * SCT Long Sector Access (AC1) * SCT LBA Segment Access (AC2) * SCT Error Recovery Control (AC3) * SCT Features Control (AC4) * SCT Data Tables (AC5) unknown 206[12] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count supported: enhanced erase 40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 5000c50027f25694 NAA : 5 IEEE OUI : 000c50 Unique ID : 027f25694 Checksum: correct root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13468 MB in 2.00 seconds = 6749.66 MB/sec Timing buffered disk reads: 84 MB in 3.02 seconds = 27.78 MB/sec root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13826 MB in 2.00 seconds = 6929.62 MB/sec Timing buffered disk reads: 82 MB in 3.02 seconds = 27.19 MB/sec root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13838 MB in 2.00 seconds = 6935.35 MB/sec Timing buffered disk reads: 84 MB in 3.02 seconds = 27.85 MB/sec root@sysresccd /root %
A1-16189 linux64-ix-slave17 root@sysresccd /root % hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 5VMEYY27 Firmware Revision: CC38 Transport: Serial Standards: Used: unknown (minor revision code 0x0029) Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 Logical/Physical Sector size: 512 bytes device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) cache/buffer size = 8192 KBytes Nominal Media Rotation Rate: 7200 Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE_EXT command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * Gen1 signaling speed (1.5Gb/s) * Gen2 signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power 
management * Software settings preservation * SMART Command Transport (SCT) feature set * SCT Long Sector Access (AC1) * SCT LBA Segment Access (AC2) * SCT Error Recovery Control (AC3) * SCT Features Control (AC4) * SCT Data Tables (AC5) unknown 206[12] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count supported: enhanced erase 44min for SECURITY ERASE UNIT. 44min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 5000c50027ecb27d NAA : 5 IEEE OUI : 000c50 Unique ID : 027ecb27d Checksum: correct root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13828 MB in 2.00 seconds = 6930.66 MB/sec Timing buffered disk reads: 288 MB in 3.02 seconds = 95.49 MB/sec root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13764 MB in 2.00 seconds = 6897.89 MB/sec Timing buffered disk reads: 296 MB in 3.02 seconds = 98.16 MB/sec root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13782 MB in 2.00 seconds = 6907.71 MB/sec Timing buffered disk reads: 290 MB in 3.01 seconds = 96.36 MB/sec root@sysresccd /root %
A1-16114 w64-ix-slave09 root@sysresccd /root % hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 6VMGJETD Firmware Revision: CC38 Transport: Serial Standards: Used: unknown (minor revision code 0x0029) Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 Logical/Physical Sector size: 512 bytes device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) cache/buffer size = 8192 KBytes Nominal Media Rotation Rate: 7200 Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE_EXT command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * Gen1 signaling speed (1.5Gb/s) * Gen2 signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power 
management * Software settings preservation * SMART Command Transport (SCT) feature set * SCT Long Sector Access (AC1) * SCT LBA Segment Access (AC2) * SCT Error Recovery Control (AC3) * SCT Features Control (AC4) * SCT Data Tables (AC5) unknown 206[12] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count supported: enhanced erase 46min for SECURITY ERASE UNIT. 46min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 5000c50027e09cf2 NAA : 5 IEEE OUI : 000c50 Unique ID : 027e09cf2 Checksum: correct root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13688 MB in 2.00 seconds = 6859.80 MB/sec Timing buffered disk reads: 316 MB in 3.01 seconds = 104.99 MB/sec root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13646 MB in 2.00 seconds = 6839.39 MB/sec Timing buffered disk reads: 316 MB in 3.01 seconds = 104.98 MB/sec root@sysresccd /root % hdparm -tT /dev/sda /dev/sda: Timing cached reads: 13706 MB in 2.00 seconds = 6869.15 MB/sec Timing buffered disk reads: 314 MB in 3.01 seconds = 104.40 MB/sec root@sysresccd /root %
In comment #3, bhearsum mentions linux-ix-slave17 as a slave encountering this problem, but I don't see it in the list of slaves in comment #20. Is the list in comment #20 meant to be exhaustive?
The list in comment #20 is the list of machines iX systems wanted to sample, not the list of problem machines. Many of them do not have performance problems at all. But I haven't tested a lot of them yet. Spreadsheet showing testing and failures here: https://spreadsheets.google.com/ccc?key=0AqPtmipKTyewdFNHRzVnZ1RhOFNsWmdmbjV4UmRwLXc&authkey=CMHYj7kD&hl=en#gid=0 Will need to work with buildduty to take those machines out of service for testing, but I'm currently prioritizing that below the nagios problem.
Making this the bug for all the ix-slave drive issues.
Whiteboard: [buildslaves][hardware] → [buildslaves][hardware][duptome]
Summary: latest batch of ix machines have slow i/o → latest batch of ix machines have slow and failing drives
Alias: ix-drive-issues
I just linked to this bug: * linux-ix-slave01 (bug 624371) * linux-ix-slave06 (bug 624210)
I also suspect w32-ix-slave23
w32-ix-slave07 is also pretty slow. About 6 hours for a try opt build is well over the expected value. Still in prod at the moment.
zandr: any progress here?
Not really. Going to collate the long list of machines you guys have marked as broken and re-ping iX.
[root@linux-ix-slave16 ~]# hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 6VY6FJND Firmware Revision: CC38 Transport: Serial Standards: Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * SATA-I signaling speed (1.5Gb/s) * SATA-II signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power management * Software settings preservation Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count 
supported: enhanced erase 40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT. Checksum: correct [root@linux-ix-slave16 ~]# hdparm -tT /dev/sda /dev/sda: Timing cached reads: 12560 MB in 2.01 seconds = 6240.93 MB/sec Timing buffered disk reads: 88 MB in 3.01 seconds = 29.20 MB/sec
[root@linux-ix-slave31 ~]# hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 5VMF5BG2 Firmware Revision: CC38 Transport: Serial Standards: Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * SATA-I signaling speed (1.5Gb/s) * SATA-II signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power management * Software settings preservation Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count 
supported: enhanced erase 40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT. Checksum: correct [root@linux-ix-slave31 ~]# hdparm -tT /dev/sda /dev/sda: Timing cached reads: 29332 MB in 1.99 seconds = 14737.70 MB/sec Timing buffered disk reads: 342 MB in 3.00 seconds = 113.82 MB/sec [root@linux-ix-slave31 ~]#
(In reply to comment #11) > Confirming Lukas's comment > > mv-moz2-linux-slave02 > linux-ix-slave14 > > linux-ix-slave31 (scl) > > These machines were taken by Chris Williams from IX Systems today to > investigate the i/o issues, will report back when I receive an update from IX These machines are back, linux-ix-slave14 and 31 are now in scl1. -31 looks fast now. I'll get mv-..slave02 back up today.
Oooh, this is ugly: [root@linux-ix-slave34 ~]# hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: ST3250318AS Serial Number: 6VY7450E Firmware Revision: CC38 Transport: Serial Standards: Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 488397168 device size with M = 1024*1024: 238475 MBytes device size with M = 1000*1000: 250059 MBytes (250 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Recommended acoustic management value: 254, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * DOWNLOAD_MICROCODE SET_MAX security extension * Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * WRITE_{DMA|MULTIPLE}_FUA_EXT * 64-bit World wide name Write-Read-Verify feature set * WRITE_UNCORRECTABLE command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * SATA-I signaling speed (1.5Gb/s) * SATA-II signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Phy event counters Device-initiated interface power management * Software settings preservation Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: 
security count supported: enhanced erase 40min for SECURITY ERASE UNIT. 40min for ENHANCED SECURITY ERASE UNIT. Checksum: correct [root@linux-ix-slave34 ~]# hdparm -tT /dev/sda /dev/sda: Timing cached reads: 29628 MB in 1.99 seconds = 14885.19 MB/sec Timing buffered disk reads: read(2097152) returned 499712 bytes [root@linux-ix-slave34 ~]# hdparm -tT /dev/sda /dev/sda: Timing cached reads: 29336 MB in 1.99 seconds = 14737.86 MB/sec Timing buffered disk reads: read(2097152) returned 503808 bytes [root@linux-ix-slave34 ~]#
From bug 624210: (In reply to comment #6) > Sad, but not very sad. Here's the usual test data for iX: > > [root@linux-ix-slave06 ~]# hdparm -I /dev/sda > > /dev/sda: > > ATA device, with non-removable media > Model Number: ST3250318AS > Serial Number: 9VY95DR1 > Firmware Revision: CC38 > Transport: Serial > Standards: > Supported: 8 7 6 5 > Likely used: 8 > Configuration: > Logical max current > cylinders 16383 16383 > heads 16 16 > sectors/track 63 63 > -- > CHS current addressable sectors: 16514064 > LBA user addressable sectors: 268435455 > LBA48 user addressable sectors: 488397168 > device size with M = 1024*1024: 238475 MBytes > device size with M = 1000*1000: 250059 MBytes (250 GB) > Capabilities: > LBA, IORDY(can be disabled) > Queue depth: 32 > Standby timer values: spec'd by Standard, no device specific minimum > R/W multiple sector transfer: Max = 16 Current = ? > Recommended acoustic management value: 254, current value: 0 > DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 *udma3 udma4 udma5 udma6 > Cycle time: min=120ns recommended=120ns > PIO: pio0 pio1 pio2 pio3 pio4 > Cycle time: no flow control=120ns IORDY flow control=120ns > Commands/features: > Enabled Supported: > * SMART feature set > Security Mode feature set > * Power Management feature set > * Write cache > * Look-ahead > * Host Protected Area feature set > * WRITE_BUFFER command > * READ_BUFFER command > * DOWNLOAD_MICROCODE > SET_MAX security extension > * Automatic Acoustic Management feature set > * 48-bit Address feature set > * Device Configuration Overlay feature set > * Mandatory FLUSH_CACHE > * FLUSH_CACHE_EXT > * SMART error logging > * SMART self-test > * General Purpose Logging feature set > * WRITE_{DMA|MULTIPLE}_FUA_EXT > * 64-bit World wide name > Write-Read-Verify feature set > * WRITE_UNCORRECTABLE command > * {READ,WRITE}_DMA_EXT_GPL commands > * Segmented DOWNLOAD_MICROCODE > * SATA-I signaling speed (1.5Gb/s) > * SATA-II signaling speed (3.0Gb/s) > * Native Command Queueing (NCQ) 
> * Phy event counters > Device-initiated interface power management > * Software settings preservation > Security: > Master password revision code = 65534 > supported > not enabled > not locked > not frozen > not expired: security count > supported: enhanced erase > 42min for SECURITY ERASE UNIT. 42min for ENHANCED SECURITY ERASE UNIT. > Checksum: correct > [root@linux-ix-slave06 ~]# hdparm -tT /dev/sda > > /dev/sda: > Timing cached reads: 29336 MB in 1.99 seconds = 14738.11 MB/sec > Timing buffered disk reads: 280 MB in 3.01 seconds = 93.00 MB/sec
w32-ix-slave41 is incapable of cloning large repositories and feels *very* slow overall.
(In reply to comment #57) > (In reply to comment #11) > > Confirming Lukas's comment > > > > mv-moz2-linux-slave02 > > linux-ix-slave14 > > > > linux-ix-slave31 (scl) > > > > These machines were taken by Chris Williams from IX Systems today to > > investigate the i/o issues, will report back when I receive an update from IX > > These machines are back, linux-ix-slave14 and 31 are now in scl1. -31 looks > fast now. I'll get mv-..slave02 back up today. Not sure what happened exactly, but linux-ix-slave31 was burning at least one build. ~cltbld/.ssh was owned by root and had staging keys. I chowned it to cltbld and did an rsync from linux-ix-slave32:.ssh/ (had to rsync with --rsh="ssh -oBatchMode=no").
Blocks: 624210
No longer depends on: 624210
I've gone back and forth a couple of times with iX Systems about this. They believe that this is caused by vibration from the fans. They had initially suggested that going to lower speed fans would be a suitable solution, but after further discussion we'd rather not reduce cooling in the machines. We're going to replace the fans in known bad machines, including testing the w64 and linux64 stacks, and go from there. The problem is reproducible at iX, and they have adjusted their burn-in procedures to catch this.
Assigning to Spencer to collect data on the machines that aren't in service yet:

w64-ix-slave07 through slave41
linux64-ix-slave01 through slave41

On each of these machines:
1. Boot through IPMI into sysresccd as a Virtual CD Image.
2. Run 'hdparm -I /dev/sda' and 'hdparm -tT /dev/sda' and capture the output.

This is easiest if you set a root password once you've booted into sysresccd and then ssh in. Please run through these machines as you can and put the data in attachments. In particular, we're looking for any machines where hdparm -tT reports buffered disk reads < 90MB/sec.
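For anyone triaging the captured output, the 90MB/sec check can be scripted. The following is a minimal Python sketch, not something used in this bug: the function names, the choice of the median across repeated runs, and the "no-data" label are all assumptions made for illustration. It only relies on the "Timing buffered disk reads" line format shown in the hdparm output pasted in this bug.

```python
import re
import statistics

# Matches the buffered-read line printed by `hdparm -tT`, e.g.
# "Timing buffered disk reads: 84 MB in 3.02 seconds = 27.78 MB/sec"
BUFFERED_RE = re.compile(
    r"Timing buffered disk reads:\s+\d+ MB in\s+[\d.]+ seconds\s+=\s+([\d.]+) MB/sec"
)

THRESHOLD_MB_S = 90.0  # cut-off named in the comment above


def buffered_rates(output):
    """Return every buffered-read rate (MB/sec) found in captured hdparm text."""
    return [float(rate) for rate in BUFFERED_RE.findall(output)]


def classify(output, threshold=THRESHOLD_MB_S):
    """Label a capture as 'ok', 'slow', or 'no-data' (hypothetical labels).

    'no-data' covers captures where hdparm printed no rate at all, such as
    the failed "read(2097152) returned 499712 bytes" case seen in this bug.
    The median is an assumed way to combine repeated runs.
    """
    rates = buffered_rates(output)
    if not rates:
        return "no-data"
    return "slow" if statistics.median(rates) < threshold else "ok"
```

Treating a missing rate as its own category rather than as "ok" is deliberate: a drive that cannot complete the read is at least as suspect as one that reads slowly.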
Assignee: zandr → shui
Component: Server Operations → Server Operations: RelEng
QA Contact: mrz → zandr
Data for w64-ix-slave07 through slave41 have been collected. Started on linux64-ix-slave01 through slave41.
Not sure if this is still the right place to add new things, but w32-ix-slave02 is much slower than the other Windows iX machines. I think it needs a new fan.
Data collected from Linux 64 1-20
Data from Linux 64 21-41
Data from W64 7-22
Data from w64 23-41
Assignee: shui → server-ops-releng
(In reply to comment #52) > w32-ix-slave07 is also pretty slow. About 6 hours for a try opt build is well > over the expected value. Still in prod at the moment. Pulled out.
[subject to embargo] because this involves yanking machines out of racks that are full of production boxes. I'm not willing to risk pulling the wrong cable after walking around a row of racks.
Whiteboard: [buildslaves][hardware][duptome] → [buildslaves][hardware][duptome][subject to embargo]
1) zandr to parse attachments to create list of machines to be pulled and shipped back to IX systems. For everyone else's sanity, a clear list will be posted to this bug! :-) 2) IX systems planning on replacing defective fans on machines.
(In reply to comment #74) > 1) zandr to parse attachments to create list of machines to be pulled and > shipped back to IX systems. For everyone else's sanity, a clear list will be > posted to this bug! :-) (clicked too soon: physically pulling the machines is subject to embargo, but creating the list is not.) > 2) IX systems planning on replacing defective fans on machines.
w32-ix-slave08 has a failed drive "S.M.A.R.T. status BAD" in BIOS
The list of slaves I have with slow IO is:

linux-ix-slave33
linux-ix-slave34
linux-ix-slave35
mv-moz2-linux-ix-slave12
w32-ix-slave08
w32-ix-slave23

I will shut all of these machines down right now. Zandr, if you can union this with your list and post here, I will make sure that any still-running machines are safely shut down.
List of the machines that run below 90MB/s in the hdparm -tT result:

linux64-ix-slave04 89.98 MB/sec
linux64-ix-slave10 69.91 MB/sec
linux64-ix-slave11 21.27 MB/sec
linux64-ix-slave16 73.21 MB/sec
w64-ix-slave07 85.02 MB/sec
w64-ix-slave11 58.65 MB/sec
w64-ix-slave23 92.39 MB/sec
(In reply to comment #78) > List of the machines that run below 90MB/s in the hdparm -tT result > w64-ix-slave23 92.39 MB/sec 92 > 90 ?
So the combined list, not considering 92 as less than 90 (accuracy, accuracy, accuracy!!) is:

linux-ix-slave33 - shut down
linux-ix-slave34 - shut down
linux-ix-slave35 - shut down
linux64-ix-slave04 - not in production, ok to yank power
linux64-ix-slave10 - not in production, ok to yank power
linux64-ix-slave11 - not in production, ok to yank power
linux64-ix-slave16 - not in production, ok to yank power
mv-moz2-linux-ix-slave12 - shut down
w32-ix-slave08 - shut down
w32-ix-slave23 - shut down
w64-ix-slave07 - not in production, ok to yank power
w64-ix-slave11 - not in production, ok to yank power
w64-ix-slave23 - not in production, ok to yank power

so I'm happy to see this bunch uncabled and shipped off at your collective convenience. Then let's close this bug and re-open slave-specific bugs for any subsequent failures we see. Zandr, is there something in place with IT to track slaves that have been sent to IX so that we don't double-send one?
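The combination being discussed (slaves seen slow in production, plus slaves whose hdparm rate falls under the threshold) amounts to a set union with a strict comparison. A small Python sketch using the hostnames and rates quoted above; the variable names are illustrative only:

```python
# Slaves reported with slow IO in production (from the list above).
slow_in_production = {
    "linux-ix-slave33", "linux-ix-slave34", "linux-ix-slave35",
    "mv-moz2-linux-ix-slave12", "w32-ix-slave08", "w32-ix-slave23",
}

# Buffered-read rates (MB/sec) from the hdparm -tT results quoted above.
hdparm_rates = {
    "linux64-ix-slave04": 89.98,
    "linux64-ix-slave10": 69.91,
    "linux64-ix-slave11": 21.27,
    "linux64-ix-slave16": 73.21,
    "w64-ix-slave07": 85.02,
    "w64-ix-slave11": 58.65,
    "w64-ix-slave23": 92.39,
}

THRESHOLD_MB_S = 90.0

# Strictly below the threshold, so 92.39 does not qualify.
below_threshold = {host for host, rate in hdparm_rates.items()
                   if rate < THRESHOLD_MB_S}

# Union of both sources, sorted for a stable, readable list.
combined = sorted(slow_in_production | below_threshold)

for host in combined:
    print(host)
```

With a strict `<` comparison, w64-ix-slave23 (92.39 MB/sec) drops out, while w32-ix-slave23 stays because it comes from the production list rather than the hdparm numbers.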
Also found:

linux64-ix-slave12 wasn't tested (the file by that name contains linux64-ix-slave11's results)
linux64-ix-slave13 wasn't tested (the file does not contain hdparm -tT results)
linux64-ix-slave17 wasn't tested (file contains slave16's results)
linux64-ix-slave26 wasn't tested (file contains slave25's results)
w64-ix-slave29 wasn't tested (file contains slave27's results)
w64-ix-slave31 wasn't tested (file contains slave30's results)
w64-ix-slave36 wasn't tested (file contains slave35's results)
w64-ix-slave37 wasn't tested (file contains slave30's results)

Spencer, please retest these 8 machines.
My unified list, rolling up all the failures I know about (including comment 80):

linux-ix-slave01: bug 624371
linux-ix-slave06: bug 624210
linux-ix-slave13: bug 619624
linux-ix-slave16: comment 8
linux-ix-slave17: comment 55
linux-ix-slave33: bug 620124
linux-ix-slave34: comment 38, comment 58
linux-ix-slave35: bug 620124
linux-ix-slave42: bug 624207
linux64-ix-slave04: comment 78
linux64-ix-slave10: comment 78
linux64-ix-slave11: comment 78
linux64-ix-slave16: comment 78
mv-moz2-linux-ix-slave02:
w32-ix-slave07:
w32-ix-slave08: bug 635416#c31
w32-ix-slave41: bug 615744
w64-ix-slave02: bug 638814
w64-ix-slave07: comment 78
w64-ix-slave11: comment 78

Dustin- could you verify that the additional machines are all out of service?
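Mix-ups like these, where a file named for one machine holds another machine's output, can be spotted mechanically because `hdparm -I` prints each drive's serial number. A minimal Python sketch, assuming the captures have already been loaded into a dict keyed by hostname; the function name and that dict layout are hypothetical and were not part of the process used in this bug:

```python
import re
from collections import defaultdict

# `hdparm -I` prints a line such as "Serial Number: 5VMF041Q".
SERIAL_RE = re.compile(r"Serial Number:\s+(\S+)")


def duplicated_captures(captures):
    """Find captures that share a drive serial number.

    captures: mapping of hostname -> captured `hdparm -I` text.
    Returns (dupes, missing), where dupes maps a serial number to the
    hostnames whose files contain it (only when more than one does),
    and missing lists hostnames whose file has no serial number at all.
    """
    by_serial = defaultdict(list)
    missing = []
    for host, text in sorted(captures.items()):
        match = SERIAL_RE.search(text)
        if match is None:
            missing.append(host)
        else:
            by_serial[match.group(1)].append(host)
    dupes = {serial: hosts for serial, hosts in by_serial.items()
             if len(hosts) > 1}
    return dupes, missing
```

Two hostnames sharing one serial number means at least one of the files was captured from the wrong machine, so both are worth retesting.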
Assignee: server-ops-releng → shui
linux64-ix-slave11 37.61 MB/sec
linux64-ix-slave12 29.97 MB/sec
linux64-ix-slave13 29.97 MB/sec
linux64-ix-slave26 bad drive
(In reply to comment #82) > My unified list, rolling up all the failures I know about (including comment > 80): > mv-moz2-linux-ix-slave02: Typo, that's mv-moz2-linux-ix-slave12 Will comment with complete list when I get the test data back from spencer.
Final consolidated list:

linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
all in production

linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
all shut down and ready to go

linux-ix-slave42: bug 624207 A1-16172 4773 scl1
can't connect, but in staging - yank its power cord

linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
all in staging, and yet to be reimaged - yank cord

mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
all shut down and ready to go

w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
all in staging, and yet to be reimaged - yank cord

shouldn't w64-ix-slave23 be on the list, too?
Sorry, that should have ended with a question: do you want to send these all back together, in which case I'll shut down the production slaves, or will you be batching them, in which case let's leave them up since we're short on slaves right now?
Let's leave the ones that are in production up, and send the rest of these off first. When we get them back online, we can shut down the production machines.
w32-ix-slave23 is also missing from comment #85.
(In reply to comment #89) > w32-ix-slave23 is also missing from comment #85. Per comment 78 and comment 79, it's over the (admittedly arbitrary) 90MB/s threshold, and thus not in the list.
Those comments are for w64-ix-slave23. I'm referring to comment #51, where we noticed it was slow in production, and comment #77.
(In reply to comment #91)
> Those comments are for w64-ix-slave23. I'm referring to comment #51, where we
> noticed it was slow in production, and comment #77.

Similar names are similar. Sigh. Thanks, good catch. Added below.

linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave23: comment 51 A1-16069 4617 scl1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
This is back on my plate to get iX to come collect them. Per conversation with Dustin, we'll leave the ones that are in production running and swap them out after we get the others back from iX.
Assignee: shui → zandr
buildbot-master6, nee w64-ix-slave06, has the same 'ata1: spurious interrupt' messages as linux-ix-slave01 and linux-ix-slave13. I think it should also go back to IX.
Should we also consider running a more in depth test like bonnie++?
(In reply to comment #94)
> buildbot-master6, nee w64-ix-slave06, has the same 'ata1: spurious interrupt'
> messages as linux-ix-slave01 and linux-ix-slave13. I think it should also go
> back to IX.

Roger, I'll put that in the list.

(In reply to comment #95)
> Should we also consider running a more in depth test like bonnie++?

If you find performance problems that don't show as errors, we should look at it, but I don't see any point in going back through this list of machines and doing more testing. iX has improved their burn-in procedures to catch these sorts of problems, and these machines will all go through burn-in before they come back from repair.
linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave23: comment 51 A1-16069 4617 scl1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave06: bug 639628#c22 A1-16111 4712 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
[root@linux64-ix-slave41 ~]# hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads: 28928 MB in 1.99 seconds = 14532.52 MB/sec
 Timing buffered disk reads: 160 MB in 3.07 seconds = 52.16 MB/sec
So, current list is:
linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
linux64-ix-slave41: comment 101 A1-16213 4814 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave23: comment 51 A1-16069 4617 scl1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave06: bug 639628#c22 A1-16111 4712 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
Starting to power down the machines that are not in production per comment 86.
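For anyone repeating this check across many slaves, here is a minimal sketch that pulls the buffered read figure out of `hdparm -tT` output. The 100 MB/sec cutoff is a hypothetical value chosen for illustration (it is not a figure from iX or from this bug), and the function names are made up for this example.

```python
import re

# Matches the "buffered disk reads" line printed by `hdparm -tT`.
BUFFERED_RE = re.compile(
    r"Timing buffered disk reads:\s+\d+\s+MB\s+in\s+[\d.]+\s+seconds"
    r"\s+=\s+([\d.]+)\s+MB/sec"
)

# Hypothetical cutoff for illustration only; not a number from this bug.
SLOW_THRESHOLD_MB_S = 100.0


def buffered_read_speed(hdparm_output):
    """Return the buffered disk read speed in MB/sec from hdparm -tT output."""
    match = BUFFERED_RE.search(hdparm_output)
    if match is None:
        raise ValueError("no 'Timing buffered disk reads' line found")
    return float(match.group(1))


def is_slow(hdparm_output, threshold=SLOW_THRESHOLD_MB_S):
    """True if the buffered read speed is below the (assumed) threshold."""
    return buffered_read_speed(hdparm_output) < threshold


# Output from linux64-ix-slave41, as pasted in this bug.
SAMPLE = """/dev/sda:
 Timing cached reads: 28928 MB in 1.99 seconds = 14532.52 MB/sec
 Timing buffered disk reads: 160 MB in 3.07 seconds = 52.16 MB/sec
"""

print(buffered_read_speed(SAMPLE), is_slow(SAMPLE))
```

Only the buffered figure is used because the cached figure is essentially the same on good and bad machines in the samples pasted here.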
Added linux64-ix-slave40:
linux-ix-slave01: bug 624371 A1-16072 4620 scl1
linux-ix-slave06: bug 624210 A1-16077 4625 scl1
linux-ix-slave13: bug 619624 A1-16084 4632 scl1
linux-ix-slave16: comment 8 A1-16087 4635 scl1
linux-ix-slave17: comment 55 A1-16088 4636 scl1
linux-ix-slave33: bug 620124 A1-16163 4674 scl1
linux-ix-slave34: comment 58 A1-16164 4675 scl1
linux-ix-slave35: bug 620124 A1-16165 4676 scl1
linux-ix-slave42: bug 624207 A1-16172 4773 scl1
linux64-ix-slave04: comment 78 A1-16176 4777 scl1
linux64-ix-slave10: comment 78 A1-16182 4783 scl1
linux64-ix-slave11: comment 78 A1-16183 4784 scl1
linux64-ix-slave12: comment 83 A1-16184 4785 scl1
linux64-ix-slave13: comment 83 A1-16185 4786 scl1
linux64-ix-slave16: comment 78 A1-16188 4789 scl1
linux64-ix-slave40: comment 102 A1-16212 4813 scl1
linux64-ix-slave41: comment 101 A1-16213 4814 scl1
mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1
w32-ix-slave07: A1-16053 4601 mtv1
w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1
w32-ix-slave23: comment 51 A1-16069 4617 scl1
w32-ix-slave41: bug 615744 A1-16104 4705 scl1
w64-ix-slave02: bug 638814 A1-16107 4708 scl1
w64-ix-slave06: bug 639628#c22 A1-16111 4712 scl1
w64-ix-slave07: comment 78 A1-16112 4713 scl1
w64-ix-slave11: comment 78 A1-16116 4717 scl1
All of the scl1 machines have been pulled and are awaiting pickup by iX. (should happen today)
This list has been a bit leaky, and we're slowly finding other machines with similar problems. Coming back to this bug later will be a major headache, so we should probably track which machines have seen this treatment in the inventory somehow.
I just noticed that comment 103 suggests we sent the four in-production slaves back as well (linux-ix-slave01, 06, 13, and 16). Is that the case?
(In reply to comment #105) > I just noticed that comment 103 suggests we sent the four in-production slaves > back as well (linux-ix-slave01, 06, 13, and 16). Is that the case? We did. It was the opinion of buildduty at the time that the mtv1 slaves were running well enough post-firewall changes that we could afford to lose the production machines. To your point in comment 104, I'll add the return-from-repair date to the notes field in inventory. As you find other machines, where are you noting them?
(In reply to comment #106) > We did. It was the opinion of buildduty at the time that the mtv1 slaves were > running well enough post-firewall changes that we could afford to lose the > production machines. Excellent, good to know. > To your point in comment 104, I'll add the return-from-repair date to the notes > field in inventory. As you find other machines, where are you noting them? They're slowly getting bugs that aren't actively being duped here. bug 643397 is the first (aside from w64-ix-slave23, the dropping of which from this bug I still haven't seen explained?).
(In reply to comment #107) > They're slowly getting bugs that aren't actively being duped here. bug 643397 > is the first That was filed at quarter to six this morning, it's now just after 9. We may have a small gap on the definition of 'active'. :D That said, there's an opportunity here. New bug for the next batch of failures? > (aside from w64-ix-slave23, the dropping of which from this bug I > still haven't seen explained?). comment 79?
(In reply to comment #108) > (In reply to comment #107) > > > They're slowly getting bugs that aren't actively being duped here. bug 643397 > > is the first > > That was filed at quarter to six this morning, it's now just after 9. We may > have a small gap on the definition of 'active'. :D That said, there's an > opportunity here. New bug for the next batch of failures? "not .. actively" meant I'm not doing it - sorry for any ambiguity here. If you'd like a new tracker bug for the next batch, that sounds good. Open one up and copy me? We'll probably continue to have separate bugs to dupe into it, and won't dupe until we've verified (a) the machine hasn't already had its fans fixed and (b) the issue has a decent probability of being fan-related. > > (aside from w64-ix-slave23, the dropping of which from this bug I > > still haven't seen explained?). > > comment 79? Math is hard. Let's go shopping! /me shuts up about that slave.
It looks like the ETA on the repair run in comment 103 is next Tuesday, March 29. Aki, if they're not back by then, let's put in another call to IX.
Update from iX Systems on 3/28:

Just wanted to give you a brief update on the systems we currently have. It appears most of the short-depth servers will be receiving replacement fans and we did find a few systems with failed components as well. We expect the new parts to be in this week and are anticipating to have the systems returned to you early next week. As for the current state of each system, I have compiled a status list for you below; which also includes your asset ID's, as well as our serial numbers.

A1-16077 - 04625 - Marginal Fan
A1-16213 - 04814 - Marginal Fan
A1-16185 - 04786 - Marginal Fan
A1-16087 - 04635 - Marginal Fan
A1-16084 - 04632 - Marginal Fan
A1-16182 - 04783 - Marginal Fan
A1-16069 - 04617 - Marginal Fan
A1-16188 - 04789 - Marginal Fan
A1-16072 - 04620 - Marginal Fan
A1-16164 - 04765 - Marginal Fan
A1-16212 - 04813 - Marginal Fan
A1-16183 - 04784 - Marginal Fan
A1-14132 - 03121 - Marginal Fan
A1-16053 - 04601 - Marginal Fan
A1-16176 - 04777 - Marginal Fan
A1-16107 - 04708 - Marginal Fan
A1-16184 - 04785 - Marginal Fan
A1-15844 - No Mozilla ID? - Failed Drive (WD6000BLHX-01V7BV0)
A1-16104 - 04705 - Marginal Fan & Failed Drive (ST325018AS)
A1-16116 - 04717 - Marginal Fan & Failed Drive (ST325018AS)
A1-16165 - 04766 - Marginal Fan & Failed Drive (ST325018AS)
A1-16163 - 04764 - Marginal Fan & Failed Drive (ST325018AS)
A1-16111 - 04712 - Marginal Fan & Uncorrectable MCE (Motherboard)
A1-16054 - 04602 - Marginal Fan & Uncorrectable MCE (Motherboard)
A1-16172 - 04773 - Marginal Fan & Correctable MCE (Memory)
A1-16088 - 04636 - No Problems Found, Additional Testing in Process.
A1-16112 - 04713 - No Problems Found, Additional Testing in Process.
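Since the list has been copied by hand several times, it may be worth cross-checking our asset/serial pairs against the ones iX sent back. A minimal sketch, using only a subset of the entries from this bug; the dictionary and function names are made up for this example. On this subset the two lists disagree on three serials, and nothing here says which side is right.

```python
# Asset tag -> serial, as typed in our list earlier in this bug (subset).
MOZILLA_LIST = {
    "A1-16072": "4620",
    "A1-16077": "4625",
    "A1-16163": "4674",
    "A1-16164": "4675",
    "A1-16165": "4676",
    "A1-16172": "4773",
}

# Asset tag -> serial, as reported in the iX update of 3/28 (same subset).
IX_LIST = {
    "A1-16072": "04620",
    "A1-16077": "04625",
    "A1-16163": "04764",
    "A1-16164": "04765",
    "A1-16165": "04766",
    "A1-16172": "04773",
}


def serial_mismatches(ours, theirs):
    """Return {asset: (our_serial, their_serial)} where the serials differ.

    Serials are compared as integers so iX's leading zero is ignored.
    Assets missing from either list are skipped.
    """
    mismatches = {}
    for asset, our_serial in ours.items():
        their_serial = theirs.get(asset)
        if their_serial is None:
            continue
        if int(our_serial) != int(their_serial):
            mismatches[asset] = (our_serial, their_serial)
    return mismatches


for asset, (ours, theirs) in sorted(serial_mismatches(MOZILLA_LIST, IX_LIST).items()):
    print(f"{asset}: we have {ours}, iX has {theirs}")
```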
(In reply to comment #111) > Update from iX Systems on 3/28: > ... > We expect the new > parts to be in this week and are anticipating to have the systems returned to > you early next week. Any news?
These were re-racked and re-imaged last night. I'll track bringing those slaves back in bug 650335. I have the following slaves marked as being down and awaiting a second repair trip: mv-moz2-linux-ix-slave12: A1-14132 3121 mtv1 w32-ix-slave07: A1-16053 4601 mtv1 w32-ix-slave08: bug 635416#c31 A1-16054 4602 mtv1 So let's start piling on to put together a second list and send it off.
At zandr's discretion, we could add to the systems in comment 113: linux-ix-slave29 (bug 643397) linux64-ix-slave03 (bug 648528) buildbot-master1 (bug 644991) linux64-ix-slave35 (bug 648312) It would also be good to test the machines in bug 637973 before they get activated in their new home.
So, the machines in comment 113 are actually back from repair. w32-ix-slave07 and 08 were installed in SCL1 when they came back, and mv-moz2-linux-ix-slave12 is sitting on the bench in SCL1 waiting for a ride home.
Per meeting with zandr yesterday (this was also discussed last week, but I couldn't find it noted in any other bug, so adding here):
1) On Tuesday (21st June) iX Systems tried changing the fan/cooling in 4 of the ix machines in colo. 3 of 4 machines started to work much better. zandr to reconvene with the vendor and decide what's next.
2) We've already verified that the ix machines in 650castro are from both batch1 and batch2. This rules out the "batch2 has all bad disks" theory.
3) New theory is about vibration because of a difference in floor, rack or chassis between the two locations.
3a) zandr to try moving "working" ix machines from here to scl1 and see if the machine stops working.
3b) zandr to try moving a "broken" ix machine from scl1 to 650castro to see if the machine starts working.
3c) zandr to confirm with the colo vendor how the racks are mounted to/through the raised floor (some side theories about whether the vibration problems for disks are caused by chassis design, how the chassis is mounted to racks, or how racks are mounted to the floor).
(In reply to comment #116)
> 1) on Tuesday (21stjune) ix systems tried changing fan/cooling in 4 of the
> ix machines in colo. 3 of 4 machines started to work much better. zandr to
> reconvene with vendor and decide what next.

10, actually. 4 had been tested as of the conversation Thursday. Worked through the rest this evening. Excerpted from the update I just sent to iX:

We have successfully tested 7 of the 10 machines. Of those 7, 6 are within spec:
4712/A1-16111 119MB/s (WD RE4)
4719/A1-16117 128MB/s (WD RE4)
4731/A1-16130 129MB/s (WD RE4)
4735/A1-16134 127MB/s (Seagate Barracuda 7200.12)
4739/A1-16138 117MB/s (Seagate Barracuda 7200.12)
4743/A1-16142 121MB/s (Seagate Barracuda 7200.12)

One is slow:
4715/A1-16114 47MB/s (WD RE4)

Two have hardware problems:
4708/A1-16107 This one still turns itself off during boot. It was returned for repair with this complaint when you were on site, still has the problem.
4747/A1-16146 Neither the BIOS nor the OS detect a hard disk at all. I don't believe this machine has been back for repair yet, so it could simply be a bad Seagate drive.

4809/A1-16208 I can't reach this machine at all (host or IPMI). I'll verify power and network next time I'm on site. (could be tonight)

So we still have one machine that wasn't fixed by the new heatsink/fan assembly. Not sure what to make of that. The results otherwise are encouraging, even the Seagate Desktop drives seem happy.

> 3) New theory is about vibration because of difference in floor, rack or
> chassis between the two locations.

Theory has always been vibration. Slab vs. raised floor is an interesting factor.

> 3a) zandr to try moving "working" ix machines from here to scl1 and see if
> the machine stops working.

We have certainly moved lots of iX machines from mtv1 to scl1. I'll see if I can find out which ones (wasn't it all of them?) were previously running well in mtv1.

> 3b) zandr to try moving a "broken" ix machine from scl1 to 650castro to see
> if the machine starts working.

An entertaining, if academic, exercise. Given that we're getting good results from the new HSF arrangement, this is a low priority. mtv1 (not an HA site) and scl2 (with one exception, no releng infra) are our only slab-floor sites. scl3 will be raised floor.

> 3c) zandr to confirm with the colo vendor how the racks are mounted
> to/through the raised floor

As discussed on Thursday, the racks are bolted through the floor tiles to unistrut, and the ends of each row of racks are anchored to the floor, again with unistrut.
The testing we've done seems to validate the fix. I suspect that 4715 is being affected by its neighbors. I've sent it back to iX, but I have every reason to expect that it will test fine at iX. 4708 and 4747 are also at iX for repair.

The plan going forward: iX will order replacement heatsinks for all the machines. Once we have a delivery date, I'll set up a day or two with iX where we'll replace the heatsinks en masse. iX will set up with a couple of guys on-site in scl1. Mozilla will provide 3-4 folks to manage moving machines in and out of racks.

Dustin will take a set of machines out of service to 'prime the pump'. Once those are ready to come out, we'll start pulling machines out of the rack and handing them off to iX. iX will install the new HSF and the upgraded memory, and hand the machines back to us. We'll get them racked back up, and hand them off to arr/Dustin for a quick smoke test and return to service. As those machines come online, new machines will be finishing builds and ready to come out for upgrade.

I expect the downtime for any given machine will be on the order of 15-30 minutes, and in the name of pipelining we might have 5-10 machines off at a time. We can pull machines in any order as they become idle. Paul (from iX) and I did 8 machines like this in something like 45min. If we can get two workflows running in parallel, we should be able to get this work done in one or two days. ("We" in this case is Zandr plus one or two folks from ops, staffing TBD)
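The "one or two days" estimate can be sanity-checked with some back-of-the-envelope arithmetic. This is a rough sketch only: the rate comes from the 8 machines in about 45 minutes mentioned above, while the fleet size of 200 is a purely hypothetical placeholder (the total number of machines is not stated in this bug), and the function name is made up for this example.

```python
def estimate_hours(total_machines, workflows=2, batch_size=8, batch_minutes=45):
    """Estimate hands-on hours to swap heatsinks on total_machines.

    Assumes each workflow sustains the observed rate of batch_size machines
    per batch_minutes, and that workflows run fully in parallel with no
    idle time waiting for machines to finish builds.
    """
    minutes_per_machine = batch_minutes / batch_size
    total_minutes = total_machines * minutes_per_machine / workflows
    return total_minutes / 60


# HYPOTHETICAL fleet size, for illustration only.
FLEET_SIZE = 200

print(f"{estimate_hours(FLEET_SIZE):.1f} hours with two workflows")
print(f"{estimate_hours(FLEET_SIZE, workflows=1):.1f} hours with one workflow")
```

Real throughput would likely be lower than this, since the sketch ignores racking time, smoke tests and waiting for slaves to go idle.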
(In reply to comment #117)
> (In reply to comment #116)
> > 1) on Tuesday (21stjune) ix systems tried changing fan/cooling in 4 of the
> > ix machines in colo. 3 of 4 machines started to work much better. zandr to
> > reconvene with vendor and decide what next.
>
> 10, actually. 4 had been tested as of the conversation Thursday. Worked
> through the rest this evening. Excerpted from the update I just sent to iX:
>
> We have successfully tested 7 of the 10 machines. Of those 7, 6 are within
> spec:
>
> 4712/A1-16111 119MB/s (WD RE4)
> 4719/A1-16117 128MB/s (WD RE4)
> 4731/A1-16130 129MB/s (WD RE4)
> 4735/A1-16134 127MB/s (Seagate Barracuda 7200.12)
> 4739/A1-16138 117MB/s (Seagate Barracuda 7200.12)
> 4743/A1-16142 121MB/s (Seagate Barracuda 7200.12)
>
> One is slow:
> 4715/A1-16114 47MB/s (WD RE4)
>
> Two have hardware problems:
> 4708/A1-16107 This one still turns itself off during boot. It was returned
> for repair with this complaint when you were on site, still has the problem.
>
> 4747/A1-16146 Neither the BIOS nor the OS detect a hard disk at all. I don't
> believe this machine has been back for repair yet, so it could simply be a
> bad Seagate drive.
>
> 4809/A1-16208 I can't reach this machine at all (host or IPMI) I'll verify
> power and network next time I'm on site. (could be tonight)
>
> So we still have one machine that wasn't fixed by the new heatsink/fan
> assembly. Not sure what to make of that. The results otherwise are
> encouraging, even the Seagate Desktop drives seem happy.
>

If I read this correctly, of the 10 machines with new heatsinks:
6 fixed
1 still slow
1 unable to boot
1 bad drive
1 unreachable

It's unclear if I should be worried by this 40% fail rate. Is the plan to understand the 40% before we start the work in bug#668395? Or is the plan to do the work in bug#668395 so that at least 60% are ok, and then investigate the remaining 40%?
(In reply to comment #119) > Its unclear if I should be worried by this 40% fail rate. You should not, because 3 of the 4 failures are entirely unrelated to the heatsink. The only one that would have anything to do with the heatsink is the one that's still slow. > Is the plan to understand the 40% before we start the work in bug#668395? Yes, 3 of the 4 were dropped off at iX, as described in the first paragraph of comment #118: > I suspect that 4715 is > being affected by its neighbors. I've sent it back to iX, but I have every > reason to expect that it will test fine at iX. 4708 and 4747 are also at iX > for repair. Returning to comment #119: > Or is the plan to do the work in bug#668395 so that at least 60% are ok, and > then investigate the remaining 40%? I don't think extrapolating that failure rate is in any way valid.
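To make the two readings of the numbers concrete, here is a small sketch that computes both the raw rate and the rate among machines where the heatsink could actually be judged. The outcome labels and names are made up for this example; the serials and results are the ones quoted above. With only seven testable machines, neither figure should be extrapolated to the whole fleet.

```python
# Outcome per iX serial, transcribed from the update quoted in this bug.
OUTCOMES = {
    "4712": "within spec",
    "4719": "within spec",
    "4731": "within spec",
    "4735": "within spec",
    "4739": "within spec",
    "4743": "within spec",
    "4715": "still slow",
    "4708": "powers off during boot",
    "4747": "no disk detected",
    "4809": "unreachable",
}

# The three failures described above as unrelated to the heatsink.
UNRELATED_TO_HEATSINK = {"4708", "4747", "4809"}


def summarize(outcomes, unrelated):
    """Compute the raw failure rate and the rate among testable machines."""
    total = len(outcomes)
    fixed = sum(1 for result in outcomes.values() if result == "within spec")
    testable = [serial for serial in outcomes if serial not in unrelated]
    still_slow = [s for s in testable if outcomes[s] != "within spec"]
    return {
        "raw_failure_rate": (total - fixed) / total,
        "testable": len(testable),
        "still_slow": still_slow,
        "heatsink_failure_rate": len(still_slow) / len(testable),
    }


summary = summarize(OUTCOMES, UNRELATED_TO_HEATSINK)
print(f"raw failure rate: {summary['raw_failure_rate']:.0%}")
print(f"still slow among {summary['testable']} testable machines: "
      f"{summary['heatsink_failure_rate']:.0%} ({', '.join(summary['still_slow'])})")
```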
I'm quite confident, but in an effort to be overly cautious, I'm doing some extra digging. First, I've noted a lot of spurious-interrupt log messages on the six repaired systems in comment 117, but on investigation also found them on fully-functional systems. I have written some explanation up here: https://bugzilla.mozilla.org/show_bug.cgi?id=652962 or just see http://lkml.org/lkml/2006/12/27/174 Second, I finally found a decent IO stress tool: http://weather.ou.edu/~apw/projects/stress/ These are now running in screen sessions on all of the six fixed hosts from comment 117. I've already checked hdparm times, and they are consistently high. I'll keep an eye on these and report any problems.
(In reply to comment #120) > (In reply to comment #119) > > > Its unclear if I should be worried by this 40% fail rate. > You should not, because 3 of the 4 failures are entirely unrelated to the > heatsink. The only one that would have anything to do with the heatsink is > the one that's still slow. > > Is the plan to understand the 40% before we start the work in bug#668395? > > Yes, 3 of the 4 were dropped off at iX, as described in the first paragraph > of comment #118: OK, so 3 of the 10 had unrelated hardware problems. What about the 1 of 10 which didn't have known hardware problems, yet wasn't fixed by the heatsink change? (I agree having 6 fixed by the heatsink change is a big improvement over today. I just want to make sure we understand all the details and have the same expectations before the proposed big-bulk-repair project starts.)
> What about the 1 of 10 which didnt have known hardware problems, yet wasnt > fixed by the heatsink change? Specifically which one are you referring to?
John, it sounds like you're asking about 4715, and that's already in the bug (twice, actually). From comment 118 and comment 120: > The testing we've done seems to validate the fix. I suspect that 4715 > is being affected by its neighbors. I've sent it back to iX, but I > have every reason to expect that it will test fine at iX. To the broader point, yes, we understand all the details and have looked (in rather excruciating depth) at the worst- and best-case scenarios here. There's stress-testing ongoing in bug 668395. Some portion of the known-bad hosts may still be bad after the HSF repair. The experience with this set of 10 informs the confidence intervals - in my mind, worst case is 80% success rate, best case is 95%, but reasonable people support other numbers :) As for known-good hardware, it's hard to see how this change could have a significant negative effect, but we will be watching for such during the repair work. If you're interested in more details, I can supply those to you offline, but the important point in the bug is that we have considered them.
I haven't seen any ill effects from the stress testing, and read speeds per hdparm are still in the low 100's, even with the stress tests running. The test in question is
    cd /builds/ && ./stress --io 2 --hdd 2 --hdd-bytes 10GB
by the way. I'll leave the tests running over the (5-day for me) weekend.
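If the same stress run needs to be kicked off on several hosts, the command line can be built in one place. A minimal sketch: the function name and the optional `timeout` argument are assumptions added for this example (the run described above used no timeout), and the snippet only builds and prints the argument list rather than starting the test.

```python
import shlex


def stress_command(io_workers=2, hdd_workers=2, hdd_bytes="10GB",
                   timeout=None, binary="./stress"):
    """Build the argv list for the stress invocation used in this bug.

    timeout is an optional extra (passed to stress's --timeout); the
    original run did not set one.
    """
    argv = [
        binary,
        "--io", str(io_workers),
        "--hdd", str(hdd_workers),
        "--hdd-bytes", hdd_bytes,
    ]
    if timeout is not None:
        argv += ["--timeout", str(timeout)]
    return argv


# Prints the same command line as above (minus the `cd /builds/`).
print(shlex.join(stress_command()))
```

To actually run it, the list could be handed to `subprocess.run(..., cwd="/builds")`, ideally inside a screen session as was done here.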
(In reply to comment #123)
> > What about the 1 of 10 which didnt have known hardware problems, yet wasnt
> > fixed by the heatsink change?
>
> Specifically which one are you referring to?

Per meeting with zandr on Thursday:
1) The only machine that didn't improve after fixing was 4715.
2) The state of 4715 is not a concern to zandr because he believes it is being impacted by the vibrations of its neighbors. zandr believes that once all the machines have their fans replaced, this collective vibration will reduce and 4715 will start to work properly. This info was not in previous responses, hence the re-ask.
(In reply to comment #126)
> (In reply to comment #123)
> > > What about the 1 of 10 which didnt have known hardware problems, yet wasnt
> > > fixed by the heatsink change?
> >
> > Specifically which one are you referring to?
>
> Per meeting with zandr on thursday:
>
> 1) the only machine that didnt improve after fixing was 4715.
>
> 2) the state of 4715 is not a concern to zandr because he believes it is
> being impacted by the vibrations of its neighbors. zandr believes that once
> all the machines have their fans replaced, this collective vibration will
> reduce and 4715 will start to work properly. This info was not in previous
> responses, hence the re-ask.

3) Replacing the fans is expected to solve the machine vibration issue. Hence, the drives would explicitly not be upgraded as had been proposed earlier. (A few replacement drives will be on hand, in case dead machines are discovered during the big fan-heatsink-upgrade, but that's a like-with-like replacement.)
I just peeked in on w64-ix-slave25 and its stress testing seems to be doing just fine.
I've stopped the tests now. All were still running without errors.
Updates from iX Systems on the three problem machines:

A1-16114 / 04715 - This system's drive was exhibiting fluctuating numbers, so we ended up replacing the drive. Since the swap, the system has been running steady around 130 Mbps.

A1-16107 / 04708 - It took some time but we managed to get this system to reboot. You mentioned it would shut itself down completely? Since the reboot was occurring during our PXE kick-start, we suspected the memory and swapped all of the dimms. The system has been burning fine since and is continuing to exhibit stability.

A1-16146 / 04747 - This system had a troublesome drive, so we replaced it with a new unit. We've also upgraded the memory in this system and it is exhibiting stability as well.
Great! Will IX bring those along when we do bug 668395, or sooner?
The problem machines in comment 130 were returned and racked today, and will get new bugs for setup. More to the point, bug 668395 was completed (it's still open for questions, but it's done), which means that this bug is complete.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
We'll need a new bug for the mtv1 iX hosts.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #133) > We'll need a new bug for the mtv1 iX hosts. Yes, but this bug is not dependent on that one, as there are no slow or failing drives in mtv1. Resolved/Fixed is correct, the slow and failing drives are fixed. Converting the machines in mtv1 is for completeness/consistency, not to fix any operational issues.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations